Sum of integers optimization
Krystal Mok
rednaxelafx at gmail.com
Tue Oct 31 23:42:56 UTC 2017
Hi Ionut,
tl;dr: C2's infrastructure for optimizing loops could be made a lot stronger,
but judging from the current direction of the OpenJDK community, it's very
unlikely that C2 will receive a major infrastructural upgrade in the future.
If you'd like to contribute to Graal to help optimize this kind of code, I'm
sure a lot of us in the community would love that.
You're right about the code produced by C2. I just ran your example on
JDK 9 on macOS, and the main loop produced by C2 is:
0x0000000118ee6640: movslq %r11d,%r10 ;*i2l {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 12 (line 7)
0x0000000118ee6643: add %r10,%rax
0x0000000118ee6646: add %r10,%rax
0x0000000118ee6649: add %r10,%rax
0x0000000118ee664c: add %r10,%rax
0x0000000118ee664f: add %r10,%rax
0x0000000118ee6652: add %r10,%rax
0x0000000118ee6655: add %r10,%rax
0x0000000118ee6658: add %r10,%rax
0x0000000118ee665b: add %r10,%rax
0x0000000118ee665e: add %r10,%rax
0x0000000118ee6661: add %r10,%rax
0x0000000118ee6664: add %r10,%rax
0x0000000118ee6667: add %r10,%rax
0x0000000118ee666a: add %r10,%rax
0x0000000118ee666d: add %r10,%rax
0x0000000118ee6670: add %r10,%rax
0x0000000118ee6673: add $0x78,%rax ;*ladd {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 13 (line 7)
0x0000000118ee6677: add $0x10,%r11d ;*iinc {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 15 (line 6)
Pretty much the same as what you saw.
It's certainly possible to tweak C2 or another JIT compiler to optimize this
test case better. I don't have a copy of Zing right now, but I believe its
Falcon compiler would compile this down to the N*(N+1)/2 closed form you
expected, since the LLVM it's based on can compile this piece of C code:
#include <stdint.h>

int64_t sum(int n) {
    int64_t sum = 0;
    for (int32_t i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
Down to:
sum:
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
add edi, -2
imul rdi, rax
shr rdi
lea rax, [rdi + 2*rax + 1]
ret
.LBB0_1:
xor eax, eax
ret
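To see why that is the closed form: for n > 0 the code above computes
(n-1)*(n-2)/2 + 2*(n-1) + 1, which simplifies to n*(n+1)/2. Here is a small
Java mirror of that instruction sequence (purely illustrative, the method
name is made up; it's not the output of any compiler):

// Mirrors the LLVM output above, instruction by instruction.
static long closedFormLikeLlvm(int n) {
    if (n <= 0) {
        return 0;                 // the "jle .LBB0_1" path
    }
    long a = n - 1;               // lea eax, [rdi - 1]
    long b = n - 2;               // add edi, -2
    long half = (a * b) >>> 1;    // imul rdi, rax; shr rdi
    return half + 2 * a + 1;      // lea rax, [rdi + 2*rax + 1]
    // (n-1)*(n-2)/2 + 2*(n-1) + 1 == (n*n + n)/2 == n*(n+1)/2
}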
For this test case, C2 could at least do a few things to generate better
code:
1. A better expression canonicalizer that flattens expression trees. The
chain of adds you see in the resulting code is there because the 16x-unrolled
sum += i is turned into:

// 120 == 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15
sum = ((((((((((((((((sum + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + 120
See how the additions involving i are skewed to the left, effectively
degenerating the expression tree into a "linked list of additions". C2's
value numbering, on its own, doesn't recognize that it can reassociate the
expression into a flatter tree, e.g.

((((i + i) + (i + i)) + ((i + i) + (i + i))) + (((i + i) + (i + i)) + ((i + i) + (i + i)))) + sum + 120

in which case C2's value numbering would be able to turn it into:
t1 = i + i
t2 = t1 + t1
t3 = t2 + t2
t4 = t3 + t3
sum = t4 + sum + 120
and then into sum = (i << 4) + sum + 120.
This kind of reassociation would at least improve the loop body, without
involving any complicated loop optimizations. The "tree flattening"
reassociation can actually be implemented by directly linearizing an
expression tree into a C0*X + C1*Y + ... + Ck (linear combination plus
constant) form; a rough sketch of that idea follows.
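For illustration, here's a toy version of that linearization in Java. The
Expr/Var/Const/Add types are invented for this sketch (they are not HotSpot
IR classes); the point is just that walking the add-chain and counting
occurrences recovers the sum + 16*i + 120 form directly:

import java.util.LinkedHashMap;
import java.util.Map;

final class LinearizeSketch {
    // Toy expression IR, invented for this sketch.
    sealed interface Expr permits Var, Const, Add {}
    record Var(String name) implements Expr {}
    record Const(long value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}

    // Flattens a tree of adds into per-variable coefficients plus a constant,
    // i.e. the C0*X + C1*Y + ... + Ck form mentioned above.
    static void collect(Expr e, Map<String, Long> coeffs, long[] constant) {
        if (e instanceof Const c) {
            constant[0] += c.value();
        } else if (e instanceof Var v) {
            coeffs.merge(v.name(), 1L, Long::sum);
        } else if (e instanceof Add a) {
            collect(a.left(), coeffs, constant);
            collect(a.right(), coeffs, constant);
        }
    }

    public static void main(String[] args) {
        // Build the left-skewed chain produced by the 16x unrolling:
        // ((((sum + i) + i) + ... ) + i) + 120, with 16 occurrences of i.
        Expr e = new Var("sum");
        for (int k = 0; k < 16; k++) {
            e = new Add(e, new Var("i"));
        }
        e = new Add(e, new Const(120));

        Map<String, Long> coeffs = new LinkedHashMap<>();
        long[] constant = {0};
        collect(e, coeffs, constant);
        // Prints {sum=1, i=16} and 120, i.e. sum + (i << 4) + 120.
        System.out.println(coeffs + " + " + constant[0]);
    }
}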
To get to the end goal of optimizing the whole loop into the N*(N+1)/2
closed form, you'd need more advanced loop analysis, e.g. something akin to
LLVM's SCEV (scalar evolution), to recognize how "sum" evolves with the loop
induction variable.
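Concretely, what such an analysis would let the compiler do is replace the
loop with its closed form, along the lines of the sketch below (hand-written
to show the target shape, not the output of any existing HotSpot compiler):

// Closed-form equivalent of: for (int i = 1; i <= n; i++) sum += i;
// The widening to long happens before the multiply so n*(n+1) can't
// overflow a 32-bit int.
static long sumClosedForm(int n) {
    if (n <= 0) {
        return 0;
    }
    long m = n;
    return m * (m + 1) / 2;
}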
BTW, Graal from graalvm-0.22 generates a straightforward loop for this case:
XYZ.sum (null) [0x000000010cc091e0, 0x000000010cc09230] 80 bytes
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000000011ebfd768} 'sum' '()J' in 'XYZ'
# [sp+0x10] (sp of caller)
0x000000010cc091e0: nopl 0x0(%rax,%rax,1)
0x000000010cc091e5: mov $0x1,%r10d
0x000000010cc091eb: mov $0x0,%rax
0x000000010cc091f2: jmpq 0x000000010cc0920f
0x000000010cc091f7: nopw 0x0(%rax,%rax,1) ;*if_icmpgt {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 7 (line 6)
0x000000010cc09200: mov %r10d,%r11d
0x000000010cc09203: inc %r11d ;*iinc {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 15 (line 6)
0x000000010cc09206: movslq %r10d,%r10 ;*i2l {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 12 (line 7)
0x000000010cc09209: add %r10,%rax ;*ladd {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 13 (line 7)
0x000000010cc0920c: mov %r11d,%r10d ;*goto {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 18 (line 6)
0x000000010cc0920f: cmp $0x186a1,%r10d
0x000000010cc09216: jl 0x000000010cc09200 ;*if_icmpgt {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 7 (line 6)
0x000000010cc09218: test %eax,-0x1d69218(%rip) #
0x000000010aea0006
; {poll_return}
0x000000010cc0921e: vzeroupper
0x000000010cc09221: retq
- Kris
On Tue, Oct 31, 2017 at 2:59 PM, Ionut <ionutb83 at yahoo.com> wrote:
> Hello All,
>
> I am playing with the example below (very trivial, just computing the sum of
> the integers 1...N):
>
> @Benchmark
> public long sum() {
>     long sum = 0;
>     for (int i = 1; i <= N; i++) {
>         sum += i;
>     }
>     return sum;
> }
>
>
> Generated asm on my machine (snapshot from the main scalar loop):
> ..............................
> ........................
> ↗ 0x00007f4779bff060: movsxd r10,r11d
> │ 0x00007f4779bff063: add rax,r10
> 7.67% 24.83% │ 0x00007f4779bff066: add rax,r10
> 6.11% 3.64% │ 0x00007f4779bff069: add rax,r10
> 4.54% 3.71% │ 0x00007f4779bff06c: add rax,r10
> 6.12% 5.85% │ 0x00007f4779bff06f: add rax,r10
> 5.75% 4.21% │ 0x00007f4779bff072: add rax,r10
> 5.96% 4.38% │ 0x00007f4779bff075: add rax,r10
> 4.23% 3.63% │ 0x00007f4779bff078: add rax,r10
> 6.70% 6.32% │ 0x00007f4779bff07b: add rax,r10
> 7.40% 4.56% │ 0x00007f4779bff07e: add rax,r10
> 4.61% 3.31% │ 0x00007f4779bff081: add rax,r10
> 5.45% 5.24% │ 0x00007f4779bff084: add rax,r10
> 5.99% 5.14% │ 0x00007f4779bff087: add rax,r10
> 7.70% 5.36% │ 0x00007f4779bff08a: add rax,r10
> 5.17% 4.16% │ 0x00007f4779bff08d: add rax,r10
> 3.97% 3.83% │ 0x00007f4779bff090: add rax,r10
> 4.80% 3.97% │ 0x00007f4779bff093: add rax,0x78
>
> 5.92% 5.97% │ 0x00007f4779bff097: add r11d,0x10
> 0.01% │ 0x00007f4779bff09b: cmp r11d,0x5f5e0f2
> ╰ 0x00007f4779bff0a2: jl 0x00007f4779bff060
> ..............................
> ........................
>
> Questions:
> - Would it be possible for the C2 JIT to perform a better optimization in
> this context, for example replacing the main loop (which might be costly)
> with a reduction formula such as N*(N+1)/2 (in this specific case)?
> - Is there any context where the C2 JIT can perform such an optimization
> that I am missing?
> - If not, what prevents it from doing this?
>
> Thanks
> Ionut