Sum of integers optimization
Krystal Mok
rednaxelafx at gmail.com
Tue Oct 31 23:42:56 UTC 2017
Hi Ionut,
tl;dr: C2's infrastructure for optimizing loops could be made a lot stronger,
but judging from the current direction of the OpenJDK community, it's very
unlikely that C2 will receive a major infrastructural upgrade in the future.
If you'd like to contribute to Graal to help optimize this kind of code, I'm
sure a lot of us in the community would love that.
You're right about the code produced by C2. I just ran your example on
JDK 9 on macOS, and the main loop produced by C2 is:
0x0000000118ee6640: movslq %r11d,%r10 ;*i2l {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 12 (line 7)
0x0000000118ee6643: add %r10,%rax
0x0000000118ee6646: add %r10,%rax
0x0000000118ee6649: add %r10,%rax
0x0000000118ee664c: add %r10,%rax
0x0000000118ee664f: add %r10,%rax
0x0000000118ee6652: add %r10,%rax
0x0000000118ee6655: add %r10,%rax
0x0000000118ee6658: add %r10,%rax
0x0000000118ee665b: add %r10,%rax
0x0000000118ee665e: add %r10,%rax
0x0000000118ee6661: add %r10,%rax
0x0000000118ee6664: add %r10,%rax
0x0000000118ee6667: add %r10,%rax
0x0000000118ee666a: add %r10,%rax
0x0000000118ee666d: add %r10,%rax
0x0000000118ee6670: add %r10,%rax
0x0000000118ee6673: add $0x78,%rax ;*ladd {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 13 (line 7)
0x0000000118ee6677: add $0x10,%r11d ;*iinc {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 15 (line 6)
Pretty much the same as what you saw.
It's certainly possible to tweak C2 or another JIT compiler to optimize this
test case better. I don't have a copy of Zing right now, but I believe its
Falcon compiler would compile this down to the N*(N+1)/2 closed form you
expected, since the LLVM it's based on can compile this piece of C code:
#include <stdint.h>

int64_t sum(int n) {
    int64_t sum = 0;
    for (int32_t i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
Down to:
sum:
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
add edi, -2
imul rdi, rax
shr rdi
lea rax, [rdi + 2*rax + 1]
ret
.LBB0_1:
xor eax, eax
ret
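To see why that is the closed form: for n > 0 the code above computes
(n-1)*(n-2)/2 + 2*(n-1) + 1, which simplifies to n*(n+1)/2. Here is a small
Java mirror of that instruction sequence (purely illustrative, the method
name is made up; it's not the output of any compiler):

// Mirrors the LLVM output above, instruction by instruction.
static long closedFormLikeLlvm(int n) {
    if (n <= 0) {
        return 0;                 // the "jle .LBB0_1" path
    }
    long a = n - 1;               // lea eax, [rdi - 1]
    long b = n - 2;               // add edi, -2
    long half = (a * b) >>> 1;    // imul rdi, rax; shr rdi
    return half + 2 * a + 1;      // lea rax, [rdi + 2*rax + 1]
    // (n-1)*(n-2)/2 + 2*(n-1) + 1 == (n*n + n)/2 == n*(n+1)/2
}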
For this test case, C2 could at least do a few things to generate better
code:
1. A better expression canonicalizer that flattens expression trees. The
chain of adds you see in the resulting code is there because the 16x-unrolled
sum += i is turned into:

// 120 == 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15
sum = ((((((((((((((((sum + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + 120
See how the additions involving i are skewed to the left, effectively
degenerating the expression tree into a "linked list of additions". C2's
value numbering, on its own, doesn't recognize that it can reassociate the
expression into a flatter tree, e.g.

((((i + i) + (i + i)) + ((i + i) + (i + i))) + (((i + i) + (i + i)) + ((i + i) + (i + i)))) + sum + 120

in which case C2's value numbering would be able to turn it into:
t1 = i + i
t2 = t1 + t1
t3 = t2 + t2
t4 = t3 + t3
sum = t4 + sum + 120
and then into sum = (i << 4) + sum + 120.
This kind of reassociation would at least improve the loop body, without
involving any complicated loop optimizations. The "tree flattening"
reassociation can actually be implemented by directly linearizing an
expression tree into a C0*X + C1*Y + ... + Ck (linear combination plus
constant) form; a rough sketch of that idea follows.
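For illustration, here's a toy version of that linearization in Java. The
Expr/Var/Const/Add types are invented for this sketch (they are not HotSpot
IR classes); the point is just that walking the add-chain and counting
occurrences recovers the sum + 16*i + 120 form directly:

import java.util.LinkedHashMap;
import java.util.Map;

final class LinearizeSketch {
    // Toy expression IR, invented for this sketch.
    sealed interface Expr permits Var, Const, Add {}
    record Var(String name) implements Expr {}
    record Const(long value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}

    // Flattens a tree of adds into per-variable coefficients plus a constant,
    // i.e. the C0*X + C1*Y + ... + Ck form mentioned above.
    static void collect(Expr e, Map<String, Long> coeffs, long[] constant) {
        if (e instanceof Const c) {
            constant[0] += c.value();
        } else if (e instanceof Var v) {
            coeffs.merge(v.name(), 1L, Long::sum);
        } else if (e instanceof Add a) {
            collect(a.left(), coeffs, constant);
            collect(a.right(), coeffs, constant);
        }
    }

    public static void main(String[] args) {
        // Build the left-skewed chain produced by the 16x unrolling:
        // ((((sum + i) + i) + ... ) + i) + 120, with 16 occurrences of i.
        Expr e = new Var("sum");
        for (int k = 0; k < 16; k++) {
            e = new Add(e, new Var("i"));
        }
        e = new Add(e, new Const(120));

        Map<String, Long> coeffs = new LinkedHashMap<>();
        long[] constant = {0};
        collect(e, coeffs, constant);
        // Prints {sum=1, i=16} and 120, i.e. sum + (i << 4) + 120.
        System.out.println(coeffs + " + " + constant[0]);
    }
}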
To get to the end goal of optimizing the whole loop into the N*(N+1)/2
closed form, you'd need more advanced loop analysis, e.g. something akin to
LLVM's SCEV (scalar evolution), to recognize how "sum" evolves with the loop
induction variable.
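Concretely, what such an analysis would let the compiler do is replace the
loop with its closed form, along the lines of the sketch below (hand-written
to show the target shape, not the output of any existing HotSpot compiler):

// Closed-form equivalent of: for (int i = 1; i <= n; i++) sum += i;
// The widening to long happens before the multiply so n*(n+1) can't
// overflow a 32-bit int.
static long sumClosedForm(int n) {
    if (n <= 0) {
        return 0;
    }
    long m = n;
    return m * (m + 1) / 2;
}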
BTW, Graal from graalvm-0.22 generates a straightforward loop for this case:
XYZ.sum (null) [0x000000010cc091e0, 0x000000010cc09230] 80 bytes
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000000011ebfd768} 'sum' '()J' in 'XYZ'
# [sp+0x10] (sp of caller)
0x000000010cc091e0: nopl 0x0(%rax,%rax,1)
0x000000010cc091e5: mov $0x1,%r10d
0x000000010cc091eb: mov $0x0,%rax
0x000000010cc091f2: jmpq 0x000000010cc0920f
0x000000010cc091f7: nopw 0x0(%rax,%rax,1) ;*if_icmpgt {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 7 (line 6)
0x000000010cc09200: mov %r10d,%r11d
0x000000010cc09203: inc %r11d ;*iinc {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 15 (line 6)
0x000000010cc09206: movslq %r10d,%r10 ;*i2l {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 12 (line 7)
0x000000010cc09209: add %r10,%rax ;*ladd {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 13 (line 7)
0x000000010cc0920c: mov %r11d,%r10d ;*goto {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 18 (line 6)
0x000000010cc0920f: cmp $0x186a1,%r10d
0x000000010cc09216: jl 0x000000010cc09200 ;*if_icmpgt {reexecute=0
rethrow=0 return_oop=0}
; - XYZ::sum at 7 (line 6)
0x000000010cc09218: test %eax,-0x1d69218(%rip) #
0x000000010aea0006
; {poll_return}
0x000000010cc0921e: vzeroupper
0x000000010cc09221: retq
- Kris
On Tue, Oct 31, 2017 at 2:59 PM, Ionut <ionutb83 at yahoo.com> wrote:
> Hello All,
>
> I am playing with the example below (very trivial, just computing the sum of
> the integers 1...N):
>
> @Benchmark
> public long sum() {
>     long sum = 0;
>     for (int i = 1; i <= N; i++) {
>         sum += i;
>     }
>     return sum;
> }
>
>
> Generated asm on my machine (snapshot from the main scalar loop):
> ..............................
> ........................
> ↗ 0x00007f4779bff060: movsxd r10,r11d
> │ 0x00007f4779bff063: add rax,r10
> 7.67% 24.83% │ 0x00007f4779bff066: add rax,r10
> 6.11% 3.64% │ 0x00007f4779bff069: add rax,r10
> 4.54% 3.71% │ 0x00007f4779bff06c: add rax,r10
> 6.12% 5.85% │ 0x00007f4779bff06f: add rax,r10
> 5.75% 4.21% │ 0x00007f4779bff072: add rax,r10
> 5.96% 4.38% │ 0x00007f4779bff075: add rax,r10
> 4.23% 3.63% │ 0x00007f4779bff078: add rax,r10
> 6.70% 6.32% │ 0x00007f4779bff07b: add rax,r10
> 7.40% 4.56% │ 0x00007f4779bff07e: add rax,r10
> 4.61% 3.31% │ 0x00007f4779bff081: add rax,r10
> 5.45% 5.24% │ 0x00007f4779bff084: add rax,r10
> 5.99% 5.14% │ 0x00007f4779bff087: add rax,r10
> 7.70% 5.36% │ 0x00007f4779bff08a: add rax,r10
> 5.17% 4.16% │ 0x00007f4779bff08d: add rax,r10
> 3.97% 3.83% │ 0x00007f4779bff090: add rax,r10
> 4.80% 3.97% │ 0x00007f4779bff093: add rax,0x78
>
> 5.92% 5.97% │ 0x00007f4779bff097: add r11d,0x10
> 0.01% │ 0x00007f4779bff09b: cmp r11d,0x5f5e0f2
> ╰ 0x00007f4779bff0a2: jl 0x00007f4779bff060
> ..............................
> ........................
>
> Questions:
> - Would it be possible for the C2 JIT to perform a better optimization in
> this context, for example replacing the main loop (which might be costly)
> with a reduction formula such as N*(N+1)/2 (in this specific case)?
> - Is there any context where the C2 JIT can perform such an optimization
> that I am missing?
> - If not, what prevents it from doing this?
>
> Thanks
> Ionut