SuperWord loop optimization lost after method inlining
Nicolas Heutte
nhe at activeviam.com
Wed Feb 10 17:24:27 UTC 2021
Hi all,
I am encountering a performance issue caused by the interaction between
method inlining and automatic vectorization.
Our application aggregates arrays intensively using a method named
ArrayFloatToArrayFloatVectorBinding.plus() with the following code:
    for (int i = 0; i < srcLen; ++i) {
        dstArray[i] += srcArray[i];
    }
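A minimal self-contained sketch of this loop, for anyone who wants to reproduce the behavior outside the application. The class name PlusLoopSketch and the method signature are assumptions for illustration only; the loop body is the only part taken from the real ArrayFloatToArrayFloatVectorBinding.plus().

```java
// Hypothetical stand-in for the real binding class; only the loop body
// is taken from the original code.
final class PlusLoopSketch {

    // Adds the first srcLen elements of srcArray into dstArray, element by element.
    static void plus(float[] dstArray, float[] srcArray, int srcLen) {
        for (int i = 0; i < srcLen; ++i) {
            dstArray[i] += srcArray[i];
        }
    }

    public static void main(String[] args) {
        float[] dst = {1f, 2f, 3f};
        float[] src = {10f, 20f, 30f};
        plus(dst, src, src.length);
        System.out.println(java.util.Arrays.toString(dst)); // prints [11.0, 22.0, 33.0]
    }
}
```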
When we microbenchmark this method, throughput is close to the practical
memory bandwidth, and the printed assembly code shows loop unrolling and
automatic vectorization with SIMD instructions:
0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
0x000001ef4600ac05: movslq %r13d,%r11
0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61 (line 41)
0x000001ef4600acbf: add $0x40,%r13d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62 (line 40)
0x000001ef4600acc3: cmp %eax,%r13d
0x000001ef4600acc6: jl 0x000001ef4600abf0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65 (line 40)
In the real application, this method is inlined into a higher-level
method named AVector.plus(). Unfortunately, the inlined version of the
aggregation code is no longer vectorized:
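To make the call shape concrete, here is a rough sketch of a higher-level method delegating to a binding that holds the loop. Every name in it (FloatBindingSketch, ArrayFloatBindingSketch, VectorSketch) is hypothetical, and the structure is only an assumption about how such a delegation could look; it is not the actual AVector code.

```java
// All names here are hypothetical; the real AVector and binding classes differ.
interface FloatBindingSketch {
    void plus(float[] dstArray, float[] srcArray, int srcLen);
}

final class ArrayFloatBindingSketch implements FloatBindingSketch {
    @Override
    public void plus(float[] dstArray, float[] srcArray, int srcLen) {
        for (int i = 0; i < srcLen; ++i) {
            dstArray[i] += srcArray[i];
        }
    }
}

final class VectorSketch {
    private final float[] data;
    private final FloatBindingSketch binding = new ArrayFloatBindingSketch();

    VectorSketch(float[] data) {
        this.data = data;
    }

    // Higher-level entry point. Once this method is hot, the JIT may inline
    // binding.plus() here, so the loop gets compiled as part of this method.
    void plus(VectorSketch other) {
        binding.plus(data, other.data, other.data.length);
    }

    float get(int i) {
        return data[i];
    }
}
```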
0x000001ef460180a0: cmp %ebx,%r11d
0x000001ef460180a3: jae 0x000001ef460180e6
0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1 ;*faload {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54 (line 41)
; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
0x000001ef460180ac: cmp %ecx,%r11d
0x000001ef460180af: jae 0x000001ef46018104
0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61 (line 41)
; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
0x000001ef460180bf: inc %r11d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62 (line 40)
; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
0x000001ef460180c2: cmp %r10d,%r11d
0x000001ef460180c5: jl 0x000001ef460180a0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65 (line 40)
; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
This causes a significant performance drop compared to a run where we
explicitly disable the inlining with
-XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
and observe automatically vectorized code again.
How do you explain this behavior of the JIT compiler? Is this a known
and tracked issue, and could it be fixed in the JVM? Can we do something in
the Java code to prevent this from happening?
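On the Java side, one experiment would be to validate the index range once before the loop, since the scalar listing above performs a bounds check (cmp/jae) before each load. This is only a sketch under assumptions: the class name PlusHoistedChecksSketch is hypothetical, Objects.checkFromIndexSize requires Java 9 or later, and whether this restores vectorization in the inlined case is untested.

```java
import java.util.Objects;

// Hypothetical variant of the loop with the range validated up front.
final class PlusHoistedChecksSketch {

    static void plus(float[] dstArray, float[] srcArray, int srcLen) {
        // Check the whole range [0, srcLen) against both arrays once, before
        // the loop. Throws IndexOutOfBoundsException if it does not fit.
        Objects.checkFromIndexSize(0, srcLen, dstArray.length);
        Objects.checkFromIndexSize(0, srcLen, srcArray.length);
        for (int i = 0; i < srcLen; ++i) {
            dstArray[i] += srcArray[i];
        }
    }
}
```

Comparing the generated assembly with and without the up-front checks would show whether the per-iteration checks are involved at all.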
Best regards,
Nicolas Heutte