SuperWord loop optimization lost after method inlining

Wed Feb 10 17:24:27 UTC 2021

Hi all,

I am encountering a performance issue caused by the interaction between
method inlining and automatic vectorization.

Our application aggregates arrays intensively using a method named
ArrayFloatToArrayFloatVectorBinding.plus() with the following code:

    for (int i = 0; i < srcLen; ++i) {
        dstArray[i] += srcArray[i];
    }
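
For anyone who wants to try the loop in isolation, here is a minimal self-contained sketch. The class name PlusLoopSketch, the array size and the iteration count are placeholders chosen for illustration, not the application's code; only the loop body is taken from the snippet above.

```java
import java.util.Arrays;

// Hypothetical stand-alone holder for the loop; not the real binding class.
final class PlusLoopSketch {

    // Same loop shape as in the message: element-wise dstArray[i] += srcArray[i].
    static void plus(float[] dstArray, float[] srcArray, int srcLen) {
        for (int i = 0; i < srcLen; ++i) {
            dstArray[i] += srcArray[i];
        }
    }

    public static void main(String[] args) {
        // Sizes are arbitrary assumptions for the sketch.
        float[] dst = new float[1 << 16];
        float[] src = new float[1 << 16];
        Arrays.fill(dst, 1.0f);
        Arrays.fill(src, 2.0f);

        // Call the method often enough that the JIT is likely to compile it.
        for (int iter = 0; iter < 20_000; ++iter) {
            plus(dst, src, src.length);
        }

        // Print one element so the work cannot be discarded as dead code.
        System.out.println(dst[0]);
    }
}
```

This is only a crude driver; a proper measurement would use a benchmark harness.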

When we microbenchmark this method in isolation, throughput is close to the
practical memory bandwidth. The printed assembly shows loop unrolling and
automatic vectorization with SIMD instructions: eight vaddps operations on
256-bit ymm registers per iteration, with the index advancing by 0x40 (64
floats) each time.

  0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
  0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
  0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
  0x000001ef4600ac05: movslq %r13d,%r11
  0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
  0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
  0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
  0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
  0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
  0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
  0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
  0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
  0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
  0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
  0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
  0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
  0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
  0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
  0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61 (line 41)
  0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62 (line 40)
  0x000001ef4600acc3: cmp    %eax,%r13d
  0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65 (line 40)

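In the real application, this method is inlined into a higher-level method
named AVector.plus(). Unfortunately, the inlined copy of the aggregation
loop is no longer vectorized. It handles one float per iteration with scalar
vmovss/vaddss instructions, and the two cmp/jae pairs inside the loop look
like array range checks that were not eliminated: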

  0x000001ef460180a0: cmp    %ebx,%r11d
  0x000001ef460180a3: jae    0x000001ef460180e6
  0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@54 (line 41)
                                                ; - com.qfs.vector.impl.AVector::plus@17 (line 204)
  0x000001ef460180ac: cmp    %ecx,%r11d
  0x000001ef460180af: jae    0x000001ef46018104
  0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
  0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61 (line 41)
                                                ; - com.qfs.vector.impl.AVector::plus@17 (line 204)
  0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62 (line 40)
                                                ; - com.qfs.vector.impl.AVector::plus@17 (line 204)
  0x000001ef460180c2: cmp    %r10d,%r11d
  0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65 (line 40)
                                                ; - com.qfs.vector.impl.AVector::plus@17 (line 204)

This causes a significant performance drop compared to a run where we
explicitly disable inlining of this method with the option below and
observe automatically vectorized code again:

    -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
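
To make the comparison easier to try outside our code base, here is a rough sketch of the call shape. Everything in it is an assumption made for illustration: InliningReproSketch, Vec and bindingPlus are hypothetical stand-ins for AVector and the binding class, and the real AVector.plus() does more than this wrapper. I have not confirmed that this reduced form loses vectorization; it only shows how the two runs can be set up.

```java
import java.util.Arrays;

// Hypothetical reduction; names do not match the real com.qfs classes.
final class InliningReproSketch {

    // Stand-in for ArrayFloatToArrayFloatVectorBinding.plus(): the hot loop.
    static void bindingPlus(float[] dstArray, float[] srcArray, int srcLen) {
        for (int i = 0; i < srcLen; ++i) {
            dstArray[i] += srcArray[i];
        }
    }

    // Stand-in for AVector: a wrapper whose plus() delegates to the loop,
    // so the JIT may inline bindingPlus() into Vec.plus().
    static final class Vec {
        final float[] data;

        Vec(float[] data) {
            this.data = data;
        }

        void plus(Vec other) {
            bindingPlus(data, other.data, other.data.length);
        }
    }

    // Run 1 (default inlining decisions, print them):
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InliningReproSketch
    // Run 2 (forbid inlining of the loop method):
    //   java -XX:CompileCommand=dontinline,InliningReproSketch::bindingPlus InliningReproSketch
    // Adding -XX:+PrintAssembly (needs the hsdis library) shows the generated code.
    public static void main(String[] args) {
        Vec dst = new Vec(new float[1 << 16]);
        Vec src = new Vec(new float[1 << 16]);
        Arrays.fill(src.data, 1.0f);

        long start = System.nanoTime();
        for (int iter = 0; iter < 50_000; ++iter) {
            dst.plus(src);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // The timing includes warm-up, so treat it as a rough indication only.
        System.out.println("elapsed ms: " + elapsedMs + ", dst[0] = " + dst.data[0]);
    }
}
```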

How do you explain this behavior of the JIT compiler? Is this a known and
tracked issue, and could it be fixed in the JVM? Is there anything we can do
in the Java code to prevent it from happening?

Best regards,

Nicolas Heutte