SuperWord loop optimization lost after method inlining

Wed Feb 10 18:35:53 UTC 2021

Hi, Nicolas

It looks like the loop from ArrayFloatToArrayFloatVectorBinding::plus() was not optimized at all when inlined: it is not 
unrolled and it still has range checks. Such loops are not vectorized (vectorization needs an unrolled loop with no range checks).
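
For illustration only, here is a minimal sketch of a loop shape that gives the compiler the best chance to drop the 
per-element range checks. The class PlusSketch and its parameter names are hypothetical, not the real 
ArrayFloatToArrayFloatVectorBinding code. The idea is to validate the length against both arrays once, before the loop; 
whether that is enough to restore unrolling in the inlined case is an assumption that only the compilation log can confirm.

```java
final class PlusSketch {
    // Adds src[0..len) into dst[0..len). Hypothetical stand-in for the real plus().
    static void plus(float[] dst, float[] src, int len) {
        // Validate the range once, outside the loop, so that every index used
        // inside the loop is known to be within both arrays.
        if (len < 0 || len > dst.length || len > src.length) {
            throw new IndexOutOfBoundsException("len=" + len);
        }
        // Simple counted loop: int induction variable, stride 1, loop-invariant bound.
        for (int i = 0; i < len; ++i) {
            dst[i] += src[i];
        }
    }

    public static void main(String[] args) {
        float[] dst = {1f, 2f, 3f, 4f};
        float[] src = {10f, 20f, 30f, 40f};
        plus(dst, src, 3);
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```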

Which Java version are you running? Which HotSpot VM flags do you use when running the application?
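
If it is easier to collect this from inside the running application, a small sketch like the one below prints the 
Java version, the VM name and version, and the arguments the JVM was started with. The class name VmInfo is 
hypothetical; the calls are the standard system properties and RuntimeMXBean.getInputArguments().

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class VmInfo {
    // Builds a one-line summary of the running JVM and its startup arguments.
    static String describe() {
        // Arguments passed to the JVM itself, not the arguments of main().
        List<String> vmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        return "java.version=" + System.getProperty("java.version")
                + " | vm=" + System.getProperty("java.vm.name")
                + " " + System.getProperty("java.vm.version")
                + " | vmArgs=" + vmArgs;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```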

Run the application with -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation (LogCompilation is a diagnostic flag) and 
look at the compilation data in the hotspot_pid<PID>.log file for the caller AVector::plus().
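
The log can get large, so a first pass that only keeps the lines mentioning the caller may help. The sketch below is a 
plain substring filter; the class name LogGrep is hypothetical, and it makes no assumption about the exact XML layout of 
the log, which differs between JDK versions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LogGrep {
    // Returns every line of the log file that contains the given substring.
    static List<String> linesMentioning(Path log, String needle) throws IOException {
        // ISO-8859-1 maps every byte to a character, so decoding never fails.
        try (Stream<String> lines = Files.lines(log, StandardCharsets.ISO_8859_1)) {
            return lines.filter(line -> line.contains(needle)).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the hotspot_pid<PID>.log file; args[1]: substring, e.g. "AVector"
        for (String line : linesMentioning(Paths.get(args[0]), args[1])) {
            System.out.println(line);
        }
    }
}
```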

The VM also has several flags to trace loop optimizations, but they are only available in a debug VM build. If you have 
access to such a build, run with the -XX:+PrintCompilation -XX:+TraceLoopOpts flags.

Thanks,
Vladimir K

On 2/10/21 9:24 AM, Nicolas Heutte wrote:
> Hi all,
> 
> I am encountering a performance issue caused by the interaction between
> method inlining and automatic vectorization.
> 
> Our application aggregates arrays intensively using a method named
> ArrayFloatToArrayFloatVectorBinding.plus() with the following code:
> 
>      for (int i = 0; i < srcLen; ++i) {
>          dstArray[i] += srcArray[i];
>      }
> 
> When we microbenchmark this method we observe fast performance close to the
> practical memory bandwidth and when we print the assembly code we observe
> loop unrolling and automatic vectorization with SIMD instructions.
> 
>    0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
>    0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
>    0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
>    0x000001ef4600ac05: movslq %r13d,%r11
>    0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
>    0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
>    0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
>    0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
>    0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
>    0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
>    0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
>    0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
>    0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
>    0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
>    0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
>    0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
>    0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
>    0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
>    0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61 (line 41)
>    0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62 (line 40)
>    0x000001ef4600acc3: cmp    %eax,%r13d
>    0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65 (line 40)
> 
> 
> 
> In the real application, this method is actually inlined in a higher level
> method named AVector.plus(). Unfortunately, the inlined version of the
> aggregation code is not vectorized anymore:
> 
> 
> 
>    0x000001ef460180a0: cmp    %ebx,%r11d
>    0x000001ef460180a3: jae    0x000001ef460180e6
>    0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54 (line 41)
>                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>    0x000001ef460180ac: cmp    %ecx,%r11d
>    0x000001ef460180af: jae    0x000001ef46018104
>    0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
>    0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61 (line 41)
>                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>    0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62 (line 40)
>                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>    0x000001ef460180c2: cmp    %r10d,%r11d
>    0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65 (line 40)
>                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
> 
> 
> 
> This causes a significant performance drop compared to a run where we
> explicitly disable the inlining and observe automatically vectorized code
> again, using:
> -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
> 
> 
> How do you guys explain that behavior of the JIT compiler? Is this a known
> and tracked issue, and could it be fixed in the JVM? Can we do something in
> the Java code to prevent this from happening?
> 
> 
> Best regards,
> 
> Nicolas Heutte
>