SuperWord loop optimization lost after method inlining

Thu Feb 11 14:28:58 UTC 2021

Hi Vladimir,

Thank you for your help.

I'm currently running Java 11.0.9, and I did not use any VM flag of note.

I checked the content of the compilation log, and it seems that
ArrayFloatToArrayFloatVectorBinding::plus() was deoptimized in order to
allow AVector::plus() to be compiled:

<writer thread='11576'/>
<task_queued compile_id='17280' method='com.qfs.vector.impl.AVector plus
(Lcom/qfs/vector/IVector;)V' bytes='23' count='916' iicount='916' level='3'
stamp='7394.056' comment='tiered' hot_count='896'/>
<writer thread='15784'/>
<deoptimized thread='15784' reason='constraint' pc='0x00000296d0785b94'
compile_id='17257' compiler='c1' level='3'>
<jvms bci='65'
method='com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding
plus (Lcom/qfs/vector/IVector;Lcom/qfs/vector/IVector;)V' bytes='69'
count='909' backedge_count='155602' iicount='910'/>
</deoptimized>

The last compilation entry for AVector::plus() is:

<writer thread='16380'/>
<nmethod compile_id='17317' compiler='c2' level='4'
entry='0x00000296d6af32c0' size='1960' address='0x00000296d6af3110'
relocation_offset='376' insts_offset='432' stub_offset='1040'
scopes_data_offset='1152' scopes_pcs_offset='1592'
dependencies_offset='1880' nul_chk_table_offset='1896' oops_offset='1064'
metadata_offset='1072' method='com.qfs.vector.impl.AVector plus
(Lcom/qfs/vector/IVector;)V' bytes='23' count='172425' iicount='172425'
stamp='7394.199'/>
<make_not_entrant thread='16380' compile_id='17280' compiler='c1' level='2'
stamp='7394.199'/>
                              @ 1
com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4 bytes)   inline
(hot)
                               \-> TypeProfile (14552/14552 counts) =
com/qfs/vector/array/impl/ArrayFloatVector
                              @ 7
com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4 bytes)   inline
(hot)
                               \-> TypeProfile (14150/14150 counts) =
com/qfs/vector/array/impl/ArrayFloatVector
                              @ 10
com.qfs.vector.binding.impl.VectorBindings::getBinding (9 bytes)   inline
(hot)
                                @ 5
com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::getBinding
(22 bytes)   inline (hot)
                                  @ 3
com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::hasBinding
(34 bytes)   inline (hot)
                              @ 17
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
bytes)   inline (hot)
                               \-> TypeProfile (14054/14054 counts) =
com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
                                @ 12
com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)   inline (hot)
                                @ 22
com.qfs.vector.impl.AVector::checkIndex (37 bytes)   inline (hot)
                                  @ 6
com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)   inline (hot)
                                @ 27
com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying (5 bytes)
accessor
                                @ 34
com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying (5 bytes)
accessor
<writer thread='15896'/>

Unfortunately, I do not have access to a debug VM build, so I cannot run
the second test you recommend.

Best regards,
Nicolas Heutte

On Wed, Feb 10, 2021 at 7:36 PM Vladimir Kozlov <vladimir.kozlov at oracle.com>
wrote:

> Hi, Nicolas
>
> Looks like, when inlined, the loop from
> ArrayFloatToArrayFloatVectorBinding::plus() was not optimized at all: it is
> not
> unrolled and has range checks. Such loops are not vectorized (you need
> unrolling and no checks).
>
> What Java version you are running? What HotSpot VM flags you are using
> when running application?
>
> Run application with -XX:+LogCompilation and look on compilation data in
> hotspot_pid<PID>.log file for caller
> AVector::plus().
>
> VM also has several flags to trace loop optimizations but they are only
> available in debug VM build. If you have access
> to such build run with -XX:+PrintCompilation -XX:+TraceLoopOpts flags.
>
> Thanks,
> Vladimir K
>
> On 2/10/21 9:24 AM, Nicolas Heutte wrote:
> > Hi all,
> >
> > I am encountering a performance issue caused by the interaction between
> > method inlining and automatic vectorization.
> >
> > Our application aggregates arrays intensively using a method named
> > ArrayFloatToArrayFloatVectorBinding.plus() with the following code:
> >
> >      for (int i = 0; i < srcLen; ++i) {
> >
> >              dstArray[i] += srcArray[i];
> >
> >      }
> >
> > When we microbenchmark this method we observe fast performance close to
> the
> > practical memory bandwidth and when we print the assembly code we observe
> > loop unrolling and automatic vectorization with SIMD instructions.
> >
> >    0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
> >
> >    0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
> >
> >    0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
> >
> >    0x000001ef4600ac05: movslq %r13d,%r11
> >
> >    0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
> >
> >    0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
> >
> >    0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
> >
> >    0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
> >
> >    0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
> >
> >    0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
> >
> >    0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore
> > {reexecute=0 rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> > (line 41)
> >
> >    0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> > (line 40)
> >
> >    0x000001ef4600acc3: cmp    %eax,%r13d
> >
> >    0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> > (line 40)
> >
> >
> >
> > In the real application, this method is actually inlined in a higher
> level
> > method named AVector.plus(). Unfortunately, the inlined version of the
> > aggregation code is not vectorized anymore:
> >
> >
> >
> >    0x000001ef460180a0: cmp    %ebx,%r11d
> >
> >    0x000001ef460180a3: jae    0x000001ef460180e6
> >
> >    0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload
> {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54
> > (line 41)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >
> >    0x000001ef460180ac: cmp    %ecx,%r11d
> >
> >    0x000001ef460180af: jae    0x000001ef46018104
> >
> >    0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
> >
> >    0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore
> {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> > (line 41)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >
> >    0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> > (line 40)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >
> >    0x000001ef460180c2: cmp    %r10d,%r11d
> >
> >    0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> > (line 40)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >
> >
> >
> > This causes a significant performance drop, compared to a run where we
> > explicitly disable the inlining and observe automatically vectorized code
> > again (
> >
> -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
> > ).
> >
> >
> > How do you guys explain that behavior of the JIT compiler? Is this a
> known
> > and tracked issue, could it be fixed in the JVM? Can we do something in
> the
> > java code to prevent this from happening?
> >
> >
> > Best regards,
> >
> > Nicolas Heutte
> >
>