[External] : Re: SuperWord loop optimization lost after method inlining
Nicolas Heutte
nhe at activeviam.com
Wed Feb 17 10:34:04 UTC 2021
Hi Vladimir,
I have rerun the test with the appropriate options, the obtained logs are
in this folder:
https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing
Best regards,
Nicolas Heutte
On Tue, Feb 16, 2021 at 11:35 PM Vladimir Kozlov <vladimir.kozlov at oracle.com>
wrote:
> Hi Nicolas,
>
> The file you shared has only assembler code. Yes, it shows that when
> ArrayFloatToArrayFloatVectorBinding::plus() is
> inlined into AVector::plus() it is not vectorized.
>
> But I asked for an other file (hotspot_pid<PID>.log) which is generated
> when you run app with
> -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation flags. It should start
> with:
>
> <?xml version='1.0' encoding='UTF-8'?>
> <hotspot_log version='160 1' process='2302014' time_ms='1613514688748'>
> <vm_version>
> <name>
> Java HotSpot(TM) 64-Bit Server VM
> </name>
> <release>
> 11.0.9+7-LTS
> </release>
>
> Thanks,
> Vladimir K
>
> On 2/15/21 5:19 AM, Nicolas Heutte wrote:
> > Hi Vladimir,
> >
> > I've tried disabling tiered compilation, as you requested. It seems that
> the inlining was performed slightly
> > differently, but the issue remains. As you can see in this excerpt, the
> main loop isn't properly vectorized:
> >
> > 0x00000254b0d4bf54: cmp %r11d,%r8d
> > 0x00000254b0d4bf57: jae 0x00000254b0d4c19e
> > 0x00000254b0d4bf5d: vmovss 0x10(%rcx,%r8,4),%xmm9 ;*faload
> {reexecute=0 rethrow=0 return_oop=0}
> > ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54
> (line 41)
> > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2 (line 103)
> > ; -
> > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> > ; -
> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27 (line 118)
> >
> > 0x00000254b0d4bf64: cmp %ebx,%r8d
> > 0x00000254b0d4bf67: jae 0x00000254b0d4c1ec
> > 0x00000254b0d4bf6d: vaddss 0x10(%rdi,%r8,4),%xmm9,%xmm9
> > 0x00000254b0d4bf74: vmovss %xmm9,0x10(%rcx,%r8,4) ;*fastore
> {reexecute=0 rethrow=0 return_oop=0}
> > ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> (line 41)
> > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2 (line 103)
> > ; -
> > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> > ; -
> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27 (line 118)
> >
> > 0x00000254b0d4bf7b: inc %r8d ;*iinc {reexecute=0
> rethrow=0 return_oop=0}
> > ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> (line 40)
> > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2 (line 103)
> > ; -
> > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> > ; -
> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27 (line 118)
> >
> > 0x00000254b0d4bf7e: cmp %r9d,%r8d
> > 0x00000254b0d4bf81: jl 0x00000254b0d4bf54 ;*goto {reexecute=0
> rethrow=0 return_oop=0}
> > ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> (line 40)
> > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2 (line 103)
> > ; -
> > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> > ; -
> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27 (line 118)
> >
> >
> >
> > Here is the link to the full log, should you want to take a look at it:
> >
> https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing
> > <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$
> >
> >
> > Best regards,
> > Nicolas Heutte
> >
> > On Thu, Feb 11, 2021 at 7:05 PM Vladimir Kozlov <
> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
> >
> > Changing wide mailing list to JIT compiler only.
> >
> > This deoptimization is normal in Tiered Compilation - it switched
> from profiling code (level='3') generated by C1
> > compiler to new code generated by C2 (level='4') which does loop
> optimizations.
> >
> > Thank you for posting inlining information:
> >
> > @ 17
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
> bytes) inline (hot)
> > \-> TypeProfile (14054/14054 counts) =
> com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
> >
> > I thought before that may be call site is not hot but it is not the
> case.
> >
> > You can do an other experiment to collect log with disabled Tiered
> Compilation (only C2 is used): -XX:-TieredCompilation
> > Also print assembler code (as you did before) for final compilation
> to see if loop is still not vectorized.
> >
> > Is it possible to post log file (on GitHub?) for me to look?
> >
> > Thanks,
> > Vladimir K
> >
> > On 2/11/21 6:28 AM, Nicolas Heutte wrote:
> > > Hi Vladimir,
> > >
> > > Thank you for your help.
> > >
> > > I'm currently running Java 11.0.9, and I did not use any VM flag
> of note.
> > >
> > > I checked the content of the compilation log, and it seems that
> ArrayFloatToArrayFloatVectorBinding::plus() was
> > > deoptimized in order to allow AVector::plus() to be compiled:
> > >
> > > <writer thread='11576'/>
> > > <task_queued compile_id='17280'
> method='com.qfs.vector.impl.AVector plus (Lcom/qfs/vector/IVector;)V'
> bytes='23'
> > > count='916' iicount='916' level='3' stamp='7394.056'
> comment='tiered' hot_count='896'/>
> > > <writer thread='15784'/>
> > > <deoptimized thread='15784' reason='constraint'
> pc='0x00000296d0785b94' compile_id='17257' compiler='c1' level='3'>
> > > <jvms bci='65'
> method='com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding plus
> > > (Lcom/qfs/vector/IVector;Lcom/qfs/vector/IVector;)V' bytes='69'
> count='909' backedge_count='155602' iicount='910'/>
> > > </deoptimized>
> > >
> > > The last compilation entry for AVector::plus() is:
> > >
> > > <writer thread='16380'/>
> > > <nmethod compile_id='17317' compiler='c2' level='4'
> entry='0x00000296d6af32c0' size='1960'
> > address='0x00000296d6af3110'
> > > relocation_offset='376' insts_offset='432' stub_offset='1040'
> scopes_data_offset='1152' scopes_pcs_offset='1592'
> > > dependencies_offset='1880' nul_chk_table_offset='1896'
> oops_offset='1064' metadata_offset='1072'
> > > method='com.qfs.vector.impl.AVector plus
> (Lcom/qfs/vector/IVector;)V' bytes='23' count='172425' iicount='172425'
> > > stamp='7394.199'/>
> > > <make_not_entrant thread='16380' compile_id='17280' compiler='c1'
> level='2' stamp='7394.199'/>
> > > @ 1
> com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4 bytes) inline
> > (hot)
> > > \-> TypeProfile (14552/14552
> counts) = com/qfs/vector/array/impl/ArrayFloatVector
> > > @ 7
> com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4 bytes) inline
> > (hot)
> > > \-> TypeProfile (14150/14150
> counts) = com/qfs/vector/array/impl/ArrayFloatVector
> > > @ 10
> com.qfs.vector.binding.impl.VectorBindings::getBinding (9 bytes) inline
> (hot)
> > > @ 5
> >
> com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::getBinding
> (22
> > > bytes) inline (hot)
> > > @ 3
> >
> com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::hasBinding
> > > (34 bytes) inline (hot)
> > > @ 17
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
> > bytes)
> > > inline (hot)
> > > \-> TypeProfile (14054/14054
> counts) =
> > > com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
> > > @ 12
> com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes) inline (hot)
> > > @ 22
> com.qfs.vector.impl.AVector::checkIndex (37 bytes) inline (hot)
> > > @ 6
> com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes) inline (hot)
> > > @ 27
> com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying (5 bytes)
> > accessor
> > > @ 34
> com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying (5 bytes)
> > accessor
> > > <writer thread='15896'/>
> > >
> > > Unfortunately, I do not have access to a debug VM build, so I
> cannot run the second test you recommend.
> > >
> > > Best regards,
> > > Nicolas Heutte
> > >
> > > On Wed, Feb 10, 2021 at 7:36 PM Vladimir Kozlov <
> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>
> > <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com>>> wrote:
> > >
> > > Hi, Nicolas
> > >
> > > Looks like, when inlined, the loop from
> ArrayFloatToArrayFloatVectorBinding::plus() was not optimized at all:
> > it is not
> > > unrolled and has range checks. Such loops are not vectorized
> (you need unrolling and no checks).
> > >
> > > What Java version you are running? What HotSpot VM flags you
> are using when running application?
> > >
> > > Run application with -XX:+LogCompilation and look on
> compilation data in hotspot_pid<PID>.log file for caller
> > > AVector::plus().
> > >
> > > VM also has several flags to trace loop optimizations but
> they are only available in debug VM build. If you
> > have access
> > > to such build run with -XX:+PrintCompilation
> -XX:+TraceLoopOpts flags.
> > >
> > > Thanks,
> > > Vladimir K
> > >
> > > On 2/10/21 9:24 AM, Nicolas Heutte wrote:
> > > > Hi all,
> > > >
> > > > I am encountering a performance issue caused by the
> interaction between
> > > > method inlining and automatic vectorization.
> > > >
> > > > Our application aggregates arrays intensively using a
> method named
> > > > ArrayFloatToArrayFloatVectorBinding.plus() with the
> following code:
> > > >
> > > > for (int i = 0; i < srcLen; ++i) {
> > > >
> > > > dstArray[i] += srcArray[i];
> > > >
> > > > }
> > > >
> > > > When we microbenchmark this method we observe fast
> performance close to the
> > > > practical memory bandwidth and when we print the assembly
> code we observe
> > > > loop unrolling and automatic vectorization with SIMD
> instructions.
> > > >
> > > > 0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
> > > >
> > > > 0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
> > > >
> > > > 0x000001ef4600ac05: movslq %r13d,%r11
> > > >
> > > > 0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
> > > >
> > > > 0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
> > > >
> > > > 0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
> > > >
> > > > 0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
> > > >
> > > > 0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
> > > >
> > > > 0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
> > > >
> > > > 0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
> > > >
> > > > 0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
> > > >
> > > > 0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)
> ;*fastore
> > > > {reexecute=0 rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> > > > (line 41)
> > > >
> > > > 0x000001ef4600acbf: add $0x40,%r13d ;*iinc
> {reexecute=0
> > > > rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> > > > (line 40)
> > > >
> > > > 0x000001ef4600acc3: cmp %eax,%r13d
> > > >
> > > > 0x000001ef4600acc6: jl 0x000001ef4600abf0 ;*goto
> {reexecute=0
> > > > rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> > > > (line 40)
> > > >
> > > >
> > > >
> > > > In the real application, this method is actually inlined
> in a higher level
> > > > method named AVector.plus(). Unfortunately, the inlined
> version of the
> > > > aggregation code is not vectorized anymore:
> > > >
> > > >
> > > >
> > > > 0x000001ef460180a0: cmp %ebx,%r11d
> > > >
> > > > 0x000001ef460180a3: jae 0x000001ef460180e6
> > > >
> > > > 0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1
> ;*faload {reexecute=0
> > > > rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54
> > > > (line 41)
> > > >
> > > > ; -
> > > > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > > >
> > > > 0x000001ef460180ac: cmp %ecx,%r11d
> > > >
> > > > 0x000001ef460180af: jae 0x000001ef46018104
> > > >
> > > > 0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
> > > >
> > > > 0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)
> ;*fastore {reexecute=0
> > > > rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> > > > (line 41)
> > > >
> > > > ; -
> > > > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > > >
> > > > 0x000001ef460180bf: inc %r11d ;*iinc
> {reexecute=0
> > > > rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> > > > (line 40)
> > > >
> > > > ; -
> > > > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > > >
> > > > 0x000001ef460180c2: cmp %r10d,%r11d
> > > >
> > > > 0x000001ef460180c5: jl 0x000001ef460180a0 ;*goto
> {reexecute=0
> > > > rethrow=0 return_oop=0}
> > > >
> > > > ; -
> > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> > > > (line 40)
> > > >
> > > > ; -
> > > > com.qfs.vector.impl.AVector::plus at 17 (line 204)
> > > >
> > > >
> > > >
> > > > This causes a significant performance drop, compared to a
> run where we
> > > > explicitly disable the inlining and observe automatically
> vectorized code
> > > > again (
> > > >
> -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
> > > > ).
> > > >
> > > >
> > > > How do you guys explain that behavior of the JIT compiler?
> Is this a known
> > > > and tracked issue, could it be fixed in the JVM? Can we do
> something in the
> > > > java code to prevent this from happening?
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Nicolas Heutte
> > > >
> > >
> >
>
More information about the hotspot-compiler-dev
mailing list