[External] : Re: SuperWord loop optimization lost after method inlining
Nicolas Heutte
nhe at activeviam.com
Mon Feb 22 13:12:10 UTC 2021
Hello Vladimir,
Thanks for filing the bug. It's indeed a very surprising one.
I've also added the missing file you needed to the shared folder.
Best regards,
Nicolas Heutte
On Sat, Feb 20, 2021 at 2:38 AM Vladimir Kozlov <vladimir.kozlov at oracle.com>
wrote:
> BTW, I filed bug to collect information:
>
> https://bugs.openjdk.java.net/browse/JDK-8262067
>
> This is very weird case which I can't reproduce with small test. It
> reminds me one case (Loop did not transform into
> Counted loop) which was fixed in JDK 11.0.3:
>
> https://bugs.openjdk.java.net/browse/JDK-8211451
>
> Thanks,
> Vladimir K
>
> On 2/19/21 12:06 PM, Vladimir Kozlov wrote:
> > I need an other file
> C:\Users\NicolasHeutte\AppData\Local\Temp\\hs_c10212_pid15016.log created
> from second C2 compiler
> > thread. It should have data for standalone
> ArrayFloatToArrayFloatVectorBinding::plus() method compilation. To see what
> > is going on I have to compare these data.
> >
> > Thanks,
> > Vladimir K
> >
> > On 2/19/21 8:54 AM, Nicolas Heutte wrote:
> >> Hello Vladimir,
> >>
> >> I've added the requested log to the shared folder
> >> (
> https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing
> >> <
> https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!IeK3TGC8Qf5eu0KH3LEkt26TEAWCirLIkTuJ2iAAmTkfBK4_Vnnr6gkOuOydizE_lFhl4g$>).
>
> >> I've also tried disabling the strip mining optimization as you
> suggested, but there was no significant performance
> >> change.
> >>
> >> Best regards,
> >> Nicolas Heutte
> >>
> >> On Wed, Feb 17, 2021 at 9:05 PM Vladimir Kozlov <
> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
> >>
> >> Unfortunately it is still not the file I am looking for.
> >>
> >> First, remove -XX:+PrintAssembly flag from command line. I have
> already files with assembler code.
> >>
> >> Second, I see link to the file I am looking for:
> >> <thread_logfile thread='16812'
> filename='C:\Users\NicolasHeutte\AppData\Local\Temp\\hs_c16812_pid15016.log'/>
> >>
> >> If you still have it, please send it. If application stopped before
> normal exit that file is not merged into
> >> hotspot_pid<PID>.log file.
> >>
> >> If you don't have it - do an other run with -XX:CICompilerCount=1
> to use only one C2 compiler thread with Tiered
> >> off. It
> >> will simplify ordering of log.
> >>
> >> You can also do an other experiment without collecting log. Run app
> with next flags to disable loop strip minning
> >> optimization: -XX:-UseCountedLoopSafepoints
> -XX:LoopStripMiningIter=0
> >>
> >> Thanks,
> >> Vladimir K
> >>
> >> On 2/17/21 2:34 AM, Nicolas Heutte wrote:
> >> > Hi Vladimir,
> >> >
> >> > I have rerun the test with the appropriate options, the obtained
> logs are in this folder:
> >> >
> https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing
> >>
> >> <
> https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!IeK3TGC8Qf5eu0KH3LEkt26TEAWCirLIkTuJ2iAAmTkfBK4_Vnnr6gkOuOydizE_lFhl4g$>
>
> >>
> >>
> >> >
> >>
> >> <
> https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvvSUuoKhQ$
> >>
> >>
> >> <
> https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvvSUuoKhQ$>>
>
> >>
> >> >
> >> > Best regards,
> >> > Nicolas Heutte
> >> >
> >> > On Tue, Feb 16, 2021 at 11:35 PM Vladimir Kozlov <
> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>
> >> <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com>>> wrote:
> >> >
> >> > Hi Nicolas,
> >> >
> >> > The file you shared has only assembler code. Yes, it shows
> that when
> >> ArrayFloatToArrayFloatVectorBinding::plus() is
> >> > inlined into AVector::plus() it is not vectorized.
> >> >
> >> > But I asked for an other file (hotspot_pid<PID>.log) which
> is generated when you run app with
> >> > -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation flags. It
> should start with:
> >> >
> >> > <?xml version='1.0' encoding='UTF-8'?>
> >> > <hotspot_log version='160 1' process='2302014'
> time_ms='1613514688748'>
> >> > <vm_version>
> >> > <name>
> >> > Java HotSpot(TM) 64-Bit Server VM
> >> > </name>
> >> > <release>
> >> > 11.0.9+7-LTS
> >> > </release>
> >> >
> >> > Thanks,
> >> > Vladimir K
> >> >
> >> > On 2/15/21 5:19 AM, Nicolas Heutte wrote:
> >> > > Hi Vladimir,
> >> > >
> >> > > I've tried disabling tiered compilation, as you
> requested. It seems that the inlining was performed
> >> slightly
> >> > > differently, but the issue remains. As you can see in
> this excerpt, the main loop isn't properly
> >> vectorized:
> >> > >
> >> > > 0x00000254b0d4bf54: cmp %r11d,%r8d
> >> > > 0x00000254b0d4bf57: jae 0x00000254b0d4c19e
> >> > > 0x00000254b0d4bf5d: vmovss 0x10(%rcx,%r8,4),%xmm9
> ;*faload {reexecute=0 rethrow=0 return_oop=0}
> >> > > ; -
> >> > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54
> (line 41)
> >> > > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >> > > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
> >> (line 103)
> >> > > ; -
> >> > >
> com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> >> > > ; -
> >> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
> >> > (line 118)
> >> > >
> >> > > 0x00000254b0d4bf64: cmp %ebx,%r8d
> >> > > 0x00000254b0d4bf67: jae 0x00000254b0d4c1ec
> >> > > 0x00000254b0d4bf6d: vaddss 0x10(%rdi,%r8,4),%xmm9,%xmm9
> >> > > 0x00000254b0d4bf74: vmovss %xmm9,0x10(%rcx,%r8,4)
> ;*fastore {reexecute=0 rethrow=0 return_oop=0}
> >> > > ; -
> >> > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> (line 41)
> >> > > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >> > > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
> >> (line 103)
> >> > > ; -
> >> > >
> com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> >> > > ; -
> >> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
> >> > (line 118)
> >> > >
> >> > > 0x00000254b0d4bf7b: inc %r8d ;*iinc
> {reexecute=0 rethrow=0 return_oop=0}
> >> > > ; -
> >> > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> (line 40)
> >> > > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >> > > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
> >> (line 103)
> >> > > ; -
> >> > >
> com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> >> > > ; -
> >> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
> >> > (line 118)
> >> > >
> >> > > 0x00000254b0d4bf7e: cmp %r9d,%r8d
> >> > > 0x00000254b0d4bf81: jl 0x00000254b0d4bf54 ;*goto
> {reexecute=0 rethrow=0 return_oop=0}
> >> > > ; -
> >> > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> (line 40)
> >> > > ; -
> com.qfs.vector.impl.AVector::plus at 17 (line 204)
> >> > > ; -
> com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
> >> (line 103)
> >> > > ; -
> >> > >
> com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70
> (line 66)
> >> > > ; -
> >> com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
> >> > (line 118)
> >> > >
> >> > >
> >> > >
> >> > > Here is the link to the full log, should you want to take
> a look at it:
> >> > >
> https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing
> >>
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!IeK3TGC8Qf5eu0KH3LEkt26TEAWCirLIkTuJ2iAAmTkfBK4_Vnnr6gkOuOydizGvKpVqaQ$>
>
> >>
> >> >
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvuMpg6inQ$
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvuMpg6inQ$>>
>
> >>
> >> >
> >> > >
> >> >
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$>
>
> >>
> >> >
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$
> >> <
> https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$>>>
>
> >>
> >> > >
> >> > > Best regards,
> >> > > Nicolas Heutte
> >> > >
> >> > > On Thu, Feb 11, 2021 at 7:05 PM Vladimir Kozlov <
> vladimir.kozlov at oracle.com
> >> <mailto:vladimir.kozlov at oracle.com> <mailto:
> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>
> >> > <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com> <mailto:vladimir.kozlov at oracle.com
> >> <mailto:vladimir.kozlov at oracle.com>>>> wrote:
> >> > >
> >> > > Changing wide mailing list to JIT compiler only.
> >> > >
> >> > > This deoptimization is normal in Tiered Compilation -
> it switched from profiling code (level='3')
> >> generated by C1
> >> > > compiler to new code generated by C2 (level='4')
> which does loop optimizations.
> >> > >
> >> > > Thank you for posting inlining information:
> >> > >
> >> > > @ 17
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
> bytes) inline
> >> (hot)
> >> > > \-> TypeProfile (14054/14054 counts) =
> >> com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
> >> > >
> >> > > I thought before that may be call site is not hot but
> it is not the case.
> >> > >
> >> > > You can do an other experiment to collect log with
> disabled Tiered Compilation (only C2 is used):
> >> > -XX:-TieredCompilation
> >> > > Also print assembler code (as you did before) for
> final compilation to see if loop is still not
> >> vectorized.
> >> > >
> >> > > Is it possible to post log file (on GitHub?) for me
> to look?
> >> > >
> >> > > Thanks,
> >> > > Vladimir K
> >> > >
> >> > > On 2/11/21 6:28 AM, Nicolas Heutte wrote:
> >> > > > Hi Vladimir,
> >> > > >
> >> > > > Thank you for your help.
> >> > > >
> >> > > > I'm currently running Java 11.0.9, and I did not
> use any VM flag of note.
> >> > > >
> >> > > > I checked the content of the compilation log, and
> it seems that
> >> > ArrayFloatToArrayFloatVectorBinding::plus() was
> >> > > > deoptimized in order to allow AVector::plus() to
> be compiled:
> >> > > >
> >> > > > <writer thread='11576'/>
> >> > > > <task_queued compile_id='17280'
> method='com.qfs.vector.impl.AVector plus
> >> (Lcom/qfs/vector/IVector;)V'
> >> > bytes='23'
> >> > > > count='916' iicount='916' level='3'
> stamp='7394.056' comment='tiered' hot_count='896'/>
> >> > > > <writer thread='15784'/>
> >> > > > <deoptimized thread='15784' reason='constraint'
> pc='0x00000296d0785b94' compile_id='17257'
> >> compiler='c1'
> >> > level='3'>
> >> > > > <jvms bci='65'
> method='com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding plus
> >> > > >
> (Lcom/qfs/vector/IVector;Lcom/qfs/vector/IVector;)V' bytes='69' count='909'
> backedge_count='155602'
> >> > iicount='910'/>
> >> > > > </deoptimized>
> >> > > >
> >> > > > The last compilation entry for AVector::plus() is:
> >> > > >
> >> > > > <writer thread='16380'/>
> >> > > > <nmethod compile_id='17317' compiler='c2'
> level='4' entry='0x00000296d6af32c0' size='1960'
> >> > > address='0x00000296d6af3110'
> >> > > > relocation_offset='376' insts_offset='432'
> stub_offset='1040' scopes_data_offset='1152'
> >> > scopes_pcs_offset='1592'
> >> > > > dependencies_offset='1880'
> nul_chk_table_offset='1896' oops_offset='1064' metadata_offset='1072'
> >> > > > method='com.qfs.vector.impl.AVector plus
> (Lcom/qfs/vector/IVector;)V' bytes='23' count='172425'
> >> > iicount='172425'
> >> > > > stamp='7394.199'/>
> >> > > > <make_not_entrant thread='16380'
> compile_id='17280' compiler='c1' level='2' stamp='7394.199'/>
> >> > > > @ 1
> com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4
> >> bytes)
> >> > inline
> >> > > (hot)
> >> > > > \-> TypeProfile
> (14552/14552 counts) =
> >> > com/qfs/vector/array/impl/ArrayFloatVector
> >> > > > @ 7
> com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4
> >> bytes)
> >> > inline
> >> > > (hot)
> >> > > > \-> TypeProfile
> (14150/14150 counts) =
> >> > com/qfs/vector/array/impl/ArrayFloatVector
> >> > > > @ 10
> com.qfs.vector.binding.impl.VectorBindings::getBinding (9
> >> bytes)
> >> > inline (hot)
> >> > > > @ 5
> >> > >
> com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::getBinding
> (22
> >> > > > bytes) inline (hot)
> >> > > > @ 3
> >> > >
> com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::hasBinding
> >> > > > (34 bytes) inline (hot)
> >> > > > @ 17
> >> >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
> >> > > bytes)
> >> > > > inline (hot)
> >> > > > \-> TypeProfile
> (14054/14054 counts) =
> >> > > >
> com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
> >> > > > @ 12
> com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)
> >> > inline (hot)
> >> > > > @ 22
> com.qfs.vector.impl.AVector::checkIndex (37 bytes) inline
> >> (hot)
> >> > > > @ 6
> com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)
> >> > inline (hot)
> >> > > > @ 27
> com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying
> >> (5 bytes)
> >> > > accessor
> >> > > > @ 34
> com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying
> >> (5 bytes)
> >> > > accessor
> >> > > > <writer thread='15896'/>
> >> > > >
> >> > > > Unfortunately, I do not have access to a debug VM
> build, so I cannot run the second test you
> >> recommend.
> >> > > >
> >> > > > Best regards,
> >> > > > Nicolas Heutte
> >> > > >
> >> > > > On Wed, Feb 10, 2021 at 7:36 PM Vladimir Kozlov <
> vladimir.kozlov at oracle.com
> >> <mailto:vladimir.kozlov at oracle.com>
> >> > <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com>> <mailto:vladimir.kozlov at oracle.com
> >> <mailto:vladimir.kozlov at oracle.com> <mailto:
> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>>
> >> > > <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com>
> >> <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com>> <mailto:vladimir.kozlov at oracle.com
> >> <mailto:vladimir.kozlov at oracle.com>
> >> > <mailto:vladimir.kozlov at oracle.com <mailto:
> vladimir.kozlov at oracle.com>>>>> wrote:
> >> > > >
> >> > > > Hi, Nicolas
> >> > > >
> >> > > > Looks like, when inlined, the loop from
> ArrayFloatToArrayFloatVectorBinding::plus() was not
> >> optimized
> >> > at all:
> >> > > it is not
> >> > > > unrolled and has range checks. Such loops are
> not vectorized (you need unrolling and no checks).
> >> > > >
> >> > > > What Java version you are running? What
> HotSpot VM flags you are using when running application?
> >> > > >
> >> > > > Run application with -XX:+LogCompilation and
> look on compilation data in hotspot_pid<PID>.log
> >> file for
> >> > caller
> >> > > > AVector::plus().
> >> > > >
> >> > > > VM also has several flags to trace loop
> optimizations but they are only available in debug VM
> >> build.
> >> > If you
> >> > > have access
> >> > > > to such build run with -XX:+PrintCompilation
> -XX:+TraceLoopOpts flags.
> >> > > >
> >> > > > Thanks,
> >> > > > Vladimir K
> >> > > >
> >> > > > On 2/10/21 9:24 AM, Nicolas Heutte wrote:
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I am encountering a performance issue
> caused by the interaction between
> >> > > > > method inlining and automatic vectorization.
> >> > > > >
> >> > > > > Our application aggregates arrays
> intensively using a method named
> >> > > > > ArrayFloatToArrayFloatVectorBinding.plus()
> with the following code:
> >> > > > >
> >> > > > > for (int i = 0; i < srcLen; ++i) {
> >> > > > >
> >> > > > > dstArray[i] += srcArray[i];
> >> > > > >
> >> > > > > }
> >> > > > >
> >> > > > > When we microbenchmark this method we
> observe fast performance close to the
> >> > > > > practical memory bandwidth and when we
> print the assembly code we observe
> >> > > > > loop unrolling and automatic vectorization
> with SIMD instructions.
> >> > > > >
> >> > > > > 0x000001ef4600abf0: vmovdqu
> 0x10(%r14,%r13,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600abf7: vaddps
> 0x10(%rcx,%r13,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600abfe: vmovdqu
> %ymm0,0x10(%r14,%r13,4)
> >> > > > >
> >> > > > > 0x000001ef4600ac05: movslq %r13d,%r11
> >> > > > >
> >> > > > > 0x000001ef4600ac08: vmovdqu
> 0x30(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac0f: vaddps
> 0x30(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac16: vmovdqu
> %ymm0,0x30(%r14,%r11,4)
> >> > > > >
> >> > > > > 0x000001ef4600ac1d: vmovdqu
> 0x50(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac24: vaddps
> 0x50(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac2b: vmovdqu
> %ymm0,0x50(%r14,%r11,4)
> >> > > > >
> >> > > > > 0x000001ef4600ac32: vmovdqu
> 0x70(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac39: vaddps
> 0x70(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac40: vmovdqu
> %ymm0,0x70(%r14,%r11,4)
> >> > > > >
> >> > > > > 0x000001ef4600ac47: vmovdqu
> 0x90(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac51: vaddps
> 0x90(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac5b: vmovdqu
> %ymm0,0x90(%r14,%r11,4)
> >> > > > >
> >> > > > > 0x000001ef4600ac65: vmovdqu
> 0xb0(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac6f: vaddps
> 0xb0(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac79: vmovdqu
> %ymm0,0xb0(%r14,%r11,4)
> >> > > > >
> >> > > > > 0x000001ef4600ac83: vmovdqu
> 0xd0(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac8d: vaddps
> 0xd0(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600ac97: vmovdqu
> %ymm0,0xd0(%r14,%r11,4)
> >> > > > >
> >> > > > > 0x000001ef4600aca1: vmovdqu
> 0xf0(%r14,%r11,4),%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600acab: vaddps
> 0xf0(%rcx,%r11,4),%ymm0,%ymm0
> >> > > > >
> >> > > > > 0x000001ef4600acb5: vmovdqu
> %ymm0,0xf0(%r14,%r11,4) ;*fastore
> >> > > > > {reexecute=0 rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> >> > > > > (line 41)
> >> > > > >
> >> > > > > 0x000001ef4600acbf: add $0x40,%r13d
> ;*iinc {reexecute=0
> >> > > > > rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> >> > > > > (line 40)
> >> > > > >
> >> > > > > 0x000001ef4600acc3: cmp %eax,%r13d
> >> > > > >
> >> > > > > 0x000001ef4600acc6: jl
> 0x000001ef4600abf0 ;*goto {reexecute=0
> >> > > > > rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> >> > > > > (line 40)
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > In the real application, this method is
> actually inlined in a higher level
> >> > > > > method named AVector.plus(). Unfortunately,
> the inlined version of the
> >> > > > > aggregation code is not vectorized anymore:
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > 0x000001ef460180a0: cmp %ebx,%r11d
> >> > > > >
> >> > > > > 0x000001ef460180a3: jae
> 0x000001ef460180e6
> >> > > > >
> >> > > > > 0x000001ef460180a5: vmovss
> 0x10(%r8,%r11,4),%xmm1 ;*faload {reexecute=0
> >> > > > > rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54
> >> > > > > (line 41)
> >> > > > >
> >> > > > >
> ; -
> >> > > > > com.qfs.vector.impl.AVector::plus at 17 (line
> 204)
> >> > > > >
> >> > > > > 0x000001ef460180ac: cmp %ecx,%r11d
> >> > > > >
> >> > > > > 0x000001ef460180af: jae
> 0x000001ef46018104
> >> > > > >
> >> > > > > 0x000001ef460180b1: vaddss
> 0x10(%r9,%r11,4),%xmm1,%xmm1
> >> > > > >
> >> > > > > 0x000001ef460180b8: vmovss
> %xmm1,0x10(%r8,%r11,4) ;*fastore {reexecute=0
> >> > > > > rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
> >> > > > > (line 41)
> >> > > > >
> >> > > > >
> ; -
> >> > > > > com.qfs.vector.impl.AVector::plus at 17 (line
> 204)
> >> > > > >
> >> > > > > 0x000001ef460180bf: inc %r11d
> ;*iinc {reexecute=0
> >> > > > > rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
> >> > > > > (line 40)
> >> > > > >
> >> > > > >
> ; -
> >> > > > > com.qfs.vector.impl.AVector::plus at 17 (line
> 204)
> >> > > > >
> >> > > > > 0x000001ef460180c2: cmp %r10d,%r11d
> >> > > > >
> >> > > > > 0x000001ef460180c5: jl
> 0x000001ef460180a0 ;*goto {reexecute=0
> >> > > > > rethrow=0 return_oop=0}
> >> > > > >
> >> > > > >
> ; -
> >> > > > >
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
> >> > > > > (line 40)
> >> > > > >
> >> > > > >
> ; -
> >> > > > > com.qfs.vector.impl.AVector::plus at 17 (line
> 204)
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > This causes a significant performance drop,
> compared to a run where we
> >> > > > > explicitly disable the inlining and observe
> automatically vectorized code
> >> > > > > again (
> >> > > > >
> >>
> -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
> >> > > > > ).
> >> > > > >
> >> > > > >
> >> > > > > How do you guys explain that behavior of
> the JIT compiler? Is this a known
> >> > > > > and tracked issue, could it be fixed in the
> JVM? Can we do something in the
> >> > > > > java code to prevent this from happening?
> >> > > > >
> >> > > > >
> >> > > > > Best regards,
> >> > > > >
> >> > > > > Nicolas Heutte
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
More information about the hotspot-compiler-dev
mailing list