[External] : Re: SuperWord loop optimization lost after method inlining

Vladimir Kozlov vladimir.kozlov at oracle.com
Sat Feb 20 01:38:39 UTC 2021


BTW, I filed bug to collect information:

https://bugs.openjdk.java.net/browse/JDK-8262067

This is very weird case which I can't reproduce with small test. It reminds me one case (Loop did not transform into 
Counted loop) which was fixed in JDK 11.0.3:

https://bugs.openjdk.java.net/browse/JDK-8211451

Thanks,
Vladimir K

On 2/19/21 12:06 PM, Vladimir Kozlov wrote:
> I need an other file C:\Users\NicolasHeutte\AppData\Local\Temp\\hs_c10212_pid15016.log created from second C2 compiler 
> thread. It should have data for standalone ArrayFloatToArrayFloatVectorBinding::plus() method compilation. To see what 
> is going on I have to compare these data.
> 
> Thanks,
> Vladimir K
> 
> On 2/19/21 8:54 AM, Nicolas Heutte wrote:
>> Hello Vladimir,
>>
>> I've added the requested log to the shared folder 
>> (https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing 
>> <https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!IeK3TGC8Qf5eu0KH3LEkt26TEAWCirLIkTuJ2iAAmTkfBK4_Vnnr6gkOuOydizE_lFhl4g$>). 
>> I've also tried disabling the strip mining optimization as you suggested, but there was no significant performance 
>> change.
>>
>> Best regards,
>> Nicolas Heutte
>>
>> On Wed, Feb 17, 2021 at 9:05 PM Vladimir Kozlov <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
>>
>>     Unfortunately it is still not the file I am looking for.
>>
>>     First, remove -XX:+PrintAssembly flag from command line. I have already files with assembler code.
>>
>>     Second, I see link to the file I am looking for:
>>     <thread_logfile thread='16812' filename='C:\Users\NicolasHeutte\AppData\Local\Temp\\hs_c16812_pid15016.log'/>
>>
>>     If you still have it, please send it. If application stopped before normal exit that file is not merged into
>>     hotspot_pid<PID>.log file.
>>
>>     If you don't have it - do an other run with -XX:CICompilerCount=1 to use only one C2 compiler thread with Tiered
>>     off. It
>>     will simplify ordering of log.
>>
>>     You can also do an other experiment without collecting log. Run app with next flags to disable loop strip minning
>>     optimization:  -XX:-UseCountedLoopSafepoints -XX:LoopStripMiningIter=0
>>
>>     Thanks,
>>     Vladimir K
>>
>>     On 2/17/21 2:34 AM, Nicolas Heutte wrote:
>>      > Hi Vladimir,
>>      >
>>      > I have rerun the test with the appropriate options, the obtained logs are in this folder:
>>      > https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing
>>     
>> <https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!IeK3TGC8Qf5eu0KH3LEkt26TEAWCirLIkTuJ2iAAmTkfBK4_Vnnr6gkOuOydizE_lFhl4g$> 
>>
>>
>>      >
>>     
>> <https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvvSUuoKhQ$ 
>>
>>     
>> <https://urldefense.com/v3/__https://drive.google.com/drive/folders/1UczOggtTYp6TZ0QnBiwMxwdTBl3zuvqF?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvvSUuoKhQ$>> 
>>
>>      >
>>      > Best regards,
>>      > Nicolas Heutte
>>      >
>>      > On Tue, Feb 16, 2021 at 11:35 PM Vladimir Kozlov <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>
>>     <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>> wrote:
>>      >
>>      >     Hi Nicolas,
>>      >
>>      >     The file you shared has only assembler code. Yes, it shows that when
>>     ArrayFloatToArrayFloatVectorBinding::plus() is
>>      >     inlined into AVector::plus() it is not vectorized.
>>      >
>>      >     But I asked for an other file (hotspot_pid<PID>.log) which is generated when you run app with
>>      >     -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation flags. It should start with:
>>      >
>>      >     <?xml version='1.0' encoding='UTF-8'?>
>>      >     <hotspot_log version='160 1' process='2302014' time_ms='1613514688748'>
>>      >     <vm_version>
>>      >     <name>
>>      >     Java HotSpot(TM) 64-Bit Server VM
>>      >     </name>
>>      >     <release>
>>      >     11.0.9+7-LTS
>>      >     </release>
>>      >
>>      >     Thanks,
>>      >     Vladimir K
>>      >
>>      >     On 2/15/21 5:19 AM, Nicolas Heutte wrote:
>>      >      > Hi Vladimir,
>>      >      >
>>      >      > I've tried disabling tiered compilation, as you requested. It seems that the inlining was performed 
>> slightly
>>      >      > differently, but the issue remains. As you can see in this excerpt, the main loop isn't properly 
>> vectorized:
>>      >      >
>>      >      >    0x00000254b0d4bf54: cmp    %r11d,%r8d
>>      >      >    0x00000254b0d4bf57: jae    0x00000254b0d4c19e
>>      >      >    0x00000254b0d4bf5d: vmovss 0x10(%rcx,%r8,4),%xmm9  ;*faload {reexecute=0 rethrow=0 return_oop=0}
>>      >      >                                                  ; -
>>      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54 (line 41)
>>      >      >                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >                                                  ; - com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
>>     (line 103)
>>      >      >                                                  ; -
>>      >      > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70 (line 66)
>>      >      >                                                  ; -
>>     com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
>>      >     (line 118)
>>      >      >
>>      >      >    0x00000254b0d4bf64: cmp    %ebx,%r8d
>>      >      >    0x00000254b0d4bf67: jae    0x00000254b0d4c1ec
>>      >      >    0x00000254b0d4bf6d: vaddss 0x10(%rdi,%r8,4),%xmm9,%xmm9
>>      >      >    0x00000254b0d4bf74: vmovss %xmm9,0x10(%rcx,%r8,4)  ;*fastore {reexecute=0 rethrow=0 return_oop=0}
>>      >      >                                                  ; -
>>      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61 (line 41)
>>      >      >                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >                                                  ; - com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
>>     (line 103)
>>      >      >                                                  ; -
>>      >      > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70 (line 66)
>>      >      >                                                  ; -
>>     com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
>>      >     (line 118)
>>      >      >
>>      >      >    0x00000254b0d4bf7b: inc    %r8d               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>>      >      >                                                  ; -
>>      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62 (line 40)
>>      >      >                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >                                                  ; - com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
>>     (line 103)
>>      >      >                                                  ; -
>>      >      > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70 (line 66)
>>      >      >                                                  ; -
>>     com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
>>      >     (line 118)
>>      >      >
>>      >      >    0x00000254b0d4bf7e: cmp    %r9d,%r8d
>>      >      >    0x00000254b0d4bf81: jl     0x00000254b0d4bf54  ;*goto {reexecute=0 rethrow=0 return_oop=0}
>>      >      >                                                  ; -
>>      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65 (line 40)
>>      >      >                                                  ; - com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >                                                  ; - com.qfs.agg.impl.SumVectorAggregationBinding::plus at 2
>>     (line 103)
>>      >      >                                                  ; -
>>      >      > com.qfs.agg.impl.SumVectorAggregationBinding::safeVectorAggregate at 70 (line 66)
>>      >      >                                                  ; -
>>     com.qfs.agg.impl.AVectorAggregationBinding::safeAggregate at 27
>>      >     (line 118)
>>      >      >
>>      >      >
>>      >      >
>>      >      > Here is the link to the full log, should you want to take a look at it:
>>      >      > https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing
>>     
>> <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!IeK3TGC8Qf5eu0KH3LEkt26TEAWCirLIkTuJ2iAAmTkfBK4_Vnnr6gkOuOydizGvKpVqaQ$> 
>>
>>      >      
>>  <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvuMpg6inQ$ 
>> <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBl1ZdyC5xtmVS0QG3dxZxEen0D1LP-mBM0KnvmRVbQXpL_VPOQ9OD-pVGBqNvuMpg6inQ$>> 
>>
>>      >
>>      >      >
>>      >      
>>  <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$ 
>> <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$> 
>>
>>      >      
>>  <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$ 
>> <https://urldefense.com/v3/__https://drive.google.com/file/d/1KQU7wI8NjeElFv6RrQmUsUPRMnAefzhb/view?usp=sharing__;!!GqivPVa7Brio!PBuP6MfDNWUOTe23SSXA0V5wn_VHjv2sjI8POWRwp6mr0wVdIzFhNoVZANb4FqCYKwzapw$>>> 
>>
>>      >      >
>>      >      > Best regards,
>>      >      > Nicolas Heutte
>>      >      >
>>      >      > On Thu, Feb 11, 2021 at 7:05 PM Vladimir Kozlov <vladimir.kozlov at oracle.com
>>     <mailto:vladimir.kozlov at oracle.com> <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>
>>      >     <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com> <mailto:vladimir.kozlov at oracle.com
>>     <mailto:vladimir.kozlov at oracle.com>>>> wrote:
>>      >      >
>>      >      >     Changing wide mailing list to JIT compiler only.
>>      >      >
>>      >      >     This deoptimization is normal in Tiered Compilation - it switched from profiling code (level='3')
>>     generated by C1
>>      >      >     compiler to new code generated by C2 (level='4') which does loop optimizations.
>>      >      >
>>      >      >     Thank you for posting inlining information:
>>      >      >
>>      >      >           @ 17   com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69 bytes) inline 
>> (hot)
>>      >      >              \-> TypeProfile (14054/14054 counts) =
>>     com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
>>      >      >
>>      >      >     I thought before that may be call site is not hot but it is not the case.
>>      >      >
>>      >      >     You can do an other experiment to collect log with disabled Tiered Compilation (only C2 is used):
>>      >     -XX:-TieredCompilation
>>      >      >     Also print assembler code (as you did before) for final compilation to see if loop is still not
>>     vectorized.
>>      >      >
>>      >      >     Is it possible to post log file (on GitHub?) for me to look?
>>      >      >
>>      >      >     Thanks,
>>      >      >     Vladimir K
>>      >      >
>>      >      >     On 2/11/21 6:28 AM, Nicolas Heutte wrote:
>>      >      >      > Hi Vladimir,
>>      >      >      >
>>      >      >      > Thank you for your help.
>>      >      >      >
>>      >      >      > I'm currently running Java 11.0.9, and I did not use any VM flag of note.
>>      >      >      >
>>      >      >      > I checked the content of the compilation log, and it seems that
>>      >     ArrayFloatToArrayFloatVectorBinding::plus() was
>>      >      >      > deoptimized in order to allow AVector::plus() to be compiled:
>>      >      >      >
>>      >      >      > <writer thread='11576'/>
>>      >      >      > <task_queued compile_id='17280' method='com.qfs.vector.impl.AVector plus 
>> (Lcom/qfs/vector/IVector;)V'
>>      >     bytes='23'
>>      >      >      > count='916' iicount='916' level='3' stamp='7394.056' comment='tiered' hot_count='896'/>
>>      >      >      > <writer thread='15784'/>
>>      >      >      > <deoptimized thread='15784' reason='constraint' pc='0x00000296d0785b94' compile_id='17257'
>>     compiler='c1'
>>      >     level='3'>
>>      >      >      > <jvms bci='65' method='com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding plus
>>      >      >      > (Lcom/qfs/vector/IVector;Lcom/qfs/vector/IVector;)V' bytes='69' count='909' backedge_count='155602'
>>      >     iicount='910'/>
>>      >      >      > </deoptimized>
>>      >      >      >
>>      >      >      > The last compilation entry for AVector::plus() is:
>>      >      >      >
>>      >      >      > <writer thread='16380'/>
>>      >      >      > <nmethod compile_id='17317' compiler='c2' level='4' entry='0x00000296d6af32c0' size='1960'
>>      >      >     address='0x00000296d6af3110'
>>      >      >      > relocation_offset='376' insts_offset='432' stub_offset='1040' scopes_data_offset='1152'
>>      >     scopes_pcs_offset='1592'
>>      >      >      > dependencies_offset='1880' nul_chk_table_offset='1896' oops_offset='1064' metadata_offset='1072'
>>      >      >      > method='com.qfs.vector.impl.AVector plus (Lcom/qfs/vector/IVector;)V' bytes='23' count='172425'
>>      >     iicount='172425'
>>      >      >      > stamp='7394.199'/>
>>      >      >      > <make_not_entrant thread='16380' compile_id='17280' compiler='c1' level='2' stamp='7394.199'/>
>>      >      >      >                                @ 1   com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4
>>     bytes)
>>      >     inline
>>      >      >     (hot)
>>      >      >      >                                 \-> TypeProfile (14552/14552 counts) =
>>      >     com/qfs/vector/array/impl/ArrayFloatVector
>>      >      >      >                                @ 7   com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4
>>     bytes)
>>      >     inline
>>      >      >     (hot)
>>      >      >      >                                 \-> TypeProfile (14150/14150 counts) =
>>      >     com/qfs/vector/array/impl/ArrayFloatVector
>>      >      >      >                                @ 10   com.qfs.vector.binding.impl.VectorBindings::getBinding (9 
>> bytes)
>>      >     inline (hot)
>>      >      >      >                                  @ 5
>>      >      >     com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::getBinding (22
>>      >      >      > bytes)   inline (hot)
>>      >      >      >                                    @ 3
>>      >      >     com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::hasBinding
>>      >      >      > (34 bytes)   inline (hot)
>>      >      >      >                                @ 17
>>      >     com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
>>      >      >     bytes)
>>      >      >      > inline (hot)
>>      >      >      >                                 \-> TypeProfile (14054/14054 counts) =
>>      >      >      > com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
>>      >      >      >                                  @ 12   com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)
>>      >     inline (hot)
>>      >      >      >                                  @ 22   com.qfs.vector.impl.AVector::checkIndex (37 bytes)   inline
>>     (hot)
>>      >      >      >                                    @ 6   com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)
>>      >     inline (hot)
>>      >      >      >                                  @ 27   com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying
>>     (5 bytes)
>>      >      >     accessor
>>      >      >      >                                  @ 34   com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying
>>     (5 bytes)
>>      >      >     accessor
>>      >      >      > <writer thread='15896'/>
>>      >      >      >
>>      >      >      > Unfortunately, I do not have access to a debug VM build, so I cannot run the second test you 
>> recommend.
>>      >      >      >
>>      >      >      > Best regards,
>>      >      >      > Nicolas Heutte
>>      >      >      >
>>      >      >      > On Wed, Feb 10, 2021 at 7:36 PM Vladimir Kozlov <vladimir.kozlov at oracle.com
>>     <mailto:vladimir.kozlov at oracle.com>
>>      >     <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> <mailto:vladimir.kozlov at oracle.com
>>     <mailto:vladimir.kozlov at oracle.com> <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>>
>>      >      >     <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>
>>     <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> <mailto:vladimir.kozlov at oracle.com
>>     <mailto:vladimir.kozlov at oracle.com>
>>      >     <mailto:vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>>>> wrote:
>>      >      >      >
>>      >      >      >     Hi, Nicolas
>>      >      >      >
>>      >      >      >     Looks like, when inlined, the loop from ArrayFloatToArrayFloatVectorBinding::plus() was not
>>     optimized
>>      >     at all:
>>      >      >     it is not
>>      >      >      >     unrolled and has range checks. Such loops are not vectorized (you need unrolling and no checks).
>>      >      >      >
>>      >      >      >     What Java version you are running? What HotSpot VM flags you are using when running application?
>>      >      >      >
>>      >      >      >     Run application with -XX:+LogCompilation and look on compilation data in hotspot_pid<PID>.log
>>     file for
>>      >     caller
>>      >      >      >     AVector::plus().
>>      >      >      >
>>      >      >      >     VM also has several flags to trace loop optimizations but they are only available in debug VM
>>     build.
>>      >     If you
>>      >      >     have access
>>      >      >      >     to such build run with -XX:+PrintCompilation -XX:+TraceLoopOpts flags.
>>      >      >      >
>>      >      >      >     Thanks,
>>      >      >      >     Vladimir K
>>      >      >      >
>>      >      >      >     On 2/10/21 9:24 AM, Nicolas Heutte wrote:
>>      >      >      >      > Hi all,
>>      >      >      >      >
>>      >      >      >      > I am encountering a performance issue caused by the interaction between
>>      >      >      >      > method inlining and automatic vectorization.
>>      >      >      >      >
>>      >      >      >      > Our application aggregates arrays intensively using a method named
>>      >      >      >      > ArrayFloatToArrayFloatVectorBinding.plus() with the following code:
>>      >      >      >      >
>>      >      >      >      >      for (int i = 0; i < srcLen; ++i) {
>>      >      >      >      >
>>      >      >      >      >              dstArray[i] += srcArray[i];
>>      >      >      >      >
>>      >      >      >      >      }
>>      >      >      >      >
>>      >      >      >      > When we microbenchmark this method we observe fast performance close to the
>>      >      >      >      > practical memory bandwidth and when we print the assembly code we observe
>>      >      >      >      > loop unrolling and automatic vectorization with SIMD instructions.
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac05: movslq %r13d,%r11
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore
>>      >      >      >      > {reexecute=0 rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
>>      >      >      >      > (line 41)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0
>>      >      >      >      > rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
>>      >      >      >      > (line 40)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600acc3: cmp    %eax,%r13d
>>      >      >      >      >
>>      >      >      >      >    0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0
>>      >      >      >      > rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
>>      >      >      >      > (line 40)
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      > In the real application, this method is actually inlined in a higher level
>>      >      >      >      > method named AVector.plus(). Unfortunately, the inlined version of the
>>      >      >      >      > aggregation code is not vectorized anymore:
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180a0: cmp    %ebx,%r11d
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180a3: jae    0x000001ef460180e6
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload {reexecute=0
>>      >      >      >      > rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 54
>>      >      >      >      > (line 41)
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180ac: cmp    %ecx,%r11d
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180af: jae    0x000001ef46018104
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore {reexecute=0
>>      >      >      >      > rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 61
>>      >      >      >      > (line 41)
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0
>>      >      >      >      > rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 62
>>      >      >      >      > (line 40)
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180c2: cmp    %r10d,%r11d
>>      >      >      >      >
>>      >      >      >      >    0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0
>>      >      >      >      > rethrow=0 return_oop=0}
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus at 65
>>      >      >      >      > (line 40)
>>      >      >      >      >
>>      >      >      >      >                                                  ; -
>>      >      >      >      > com.qfs.vector.impl.AVector::plus at 17 (line 204)
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      > This causes a significant performance drop, compared to a run where we
>>      >      >      >      > explicitly disable the inlining and observe automatically vectorized code
>>      >      >      >      > again (
>>      >      >      >      >
>>     -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
>>      >      >      >      > ).
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      > How do you guys explain that behavior of the JIT compiler? Is this a known
>>      >      >      >      > and tracked issue, could it be fixed in the JVM? Can we do something in the
>>      >      >      >      > java code to prevent this from happening?
>>      >      >      >      >
>>      >      >      >      >
>>      >      >      >      > Best regards,
>>      >      >      >      >
>>      >      >      >      > Nicolas Heutte
>>      >      >      >      >
>>      >      >      >
>>      >      >
>>      >
>>


More information about the hotspot-compiler-dev mailing list