RFR: 8283232: x86: Improve vector broadcast operations [v8]

Thu Jul 28 18:20:00 UTC 2022

On Tue, 26 Jul 2022 12:48:16 GMT, Quan Anh Mai <duke at openjdk.org> wrote:

>> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4388:
>> 
>>> 4386: 
>>> 4387: void MacroAssembler::vallones(XMMRegister dst, int vector_len) {
>>> 4388:   // vpcmpeqd has special dependency treatment so it should be preferred to vpternlogd
>> 
>> The comment is not clear; adding a relevant reference would add more value.
>
> I have re-checked the statement by measurement. It seems that only the non-VEX encoded version receives the special dependency treatment, so I reverted this change and added a comment for clarification.
> 
> The optimisation is noted in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) for several architectures, for example in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline).
> 
> I have performed measurements on uica.uops.info. This sequence gives 1.37 cycles/iteration on Skylake and Ice Lake:
> 
>     pcmpeqd xmm0, xmm0
>     paddd xmm0, xmm1
>     paddd xmm0, xmm1
>     paddd xmm0, xmm1
> 
> This version has a throughput of 4 cycles/iteration:
> 
>     vpcmpeqd xmm0, xmm0, xmm0
>     vpaddd xmm0, xmm1, xmm0
>     vpaddd xmm0, xmm1, xmm0
>     vpaddd xmm0, xmm1, xmm0
> 
> This indicates that `vpcmpeqd` fails to break the dependency on `xmm0`, unlike the `pcmpeqd` instruction.
> 
> Thanks.

Both of the above JIT sequences have a true dependency chain, so there is no scope for an additional architecture-imposed false dependency, the kind we use dep-breaking idioms for, to degrade performance any further.
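
For reference, the dependency structure of the two quoted sequences can be sketched with a toy latency model. This is only an illustration under loud assumptions: every instruction is given a 1-cycle latency, execution ports are unlimited, and the names `Insn`, `loopCarriedLatency`, `bodyWithIdiom` and `bodyWithoutIdiom` are hypothetical helpers, not HotSpot or uiCA code. The model reports how much the critical path grows per loop iteration; it does not reproduce the 1.37 cycles/iteration figure, which depends on port pressure that is not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (assumption): 1-cycle latency per instruction, unlimited ports.
struct Insn {
    int dst;                // register written
    std::vector<int> srcs;  // registers the instruction has to wait for
};

// Returns how many cycles the critical path grows per iteration once the
// loop reaches steady state. 0 means consecutive iterations are independent.
int loopCarriedLatency(const std::vector<Insn>& body, int numRegs) {
    assert(!body.empty());
    std::vector<long> ready(numRegs, 0);  // cycle at which each register is ready
    long prevEnd = 0;
    long growth = 0;
    for (int iter = 0; iter < 16; ++iter) {
        long end = 0;
        for (const Insn& insn : body) {
            long start = 0;
            for (int src : insn.srcs) start = std::max(start, ready[src]);
            ready[insn.dst] = start + 1;
            end = std::max(end, ready[insn.dst]);
        }
        growth = end - prevEnd;  // the last iteration's value is the steady state
        prevEnd = end;
    }
    return static_cast<int>(growth);
}

constexpr int XMM0 = 0;
constexpr int XMM1 = 1;

// Compare modelled as a recognised idiom: it waits on no register.
std::vector<Insn> bodyWithIdiom() {
    return {{XMM0, {}},
            {XMM0, {XMM0, XMM1}},
            {XMM0, {XMM0, XMM1}},
            {XMM0, {XMM0, XMM1}}};
}

// Compare modelled as an ordinary instruction that reads xmm0.
std::vector<Insn> bodyWithoutIdiom() {
    return {{XMM0, {XMM0}},
            {XMM0, {XMM0, XMM1}},
            {XMM0, {XMM0, XMM1}},
            {XMM0, {XMM0, XMM1}}};
}
```

In this model `loopCarriedLatency(bodyWithoutIdiom(), 2)` yields 4, in line with the 4 cycles/iteration quoted above, while `loopCarriedLatency(bodyWithIdiom(), 2)` yields 0, meaning the iterations can overlap and throughput is then bounded by resources the model ignores.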

-------------

PR: https://git.openjdk.org/jdk/pull/7832