RFR: 8283232: x86: Improve vector broadcast operations [v8]
Jatin Bhateja
jbhateja at openjdk.org
Thu Jul 28 18:20:00 UTC 2022
On Tue, 26 Jul 2022 12:48:16 GMT, Quan Anh Mai <duke at openjdk.org> wrote:
>> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4388:
>>
>>> 4386:
>>> 4387: void MacroAssembler::vallones(XMMRegister dst, int vector_len) {
>>> 4388: // vpcmpeqd has special dependency treatment so it should be preferred to vpternlogd
>>
>> The comment is not clear; adding a relevant reference would add more value.
>
> I have re-measured this. It seems that only the non-VEX encoding receives the special dependency treatment, so I reverted this change and added a comment for clarification.
>
> The optimisation is described in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) for several architectures, for example in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline).
>
> I have performed measurements on uica.uops.info. This sequence gives 1.37 cycles/iteration on Skylake and Icelake:
>
> pcmpeqd xmm0, xmm0
> paddd xmm0, xmm1
> paddd xmm0, xmm1
> paddd xmm0, xmm1
>
> This version has a throughput of 4 cycles/iteration:
>
> vpcmpeqd xmm0, xmm0, xmm0
> vpaddd xmm0, xmm1, xmm0
> vpaddd xmm0, xmm1, xmm0
> vpaddd xmm0, xmm1, xmm0
>
> This indicates that `vpcmpeqd` fails to break the dependency on `xmm0`, unlike the `pcmpeqd` instruction.
>
> Thanks.
Both of the above JIT sequences have a true dependency chain. Dependency-breaking idioms exist to remove false dependencies imposed by the architecture, and here there is no scope for such a false dependency to cause any further performance degradation.
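
For readers following the thread, here is a toy model of the two quoted sequences. It is only a sketch, not a simulation of uiCA or of any real core. It assumes a latency of 1 cycle for every instruction, and it treats "the first instruction is dependency-breaking" as an input rather than as an established fact. The names `Insn`, `loop_carried_latency` and `make_sequence` are hypothetical and do not come from HotSpot.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// One instruction of the loop body (hypothetical model type).
struct Insn {
    int dst;                 // destination register number (0..15)
    std::vector<int> srcs;   // source register numbers (0..15)
    int latency;             // ASSUMED latency in cycles
    bool breaks_dependency;  // ASSUMPTION: result is treated as independent of srcs
};

// Length, in cycles, of the chain that leads from the previous iteration's
// value of `reg` to this iteration's value of `reg`.
// Returns 0 when the new value does not depend on the old one.
int loop_carried_latency(const std::vector<Insn>& body, int reg) {
    std::array<int, 16> chain;
    chain.fill(-1);   // -1: value does not depend on the previous iteration
    chain[reg] = 0;   // the old value of `reg` is the start of the chain
    for (const Insn& insn : body) {
        assert(insn.latency > 0 && insn.dst >= 0 && insn.dst < 16);
        int longest = -1;
        if (!insn.breaks_dependency) {
            for (int src : insn.srcs) {
                assert(src >= 0 && src < 16);
                longest = std::max(longest, chain[src]);
            }
        }
        chain[insn.dst] = (longest >= 0) ? longest + insn.latency : -1;
    }
    return std::max(chain[reg], 0);
}

// The quoted four-instruction body: a compare of xmm0 with itself,
// followed by three adds of xmm1 into xmm0.
std::vector<Insn> make_sequence(bool first_breaks_dependency) {
    const int xmm0 = 0, xmm1 = 1;
    return {
        {xmm0, {xmm0, xmm0}, 1, first_breaks_dependency},  // (v)pcmpeqd
        {xmm0, {xmm0, xmm1}, 1, false},                    // (v)paddd
        {xmm0, {xmm0, xmm1}, 1, false},                    // (v)paddd
        {xmm0, {xmm0, xmm1}, 1, false},                    // (v)paddd
    };
}
```

Under these assumptions, `loop_carried_latency(make_sequence(false), 0)` is 4, which matches the quoted 4 cycles/iteration, and `loop_carried_latency(make_sequence(true), 0)` is 0. In that second case the iterations are independent, and throughput is bounded by resources that this model does not cover, so it cannot reproduce the 1.37 figure.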
-------------
PR: https://git.openjdk.org/jdk/pull/7832