RFR: 8283232: x86: Improve vector broadcast operations [v8]
Quan Anh Mai
duke at openjdk.org
Fri Jul 29 03:47:34 UTC 2022
On Thu, 28 Jul 2022 18:17:27 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> I have remeasured this. It seems that only the non-VEX encoded version receives the special dependency treatment, so I reverted this change and added a comment for clarification.
>>
>> The optimisation is described in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) for several architectures, for example in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline).
>>
>> I have performed measurements on uica.uops.info. On Skylake and Icelake, this sequence gives 1.37 cycles/iteration:
>>
>> pcmpeqd xmm0, xmm0
>> paddd xmm0, xmm1
>> paddd xmm0, xmm1
>> paddd xmm0, xmm1
>>
>> This version has a throughput of 4 cycles/iteration:
>>
>> vpcmpeqd xmm0, xmm0, xmm0
>> vpaddd xmm0, xmm1, xmm0
>> vpaddd xmm0, xmm1, xmm0
>> vpaddd xmm0, xmm1, xmm0
>>
>> This indicates that `vpcmpeqd` fails to break the dependency on `xmm0`, unlike the `pcmpeqd` instruction.
>>
>> Thanks.
>
> Both of the above JIT sequences have a true dependency chain; there is no scope for any additional architecture-imposed false dependency to cause further perf degradation, which is the case we use dep-breaking idioms for.
I'm sorry, I don't quite understand what you mean here. What I meant is that while `pcmpeqd xmmk, xmmk` is a dep-breaking idiom, `vpcmpeqd xmmk, xmmk, xmmk` does not seem to be one. As a result, I reverted that change, and in this context the only remaining change is the branch I added for non-AVX machines. Please review this patch. Thank you very much.
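
For illustration, the two measurements are roughly what a simple bound model predicts. The sketch below is only a back-of-the-envelope model, not what uica does: it assumes every instruction is a single uop with 1 cycle latency, that each uop can issue on any of 3 vector ALU ports, and that `xmm1` is loop-invariant. The names `Instr`, `carried_chain_latency` and `estimate_cycles` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str
    sources: tuple
    latency: int = 1                 # assumption: 1 cycle for every instruction here
    breaks_dependency: bool = False  # True if the renamer ignores the source registers


def carried_chain_latency(body, carried_reg):
    """Length in cycles of the dependency chain that enters and leaves one
    iteration through carried_reg; 0 if the chain is broken inside the body."""
    depth = {carried_reg: 0}  # None means "does not depend on the carried input"
    for ins in body:
        if ins.breaks_dependency:
            inputs = []
        else:
            inputs = [depth[s] for s in ins.sources if depth.get(s) is not None]
        depth[ins.dest] = max(inputs) + ins.latency if inputs else None
    return depth.get(carried_reg) or 0


def estimate_cycles(body, carried_reg, alu_ports=3):
    """Cycles/iteration as the larger of the latency bound and the port bound."""
    port_bound = len(body) / alu_ports  # assumption: one uop per instruction
    return max(carried_chain_latency(body, carried_reg), port_bound)


# pcmpeqd xmm0, xmm0 treated as dep-breaking, followed by 3x paddd xmm0, xmm1
sse_body = [Instr("xmm0", ("xmm0", "xmm0"), breaks_dependency=True)] \
    + [Instr("xmm0", ("xmm0", "xmm1"))] * 3

# vpcmpeqd xmm0, xmm0, xmm0 treated as NOT dep-breaking, followed by 3x vpaddd
avx_body = [Instr("xmm0", ("xmm0", "xmm0"))] \
    + [Instr("xmm0", ("xmm1", "xmm0"))] * 3

print(f"non-VEX estimate: {estimate_cycles(sse_body, 'xmm0'):.2f} cycles/iteration")
print(f"VEX estimate:     {estimate_cycles(avx_body, 'xmm0'):.2f} cycles/iteration")
```

Under these assumptions the non-VEX loop is limited by ports (4 uops over 3 ports, about 1.33 cycles, close to the measured 1.37), while the VEX loop is limited by a 4 cycle chain through `xmm0`, which is what I would expect if `vpcmpeqd` still waits for the previous value of `xmm0`.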
-------------
PR: https://git.openjdk.org/jdk/pull/7832