RFR: 8283232: x86: Improve vector broadcast operations [v8]

Fri Jul 29 03:47:34 UTC 2022

On Thu, 28 Jul 2022 18:17:27 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> I have remeasured this. It seems that only the non-VEX-encoded version receives the special dependency treatment, so I reverted this change and added a comment for clarification.
>> 
>> The optimisation is noted in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) for several architectures, for example in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline).
>> 
>> I have performed measurements on uica.uops.info. This sequence gives 1.37 cycles/iteration on Skylake and Ice Lake:
>> 
>>     pcmpeqd xmm0, xmm0
>>     paddd xmm0, xmm1
>>     paddd xmm0, xmm1
>>     paddd xmm0, xmm1
>> 
>> In contrast, this version has a throughput of 4 cycles/iteration:
>> 
>>     vpcmpeqd xmm0, xmm0, xmm0
>>     vpaddd xmm0, xmm1, xmm0
>>     vpaddd xmm0, xmm1, xmm0
>>     vpaddd xmm0, xmm1, xmm0
>> 
>> This indicates that `vpcmpeqd` fails to break the dependency on `xmm0`, whereas `pcmpeqd` does.
>> 
>> Thanks.
>
> Both of the above JIT sequences have a true dependency chain, so there is no scope for any additional architecture-imposed false dependency causing further performance degradation, which is what dep-breaking idioms are used for.

I'm sorry, I don't quite understand what you mean here. What I meant is that while `pcmpeqd xmmk, xmmk` is a dep-breaking idiom, `vpcmpeqd xmmk, xmmk, xmmk` seems not to be. As a result, I reverted that change, and in this context the only change is that I added a branch for non-AVX machines. Please review this patch. Thank you very much.
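
To illustrate the reasoning, here is a rough Python sketch of why the two quoted sequences could differ so much when run in a loop. It is a toy model, not what uiCA does: it assumes 1 cycle of latency per instruction, 3 vector ALU ports, and a single loop-carried register, and the `breaks_dependency` flag is a hypothetical way of marking the idiom that the renamer treats as independent of the old register value.

```python
from fractions import Fraction

# Each entry: (mnemonic, destination, sources, breaks_dependency).
# breaks_dependency=True is an assumption that models the renamer treating
# the result as independent of the previous value of the register.
SSE = [
    ("pcmpeqd", "xmm0", ["xmm0", "xmm0"], True),
    ("paddd", "xmm0", ["xmm0", "xmm1"], False),
    ("paddd", "xmm0", ["xmm0", "xmm1"], False),
    ("paddd", "xmm0", ["xmm0", "xmm1"], False),
]
AVX = [
    ("vpcmpeqd", "xmm0", ["xmm0", "xmm0"], False),
    ("vpaddd", "xmm0", ["xmm1", "xmm0"], False),
    ("vpaddd", "xmm0", ["xmm1", "xmm0"], False),
    ("vpaddd", "xmm0", ["xmm1", "xmm0"], False),
]


def loop_carried_latency(instrs, latency=1):
    """Cycles one iteration adds to the chain carried into the next one."""
    # Only registers written inside the loop can carry a dependency;
    # registers that are only read (xmm1 here) are loop-invariant.
    written = {dest for _, dest, _, _ in instrs}
    # depth[r]: chain length from the previous iteration's values to r,
    # or None if r no longer depends on the previous iteration.
    depth = {r: 0 for r in written}
    for _, dest, srcs, breaks in instrs:
        inputs = [] if breaks else [
            depth[s] for s in srcs if depth.get(s) is not None
        ]
        depth[dest] = max(inputs) + latency if inputs else None
    return max((d for d in depth.values() if d is not None), default=0)


def cycles_per_iteration(instrs, ports=3):
    """Larger of the loop-carried latency bound and the port-pressure bound."""
    port_bound = Fraction(len(instrs), ports)
    return max(Fraction(loop_carried_latency(instrs)), port_bound)
```

Under these assumptions, `cycles_per_iteration(SSE)` evaluates to 4/3 (port-bound, in the same range as the measured 1.37) and `cycles_per_iteration(AVX)` evaluates to 4 (latency-bound, because every iteration has to wait for the previous `xmm0`).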

-------------

PR: https://git.openjdk.org/jdk/pull/7832