RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Dean Long dlong at openjdk.java.net
Fri Apr 9 05:05:15 UTC 2021


On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <eliu at openjdk.org> wrote:

>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>> 
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. 
>> 
>> As a consequence, in an auto-vectorized loop it will lead to:
>> - 1 instruction per loop iteration (multiplied by unrolling factor);
>> - no way to hoist the negation of loop invariant index.
>
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>> 
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>> 
>> As a consequence, in an auto-vectorized loop it will lead to:
>> 
>> * 1 instruction per loop iteration (multiplied by unrolling factor);
>> * no way to hoist the negation of loop invariant index.
> 
> Thanks for your feedback!
> 
> I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.
> 
> Before:
> 0xf46c8338:   add fp, r7, fp
> 0xf46c833c:   vshl.u16  d9, d9, d8
> 0xf46c8340:   vstr  d9, [fp, #12]
> 0xf46c8344:   vshl.u16  d9, d10, d8
> 0xf46c8348:   vstr  d9, [fp, #20]
> 0xf46c834c:   vshl.u16  d9, d11, d8
> 0xf46c8350:   vstr  d9, [fp, #28]
> 
> After:
> 0xf4aa1328:   add fp, r7, fp
> 0xf4aa132c:   vneg.s8 d13, d8
> 0xf4aa1330:   vshl.u16  d9, d9, d13
> 0xf4aa1334:   vstr  d9, [fp, #12]
> 0xf4aa1338:   vneg.s8 d9, d8
> 0xf4aa133c:   vshl.u16  d9, d10, d9
> 0xf4aa1340:   vstr  d9, [fp, #20]
> 0xf4aa1344:   vneg.s8 d9, d8
> 0xf4aa1348:   vshl.u16  d9, d11, d9
> 0xf4aa134c:   vstr  d9, [fp, #28]  
> 
> I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86  to what AArch32 dose, or merging them and accept this defect on AArch32.

@theRealELiu @iwanowww Yes, but I was thinking of the `vshiftcnt` rules and didn't realize the `replicate` rules are virtually identical.  Now that I look again, it appears that aarch64 is already doing the same as x86 (converting the scalar shift count to a vector register, and matching the vector register in the rule).  So why is it that x86 can share the shift register, but aarch64 cannot?  Is it because x86 has combined left- and right-shift into the same rule?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371


More information about the hotspot-compiler-dev mailing list