RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Thu Apr 8 13:41:20 UTC 2021

On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <eliu at openjdk.org> wrote:

>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>> 
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. 
>> 
>> As a consequence, in an auto-vectorized loop it will lead to:
>> - 1 instruction per loop iteration (multiplied by unrolling factor);
>> - no way to hoist the negation of loop invariant index.
>
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>> 
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>> 
>> As a consequence, in an auto-vectorized loop it will lead to:
>> 
>> * 1 instruction per loop iteration (multiplied by unrolling factor);
>> * no way to hoist the negation of loop invariant index.
> 
> Thanks for your feedback!
> 
> I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.
> 
> Before:
> 0xf46c8338:   add fp, r7, fp
> 0xf46c833c:   vshl.u16  d9, d9, d8
> 0xf46c8340:   vstr  d9, [fp, #12]
> 0xf46c8344:   vshl.u16  d9, d10, d8
> 0xf46c8348:   vstr  d9, [fp, #20]
> 0xf46c834c:   vshl.u16  d9, d11, d8
> 0xf46c8350:   vstr  d9, [fp, #28]
> 
> After:
> 0xf4aa1328:   add fp, r7, fp
> 0xf4aa132c:   vneg.s8 d13, d8
> 0xf4aa1330:   vshl.u16  d9, d9, d13
> 0xf4aa1334:   vstr  d9, [fp, #12]
> 0xf4aa1338:   vneg.s8 d9, d8
> 0xf4aa133c:   vshl.u16  d9, d10, d9
> 0xf4aa1340:   vstr  d9, [fp, #20]
> 0xf4aa1344:   vneg.s8 d9, d8
> 0xf4aa1348:   vshl.u16  d9, d11, d9
> 0xf4aa134c:   vstr  d9, [fp, #28]  
> 
> I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86  to what AArch32 dose, or merging them and accept this defect on AArch32.

`LShiftCntV`/`RShiftCntV` were added specifically for AArch32 and other platforms don't need/benefit from such separation.

Ultimately, I'd like to hear what AArch32 port maintainers think about it, but if there are concerns about performance impact expressed, as an alternative a platform-specific logic can be introduced that adjusts Ideal IR shape for `URShiftV`/`RShiftV` nodes accordingly.

(Not sure whether it is a good trade-off w.r.t. code complexity though.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371