RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node
Eric Liu
eliu at openjdk.java.net
Fri Apr 9 07:35:15 UTC 2021
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <eliu at openjdk.org> wrote:
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>> - 1 instruction per loop iteration (multiplied by unrolling factor);
>> - no way to hoist the negation of loop invariant index.
>
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>>
>> * 1 instruction per loop iteration (multiplied by unrolling factor);
>> * no way to hoist the negation of loop invariant index.
>
> Thanks for your feedback!
>
> I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.
>
> Before:
> 0xf46c8338: add fp, r7, fp
> 0xf46c833c: vshl.u16 d9, d9, d8
> 0xf46c8340: vstr d9, [fp, #12]
> 0xf46c8344: vshl.u16 d9, d10, d8
> 0xf46c8348: vstr d9, [fp, #20]
> 0xf46c834c: vshl.u16 d9, d11, d8
> 0xf46c8350: vstr d9, [fp, #28]
>
> After:
> 0xf4aa1328: add fp, r7, fp
> 0xf4aa132c: vneg.s8 d13, d8
> 0xf4aa1330: vshl.u16 d9, d9, d13
> 0xf4aa1334: vstr d9, [fp, #12]
> 0xf4aa1338: vneg.s8 d9, d8
> 0xf4aa133c: vshl.u16 d9, d10, d9
> 0xf4aa1340: vstr d9, [fp, #20]
> 0xf4aa1344: vneg.s8 d9, d8
> 0xf4aa1348: vshl.u16 d9, d11, d9
> 0xf4aa134c: vstr d9, [fp, #28]
>
> I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86 to what AArch32 dose, or merging them and accept this defect on AArch32.
> @theRealELiu @iwanowww Yes, but I was thinking of the `vshiftcnt` rules and didn't realize the `replicate` rules are virtually identical. Now that I look again, it appears that aarch64 is already doing the same as x86 (converting the scalar shift count to a vector register, and matching the vector register in the rule). So why is it that x86 can share the shift register, but aarch64 cannot? Is it because x86 has combined left- and right-shift into the same rule?
I suppose x86 has the different kinds of vector shift instructions[1], but AArch64 and AArch32 only have the left one. For this reason, arm should use a "neg-leftshift" pair to implement right shift.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1234
-------------
PR: https://git.openjdk.java.net/jdk/pull/3371
More information about the hotspot-compiler-dev
mailing list