RFR: 8265263: AArch64: Combine vneg with right shift count
Andrew Dinn
adinn at openjdk.java.net
Mon Mar 7 11:12:04 UTC 2022
On Mon, 7 Mar 2022 08:46:12 GMT, Hao Sun <haosun at openjdk.org> wrote:
> ### Implementation
>
> In AArch64 NEON, vector shift right is implemented via the vector shift
> left instructions (SSHL[1] and USHL[2]) with a negative shift count. In
> the C2 backend, we generate a `neg` of the given shift value followed by
> an `sshl` or `ushl` instruction.
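To make the SSHL semantics concrete, here is a small scalar model of a single lane (the helper name `sshlLane` is hypothetical, not JDK or patch code): a positive count shifts left, a negative count shifts right, which is why a right shift needs the count negated first.

```java
public class SshlDemo {
    // Hypothetical scalar model of one NEON SSHL lane: a positive count
    // shifts the lane left, a negative count shifts it (arithmetically)
    // right by the count's magnitude.
    static int sshlLane(int value, int count) {
        return count >= 0 ? value << count : value >> -count;
    }

    public static void main(String[] args) {
        System.out.println(sshlLane(16, 2));   // 64  (shift left)
        System.out.println(sshlLane(16, -2));  // 4   (shift right via negated count)
        System.out.println(sshlLane(-16, -2)); // -4  (arithmetic: sign preserved)
    }
}
```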
>
> For vector shift right, the vector shift count has two origins:
> 1) it can be duplicated from a scalar variable/immediate (case-1),
> 2) it can be loaded directly from a vector (case-2).
>
> This patch aims to optimize case-1. Specifically, we move the negate
> from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
> hoisted outside of the loop if it's a loop invariant.
>
> In this patch,
> 1) we split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to
> handle shift left and shift right respectively. Unlike the vslcnt*
> rules, the vsrcnt* rules perform the negate.
> 2) for each of the vsra* and vsrl* rules, we create one variant, i.e.
> vsra*_var and vsrl*_var. We use the vsra* and vsrl* rules to handle
> case-1, and the vsra*_var and vsrl*_var rules to handle case-2. Note
> that ShiftVNode::is_var_shift() can be used to distinguish case-1 from
> case-2.
> 3) we add one assertion for the vs*_imm rules as we have done on
> ARM32[3].
> 4) several style issues are resolved.
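At the Java source level, the two cases above can be illustrated as follows (a sketch with hypothetical method names, not the patch's benchmark code): case-1 applies one scalar count to every element, so the negated count can be materialized once outside the loop, while case-2 shifts each element by its own count loaded from another array.

```java
public class ShiftCases {
    // Case-1: a single scalar count for all elements; C2 can hoist the
    // negated, duplicated count (RShiftCntV) out of the loop.
    static void shiftScalarCount(int[] src, int[] dst, int count) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] >> count;
        }
    }

    // Case-2: a per-element count loaded from a second array; the counts
    // themselves form a vector, so no single hoistable negate exists.
    static void shiftVectorCount(int[] src, int[] dst, int[] counts) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] >> counts[i];
        }
    }

    public static void main(String[] args) {
        int[] src = {64, -64, 256, -256};
        int[] dst = new int[4];
        shiftScalarCount(src, dst, 2);
        System.out.println(java.util.Arrays.toString(dst)); // [16, -16, 64, -64]
        shiftVectorCount(src, dst, new int[]{1, 2, 3, 4});
        System.out.println(java.util.Arrays.toString(dst)); // [32, -16, 32, -16]
    }
}
```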
>
> ### Example
>
> Take function `rShiftInt()` in the newly added micro benchmark
> VectorShiftRight.java as an example.
>
>
> public void rShiftInt() {
>     for (int i = 0; i < SIZE; i++) {
>         intsB[i] = intsA[i] >> count;
>     }
> }
>
>
> Arithmetic shift right is conducted inside a big loop. The following
> snippet shows the disassembly generated by auto-vectorization before
> applying the current patch. We can see that the `neg` is conducted in
> the loop body.
>
>
> 0x0000ffff89057a64: dup v16.16b, w13 <-- dup
> 0x0000ffff89057a68: mov w12, #0x7d00 // #32000
> 0x0000ffff89057a6c: sub w13, w2, w10
> 0x0000ffff89057a70: cmp w2, w10
> 0x0000ffff89057a74: csel w13, wzr, w13, lt
> 0x0000ffff89057a78: mov w8, #0x7d00 // #32000
> 0x0000ffff89057a7c: cmp w13, w8
> 0x0000ffff89057a80: csel w13, w12, w13, hi
> 0x0000ffff89057a84: add w14, w13, w10
> 0x0000ffff89057a88: nop
> 0x0000ffff89057a8c: nop
> 0x0000ffff89057a90: sbfiz x13, x10, #2, #32 <-- loop entry
> 0x0000ffff89057a94: add x15, x17, x13
> 0x0000ffff89057a98: ldr q17, [x15,#16]
> 0x0000ffff89057a9c: add x13, x0, x13
> 0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg
> 0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right
> 0x0000ffff89057aa8: str q17, [x13,#16]
> 0x0000ffff89057aac: ...
> 0x0000ffff89057b1c: add w10, w10, #0x20
> 0x0000ffff89057b20: cmp w10, w14
> 0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end
>
>
> Here is the disassembly after applying the current patch. The negate is
> no longer conducted inside the loop; it has been hoisted outside.
>
>
> 0x0000ffff8d053a68: neg w14, w13 <---- neg
> 0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup
> 0x0000ffff8d053a70: sub w14, w2, w10
> 0x0000ffff8d053a74: cmp w2, w10
> 0x0000ffff8d053a78: csel w14, wzr, w14, lt
> 0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000
> 0x0000ffff8d053a80: cmp w14, w8
> 0x0000ffff8d053a84: csel w14, w12, w14, hi
> 0x0000ffff8d053a88: add w13, w14, w10
> 0x0000ffff8d053a8c: nop
> 0x0000ffff8d053a90: sbfiz x14, x10, #2, #32 <-- loop entry
> 0x0000ffff8d053a94: add x15, x17, x14
> 0x0000ffff8d053a98: ldr q17, [x15,#16]
> 0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right
> 0x0000ffff8d053aa0: add x14, x0, x14
> 0x0000ffff8d053aa4: str q17, [x14,#16]
> 0x0000ffff8d053aa8: ...
> 0x0000ffff8d053afc: add w10, w10, #0x20
> 0x0000ffff8d053b00: cmp w10, w13
> 0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end
>
>
> ### Testing
>
> Tier 1~3 tests passed on the Linux/AArch64 platform.
>
> ### Performance Evaluation
>
> - Auto-vectorization
>
> One micro benchmark, VectorShiftRight.java, is added by this patch in
> order to evaluate the optimization on vector shift right.
>
> The following table shows the result. Column `Score-1` shows the score
> before applying the current patch, and column `Score-2` shows the score
> after applying it.
>
> We observe about a 30% ~ 53% improvement on these microbenchmarks.
>
>
> Benchmark Units Score-1 Score-2
> VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353
> VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941
> VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247
> VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762
> VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*)
> VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478
> VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928
> VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991
>
> *: Logical shift right for the Byte type (urShiftByte) is not
> vectorized, as discussed in [4].
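As background for why `urShiftByte` resists vectorization, a brief sketch (method names hypothetical): Java promotes the byte operand to int with sign extension before `>>>` applies, so the source expression is not a simple per-lane logical shift; the lane-wise result needs an explicit mask.

```java
public class ByteUrShift {
    // What the Java source computes: the byte is sign-extended to int
    // first, so >>> pulls in high bits a per-lane byte shift never sees.
    static int promotedUrShift(byte b, int s) {
        return b >>> s;
    }

    // The lane-wise result a per-byte logical vector shift would produce.
    static byte laneUrShift(byte b, int s) {
        return (byte) ((b & 0xFF) >>> s);
    }

    public static void main(String[] args) {
        byte b = (byte) -64; // bit pattern 1100_0000
        System.out.println(promotedUrShift(b, 2)); // 1073741808 (0x3FFFFFF0)
        System.out.println(laneUrShift(b, 2));     // 48 (0x30)
    }
}
```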
>
>
> - VectorAPI
>
> Furthermore, we also evaluate the impact of this patch on VectorAPI
> benchmarks, e.g. [5]. Details can be found in the table below. Columns
> `Score-1` and `Score-2` show the scores before and after applying the
> current patch.
>
>
> Benchmark Units Score-1 Score-2
> Byte128Vector.LSHL ops/ms 10867.666 10873.993
> Byte128Vector.LSHLShift ops/ms 10945.729 10945.741
> Byte128Vector.LSHR ops/ms 8629.305 8629.343
> Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <--
> Byte128Vector.ASHR ops/ms 8619.691 8629.438
> Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <--
> Int128Vector.LSHL ops/ms 3104.213 3103.702
> Int128Vector.LSHLShift ops/ms 3114.354 3114.371
> Int128Vector.LSHR ops/ms 2380.717 2380.693
> Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <--
> Int128Vector.ASHR ops/ms 2380.668 2380.647
> Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <--
> Long128Vector.LSHL ops/ms 1586.907 1587.591
> Long128Vector.LSHLShift ops/ms 1589.469 1589.540
> Long128Vector.LSHR ops/ms 1209.754 1209.687
> Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <--
> Long128Vector.ASHR ops/ms 1209.713 1209.669
> Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <--
> Short128Vector.LSHL ops/ms 5945.542 5943.770
> Short128Vector.LSHLShift ops/ms 5984.743 5984.640
> Short128Vector.LSHR ops/ms 4613.378 4613.577
> Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <--
> Short128Vector.ASHR ops/ms 4613.389 4613.478
> Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <--
>
>
> 1) For the logical shift left (LSHL and LSHLShift) cases and the shift
> right with variable vector shift count (LSHR and ASHR) cases, we did
> not observe significant changes, which is expected.
>
> 2) For the shift right with scalar shift count (LSHRShift and
> ASHRShift) cases, an improvement of about 25% ~ 30% can be observed;
> this benefit is introduced by the current patch.
>
> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
> [2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
> [3] https://github.com/openjdk/jdk18/pull/41
> [4] https://github.com/openjdk/jdk/pull/1087
> [5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
This looks fine. Very nice work.
-------------
Marked as reviewed by adinn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7724
More information about the hotspot-compiler-dev
mailing list