RFR: 8265263: AArch64: Combine vneg with right shift count
Andrew Dinn
adinn at openjdk.java.net
Mon Mar 7 11:12:04 UTC 2022
On Mon, 7 Mar 2022 08:46:12 GMT, Hao Sun <haosun at openjdk.org> wrote:
> ### Implementation
>
> In AArch64 NEON, vector shift right is implemented via the vector shift
> left instructions (SSHL[1] and USHL[2]) with a negative shift count. In
> the C2 backend, we generate a `neg` of the given shift value followed by
> an `sshl` or `ushl` instruction.
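To make the SSHL semantics concrete, here is a small scalar model of a single lane (the helper name `sshlLane` is hypothetical, not JDK or patch code): a positive count shifts left, a negative count shifts right, which is why a right shift needs the count negated first.

```java
public class SshlDemo {
    // Hypothetical scalar model of one NEON SSHL lane: a positive count
    // shifts the lane left, a negative count shifts it (arithmetically)
    // right by the count's magnitude.
    static int sshlLane(int value, int count) {
        return count >= 0 ? value << count : value >> -count;
    }

    public static void main(String[] args) {
        System.out.println(sshlLane(16, 2));   // 64  (shift left)
        System.out.println(sshlLane(16, -2));  // 4   (shift right via negated count)
        System.out.println(sshlLane(-16, -2)); // -4  (arithmetic: sign preserved)
    }
}
```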
>
> For vector shift right, the vector shift count has two origins:
> 1) it can be duplicated from a scalar variable/immediate (case-1),
> 2) it can be loaded directly from a vector (case-2).
>
> This patch aims to optimize case-1. Specifically, we move the negate
> from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
> hoisted outside of the loop if it's a loop invariant.
>
> In this patch,
> 1) we split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to
> handle shift left and shift right respectively. Unlike the vslcnt*
> rules, the vsrcnt* rules perform the negate.
> 2) for each of the vsra* and vsrl* rules, we create one variant, i.e.
> vsra*_var and vsrl*_var. We use the vsra* and vsrl* rules to handle
> case-1, and the vsra*_var and vsrl*_var rules to handle case-2. Note
> that ShiftVNode::is_var_shift() can be used to distinguish case-1 from
> case-2.
> 3) we add one assertion for the vs*_imm rules as we have done on
> ARM32[3].
> 4) several style issues are resolved.
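At the Java source level, the two cases above can be illustrated as follows (a sketch with hypothetical method names, not the patch's benchmark code): case-1 applies one scalar count to every element, so the negated count can be materialized once outside the loop, while case-2 shifts each element by its own count loaded from another array.

```java
public class ShiftCases {
    // Case-1: a single scalar count for all elements; C2 can hoist the
    // negated, duplicated count (RShiftCntV) out of the loop.
    static void shiftScalarCount(int[] src, int[] dst, int count) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] >> count;
        }
    }

    // Case-2: a per-element count loaded from a second array; the counts
    // themselves form a vector, so no single hoistable negate exists.
    static void shiftVectorCount(int[] src, int[] dst, int[] counts) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] >> counts[i];
        }
    }

    public static void main(String[] args) {
        int[] src = {64, -64, 256, -256};
        int[] dst = new int[4];
        shiftScalarCount(src, dst, 2);
        System.out.println(java.util.Arrays.toString(dst)); // [16, -16, 64, -64]
        shiftVectorCount(src, dst, new int[]{1, 2, 3, 4});
        System.out.println(java.util.Arrays.toString(dst)); // [32, -16, 32, -16]
    }
}
```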
>
> ### Example
>
> Take function `rShiftInt()` in the newly added micro benchmark
> VectorShiftRight.java as an example.
>
>
> public void rShiftInt() {
>     for (int i = 0; i < SIZE; i++) {
>         intsB[i] = intsA[i] >> count;
>     }
> }
>
>
> Arithmetic shift right is conducted inside a big loop. The following
> snippet shows the disassembly generated by auto-vectorization before
> applying the current patch. We can see that the `neg` is conducted in
> the loop body.
>
>
> 0x0000ffff89057a64: dup v16.16b, w13 <-- dup
> 0x0000ffff89057a68: mov w12, #0x7d00 // #32000
> 0x0000ffff89057a6c: sub w13, w2, w10
> 0x0000ffff89057a70: cmp w2, w10
> 0x0000ffff89057a74: csel w13, wzr, w13, lt
> 0x0000ffff89057a78: mov w8, #0x7d00 // #32000
> 0x0000ffff89057a7c: cmp w13, w8
> 0x0000ffff89057a80: csel w13, w12, w13, hi
> 0x0000ffff89057a84: add w14, w13, w10
> 0x0000ffff89057a88: nop
> 0x0000ffff89057a8c: nop
> 0x0000ffff89057a90: sbfiz x13, x10, #2, #32 <-- loop entry
> 0x0000ffff89057a94: add x15, x17, x13
> 0x0000ffff89057a98: ldr q17, [x15,#16]
> 0x0000ffff89057a9c: add x13, x0, x13
> 0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg
> 0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right
> 0x0000ffff89057aa8: str q17, [x13,#16]
> 0x0000ffff89057aac: ...
> 0x0000ffff89057b1c: add w10, w10, #0x20
> 0x0000ffff89057b20: cmp w10, w14
> 0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end
>
>
> Here is the disassembly after applying the current patch. The negate is
> no longer conducted inside the loop; it has been hoisted outside.
>
>
> 0x0000ffff8d053a68: neg w14, w13 <---- neg
> 0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup
> 0x0000ffff8d053a70: sub w14, w2, w10
> 0x0000ffff8d053a74: cmp w2, w10
> 0x0000ffff8d053a78: csel w14, wzr, w14, lt
> 0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000
> 0x0000ffff8d053a80: cmp w14, w8
> 0x0000ffff8d053a84: csel w14, w12, w14, hi
> 0x0000ffff8d053a88: add w13, w14, w10
> 0x0000ffff8d053a8c: nop
> 0x0000ffff8d053a90: sbfiz x14, x10, #2, #32 <-- loop entry
> 0x0000ffff8d053a94: add x15, x17, x14
> 0x0000ffff8d053a98: ldr q17, [x15,#16]
> 0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right
> 0x0000ffff8d053aa0: add x14, x0, x14
> 0x0000ffff8d053aa4: str q17, [x14,#16]
> 0x0000ffff8d053aa8: ...
> 0x0000ffff8d053afc: add w10, w10, #0x20
> 0x0000ffff8d053b00: cmp w10, w13
> 0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end
>
>
> ### Testing
>
> Tier 1~3 tests passed on the Linux/AArch64 platform.
>
> ### Performance Evaluation
>
> - Auto-vectorization
>
> One micro benchmark, VectorShiftRight.java, is added by this patch in
> order to evaluate the optimization on vector shift right.
>
> The following table shows the result. Column `Score-1` shows the score
> before applying the current patch, and column `Score-2` shows the score
> after applying it.
>
> We observe about a 30% ~ 53% improvement on these microbenchmarks.
>
>
> Benchmark Units Score-1 Score-2
> VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353
> VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941
> VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247
> VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762
> VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*)
> VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478
> VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928
> VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991
>
> *: Logical shift right for the Byte type (urShiftByte) is not
> vectorized, as discussed in [4].
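As background for why `urShiftByte` resists vectorization, a brief sketch (method names hypothetical): Java promotes the byte operand to int with sign extension before `>>>` applies, so the source expression is not a simple per-lane logical shift; the lane-wise result needs an explicit mask.

```java
public class ByteUrShift {
    // What the Java source computes: the byte is sign-extended to int
    // first, so >>> pulls in high bits a per-lane byte shift never sees.
    static int promotedUrShift(byte b, int s) {
        return b >>> s;
    }

    // The lane-wise result a per-byte logical vector shift would produce.
    static byte laneUrShift(byte b, int s) {
        return (byte) ((b & 0xFF) >>> s);
    }

    public static void main(String[] args) {
        byte b = (byte) -64; // bit pattern 1100_0000
        System.out.println(promotedUrShift(b, 2)); // 1073741808 (0x3FFFFFF0)
        System.out.println(laneUrShift(b, 2));     // 48 (0x30)
    }
}
```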
>
>
> - VectorAPI
>
> Furthermore, we also evaluate the impact of this patch on VectorAPI
> benchmarks, e.g. [5]. Details can be found in the table below. Columns
> `Score-1` and `Score-2` show the scores before and after applying the
> current patch.
>
>
> Benchmark Units Score-1 Score-2
> Byte128Vector.LSHL ops/ms 10867.666 10873.993
> Byte128Vector.LSHLShift ops/ms 10945.729 10945.741
> Byte128Vector.LSHR ops/ms 8629.305 8629.343
> Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <--
> Byte128Vector.ASHR ops/ms 8619.691 8629.438
> Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <--
> Int128Vector.LSHL ops/ms 3104.213 3103.702
> Int128Vector.LSHLShift ops/ms 3114.354 3114.371
> Int128Vector.LSHR ops/ms 2380.717 2380.693
> Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <--
> Int128Vector.ASHR ops/ms 2380.668 2380.647
> Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <--
> Long128Vector.LSHL ops/ms 1586.907 1587.591
> Long128Vector.LSHLShift ops/ms 1589.469 1589.540
> Long128Vector.LSHR ops/ms 1209.754 1209.687
> Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <--
> Long128Vector.ASHR ops/ms 1209.713 1209.669
> Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <--
> Short128Vector.LSHL ops/ms 5945.542 5943.770
> Short128Vector.LSHLShift ops/ms 5984.743 5984.640
> Short128Vector.LSHR ops/ms 4613.378 4613.577
> Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <--
> Short128Vector.ASHR ops/ms 4613.389 4613.478
> Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <--
>
>
> 1) For the logical shift left (LSHL and LSHLShift) cases and the shift
> right with variable vector shift count (LSHR and ASHR) cases, we did
> not observe significant changes, which is expected.
>
> 2) For the shift right with scalar shift count (LSHRShift and
> ASHRShift) cases, an improvement of about 25% ~ 30% can be observed;
> this benefit is introduced by the current patch.
>
> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
> [2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
> [3] https://github.com/openjdk/jdk18/pull/41
> [4] https://github.com/openjdk/jdk/pull/1087
> [5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
This looks fine. Very nice work.
-------------
Marked as reviewed by adinn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7724
More information about the hotspot-compiler-dev
mailing list