RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]

Fri Apr 22 11:09:09 UTC 2022

> public short[] vectorUnsignedShiftRight(short[] shorts) {
>     short[] res = new short[SIZE];
>     for (int i = 0; i < SIZE; i++) {
>         res[i] = (short) (shorts[i] >>> 3);
>     }
>     return res;
> }
> 
> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It is worthwhile to vectorize more of these cases when the cost is low. Also, unsigned shift right on signed subword types is not uncommon, and similar cases can be found in the Lucene benchmark[2].
> 
> Take unsigned right shift on the short type as an example:
> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png)
> 
> When the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for the short type, as shown above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:
> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png)
> 
> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`, which makes the short case above vectorizable. Unsigned right shift on the byte type is handled in a similar way. The generated assembly code for one iteration on AArch64 looks like this:
> 
> ...
> sbfiz   x13, x10, #1, #32
> add     x15, x11, x13
> ldr     q16, [x15, #16]
> sshr    v16.8h, v16.8h, #3
> add     x13, x17, x13
> str     q16, [x13, #16]
> ...
> 
> 
> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. The average time per operation drops by roughly 70% to 88% with this patch.
> 
> The perf data on AArch64:
> Before the patch:
> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
> urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
> urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op
> 
> After the patch:
> Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
> urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
> urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op
> 
> The perf data on x64:
> Before the patch:
> Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
> urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
> urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op
> 
> After the patch:
> Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
> urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
> urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op
> 
> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
> [2] https://github.com/jpountz/decode-128-ints-benchmark/
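
The reasoning in the quoted description can be checked with a small scalar model. The sketch below is not taken from the patch. `laneLogicalShift` is a hypothetical helper that models a logical shift applied directly to a 16-bit lane, which is an assumption about how the vector form diverges from Java's int-promoted `>>>`. `signedShiftMatches` then checks exhaustively whether `>>>` and `>>` give the same narrowed result for every `short` at a given shift amount.

```java
// Scalar model only. The class and helper names are hypothetical, not HotSpot code.
final class SubwordShiftCheck {
    // Java semantics: the short is sign-extended to int, shifted, then narrowed.
    static short javaUnsignedShift(short s, int n) {
        return (short) (s >>> n);
    }

    // Same promotion and narrowing, but with an arithmetic (signed) shift.
    static short javaSignedShift(short s, int n) {
        return (short) (s >> n);
    }

    // Assumed model of a logical shift on a 16-bit lane: zeros enter at bit 15.
    static short laneLogicalShift(short s, int n) {
        return (short) ((s & 0xFFFF) >>> n);
    }

    // True if >>> and >> agree for every short value at shift amount n.
    static boolean signedShiftMatches(int n) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            short s = (short) v;
            if (javaUnsignedShift(s, n) != javaSignedShift(s, n)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        short s = -8;
        System.out.println(javaUnsignedShift(s, 3)); // -1
        System.out.println(laneLogicalShift(s, 3));  // 8191
        for (int n = 0; n <= 17; n++) {
            System.out.println("shift " + n + " matches: " + signedShiftMatches(n));
        }
    }
}
```

For shift amounts up to 16 the two forms agree on all 65536 inputs. At 17 the first zero bit shifted in by `>>>` reaches bit 15 of the narrowed result, so the forms diverge, which is consistent with the `shift <= 16` condition for short.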

Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:

 - Rewrite the scalar calculation to avoid inline

   Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723
 - Merge branch 'master' into fg8283307

   Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303
 - Remove related comments in some test files

   Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8
 - Merge branch 'master' into fg8283307

   Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a
 - 8283307: Vectorize unsigned shift right on signed subword types

   ```
   public short[] vectorUnsignedShiftRight(short[] shorts) {
       short[] res = new short[SIZE];
       for (int i = 0; i < SIZE; i++) {
           res[i] = (short) (shorts[i] >>> 3);
       }
       return res;
   }
   ```
   In C2's SLP, vectorization of unsigned shift right on signed
   subword types (byte/short) like the case above is intentionally
   disabled[1], because the vector unsigned shift on signed
   subword types behaves differently from the Java spec. It is
   worthwhile to vectorize more of these cases when the cost is
   low. Also, unsigned shift right on signed subword types is not
   uncommon, and similar cases can be found in the Lucene
   benchmark[2].

   Take unsigned right shift on the short type as an example:

   Short:
       | <- 16 bits  -> |  <- 16 bits ->  |
       | 1 1 1 ... 1  1 |      data       |

   When the shift amount is a constant not greater than the number
   of sign-extended bits (the 16 higher bits for the short type, as
   shown above), the unsigned shift on signed subword types can be
   transformed into a signed shift and hence becomes vectorizable.
   Here is the transformation:

   For T_SHORT (shift <= 16):
     src    RShiftCntV shift          src    RShiftCntV shift
      \      /                  ==>    \       /
      URShiftVS                         RShiftVS

   This patch does the transformation in SuperWord::implemented() and
   SuperWord::output(), which makes the short case above
   vectorizable. Unsigned right shift on the byte type is handled in
   a similar way. The generated assembly code for one iteration on
   AArch64 looks like this:
   ```
   ...
   sbfiz   x13, x10, #1, #32
   add     x15, x11, x13
   ldr     q16, [x15, #16]
   sshr    v16.8h, v16.8h, #3
   add     x13, x17, x13
   str     q16, [x13, #16]
   ...
   ```

   Here is the performance data for the micro-benchmark before and
   after this patch on both AArch64 and x64 machines. The average
   time per operation drops by roughly 70% to 88% with this patch.

   The perf data on AArch64:
   Before the patch:
   Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
   urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
   urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op

   After the patch:
   Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
   urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
   urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op

   The perf data on x64:
   Before the patch:
   Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
   urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
   urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op

   After the patch:
   Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
   urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
   urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op

   [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
   [2] https://github.com/jpountz/decode-128-ints-benchmark/

   Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
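
The benchmark scores in the commit message above are average times (`avgt`, ns/op), so lower is better. As a rough aid for reading them, the sketch below turns each before/after pair into a percentage reduction and a speedup factor. The scores are copied from the tables; the class and method names are hypothetical and not part of the patch or of JMH.

```java
// Arithmetic on the reported scores only. Names here are hypothetical.
final class UrShiftSpeedup {
    // Percentage by which the average time per operation dropped.
    static double reductionPercent(double beforeNs, double afterNs) {
        return (1.0 - afterNs / beforeNs) * 100.0;
    }

    // How many times faster the patched code ran.
    static double speedup(double beforeNs, double afterNs) {
        return beforeNs / afterNs;
    }

    public static void main(String[] args) {
        String[] names = {
            "AArch64 urShiftImmByte", "AArch64 urShiftImmShort",
            "x64 urShiftImmByte", "x64 urShiftImmShort"
        };
        // {before, after} in ns/op, taken from the tables in the commit message.
        double[][] scores = {
            {295.711, 45.111}, {284.559, 55.294},
            {361.374, 105.489}, {365.390, 43.400}
        };
        for (int i = 0; i < names.length; i++) {
            double before = scores[i][0];
            double after = scores[i][1];
            System.out.printf("%-24s %.1f%% less time, %.2fx faster%n",
                    names[i], reductionPercent(before, after), speedup(before, after));
        }
    }
}
```

The reductions span roughly 71% to 88% across the four measurements, with the x64 byte case at the low end.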

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7979/files
  - new: https://git.openjdk.java.net/jdk/pull/7979/files/907b14cb..1f0570a3

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=01-02

  Stats: 12620 lines in 905 files changed: 7681 ins; 1918 del; 3021 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7979.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7979/head:pull/7979

PR: https://git.openjdk.java.net/jdk/pull/7979