RFR: 8283307: Vectorize unsigned shift right on signed subword types [v2]
Fei Gao
fgao at openjdk.java.net
Wed Apr 20 01:25:53 UTC 2022
> public short[] vectorUnsignedShiftRight(short[] shorts) {
> short[] res = new short[SIZE];
> for (int i = 0; i < SIZE; i++) {
> res[i] = (short) (shorts[i] >>> 3);
> }
> return res;
> }
>
> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2].
>
> Taking unsigned right shift on short type as an example,
> 
>
> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like
> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:
> 
>
> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like:
>
> ...
> sbfiz x13, x10, #1, #32
> add x15, x11, x13
> ldr q16, [x15, #16]
> sshr v16.8h, v16.8h, #3
> add x13, x17, x13
> str q16, [x13, #16]
> ...
>
>
> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch.
>
> The perf data on AArch64:
> Before the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
>
> after the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
>
> The perf data on X86:
> Before the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
>
> After the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
>
> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
> [2] https://github.com/jpountz/decode-128-ints-benchmark/
Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
- Remove related comments in some test files
Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8
- Merge branch 'master' into fg8283307
Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a
- 8283307: Vectorize unsigned shift right on signed subword types
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
short[] res = new short[SIZE];
for (int i = 0; i < SIZE; i++) {
res[i] = (short) (shorts[i] >>> 3);
}
return res;
}
```
In C2's SLP, vectorization of unsigned shift right on signed
subword types (byte/short) like the case above is intentionally
disabled[1]. Because the vector unsigned shift on signed
subword types behaves differently from the Java spec. It's
worthy to vectorize more cases in quite low cost. Also,
unsigned shift right on signed subword is not uncommon and we
may find similar cases in Lucene benchmark[2].
Taking unsigned right shift on short type as an example,
Short:
| <- 16 bits -> | <- 16 bits -> |
| 1 1 1 ... 1 1 | data |
when the shift amount is a constant not greater than the number
of sign extended bits, 16 higher bits for short type shown like
above, the unsigned shift on signed subword types can be
transformed into a signed shift and hence becomes vectorizable.
Here is the transformation:
For T_SHORT (shift <= 16):
src RShiftCntV shift src RShiftCntV shift
\ / ==> \ /
URShiftVS RShiftVS
This patch does the transformation in SuperWord::implemented() and
SuperWord::output(). It helps vectorize the short cases above. We
can handle unsigned right shift on byte type in a similar way. The
generated assembly code for one iteration on aarch64 is like:
```
...
sbfiz x13, x10, #1, #32
add x15, x11, x13
ldr q16, [x15, #16]
sshr v16.8h, v16.8h, #3
add x13, x17, x13
str q16, [x13, #16]
...
```
Here is the performance data for micro-benchmark before and after
this patch on both AArch64 and x64 machines. We can observe about
~80% improvement with this patch.
The perf data on AArch64:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
after the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
The perf data on X86:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
After the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/
Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/7979/files
- new: https://git.openjdk.java.net/jdk/pull/7979/files/a26ebe81..907b14cb
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=01
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=00-01
Stats: 190367 lines in 2571 files changed: 136384 ins; 13432 del; 40551 mod
Patch: https://git.openjdk.java.net/jdk/pull/7979.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7979/head:pull/7979
PR: https://git.openjdk.java.net/jdk/pull/7979
More information about the hotspot-compiler-dev
mailing list