RFR: 8283307: Vectorize unsigned shift right on signed subword types

Fei Gao fgao at openjdk.java.net
Mon Apr 18 09:06:42 UTC 2022


On Mon, 18 Apr 2022 01:29:13 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> public short[] vectorUnsignedShiftRight(short[] shorts) {
>>     short[] res = new short[SIZE];
>>     for (int i = 0; i < SIZE; i++) {
>>         res[i] = (short) (shorts[i] >>> 3);
>>     }
>>     return res;
>> }
>> 
>> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2].
>> 
>> Taking unsigned right shift on short type as an example,
>> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png)
>> 
>> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like
>> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:
>> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png)
>> 
>> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like:
>> 
>> ...
>> sbfiz   x13, x10, #1, #32
>> add     x15, x11, x13
>> ldr     q16, [x15, #16]
>> sshr    v16.8h, v16.8h, #3
>> add     x13, x17, x13
>> str     q16, [x13, #16]
>> ...
>> 
>> 
>> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch.
>> 
>> The perf data on AArch64:
>> Before the patch:
>> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>> urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
>> urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op
>> 
>> after the patch:
>> Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
>> urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
>> urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op
>> 
>> The perf data on X86:
>> Before the patch:
>> Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
>> urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
>> urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op
>> 
>> After the patch:
>> Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
>> urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
>> urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op
>> 
>> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
>> [2] https://github.com/jpountz/decode-128-ints-benchmark/
>
>> My humble opinion is that we should not bother to optimise this pattern. An unsigned shift should operate on an unsigned type, so doing `(short)(s[i] >>> k)` is more likely a bug rather than an intended behaviour. This operation is really hard to reason about, and often not the result anyone cares about. And the second point is that if you want an unsigned shift, you should cast the values to `int` in an unsigned manner.
> 
> Thanks for your kind review, @merykitty . 
> 
> We may find some unsigned shift on signed subword types in the benchmark of Lucene, https://github.com/jpountz/decode-128-ints-benchmark/. So this pattern is possibly intended, in my opinion. 
> 
>> The first point is that this operation cannot be reason as a simple shift, and must be viewed as an int shift between 2 casts. 
> 
> Yes. I really agree. The byte/short value is promoted to int first then can be shifted, and is converted to bite/short. That's why we do the replacement here as I explained in the conversations above. 
> 
> What do you think? Thanks :)

> @fg1417 Thanks for your response, I have taken a look at the benchmark, it seems that the full operation is `((short & 0xFFFF) >>> k) & (1 << l - 1)` and then the `& 0xFFFF` part is omitted if `k + l <= 16`. So it would be the most appropriate if we can do the same here, that is to recognise the pattern `(short & x) >>> k` and `(short >>> k) & y` with `x \in 0xFFFF` and `y \in (1 << (16 - k) - 1)` (`\in` as in every set bit of the first operand is also a set bit of the second operand).

Thanks for your kind reply, @merykitty .

Actually, the pattern `(short & x) >>> k` has been transformed into `(short  >>> k) & (x >>> k)` in GVN phase https://github.com/openjdk/jdk/blob/21ea740e1da48054ee46efda493d0812a35d786e/src/hotspot/share/opto/mulnode.cpp#L1305. In this way, SLP need to support only `(short  >>> k)` like the patch did, to help vectorize the whole `(short & x) >>> k` or `(short >>> k) & y`, because the rest part,  ` & (x >>> k)` or `& y`, has been supported by SLP already. 

If the optimization in GVN phase doesn't work and the pattern `(short & x) >>> k` is transferred to SLP phase, SLP won't take  `(short & x) >>> k` as potential short vector operations. Because C2 can't get precise info about sign here for the shift operation, https://github.com/openjdk/jdk/blob/21ea740e1da48054ee46efda493d0812a35d786e/src/hotspot/share/opto/superword.cpp#L3328. If the src input of the `URShift` is not from a `load` node, we can't assign it as any signed subword type and vectorization will break off. That's to keep strictly consistent with Java Spec in `Shift` behavior.

Thank you very much.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979


More information about the hotspot-compiler-dev mailing list