RFR: 8283307: Vectorize unsigned shift right on signed subword types

Fei Gao fgao at openjdk.java.net
Fri Apr 8 08:22:43 UTC 2022


On Fri, 8 Apr 2022 03:53:56 GMT, Jie Fu <jiefu at openjdk.org> wrote:

>> public short[] vectorUnsignedShiftRight(short[] shorts) {
>>     short[] res = new short[SIZE];
>>     for (int i = 0; i < SIZE; i++) {
>>         res[i] = (short) (shorts[i] >>> 3);
>>     }
>>     return res;
>> }
>> 
>> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short), as in the case above, is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It is worthwhile to vectorize more of these cases at quite low cost. Also, unsigned shift right on signed subword types is not uncommon, and similar cases can be found in a Lucene benchmark[2].
>> 
>> Taking unsigned right shift on short type as an example,
>> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png)
>> 
>> when the shift amount is a constant not greater than the number of sign-extended bits (the upper 16 bits for the short type, as shown
>> above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:
>> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png)
>> 
>> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`, which allows the short case above to be vectorized. Unsigned right shift on the byte type can be handled in a similar way. The generated assembly code for one iteration on AArch64 looks like:
>> 
>> ...
>> sbfiz   x13, x10, #1, #32
>> add     x15, x11, x13
>> ldr     q16, [x15, #16]
>> sshr    v16.8h, v16.8h, #3
>> add     x13, x17, x13
>> str     q16, [x13, #16]
>> ...
>> 
>> 
>> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe an improvement of about 80% with this patch.
>> 
>> The perf data on AArch64:
>> Before the patch:
>> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>> urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
>> urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op
>> 
>> After the patch:
>> Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
>> urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
>> urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op
>> 
>> The perf data on X86:
>> Before the patch:
>> Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
>> urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
>> urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op
>> 
>> After the patch:
>> Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
>> urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
>> urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op
>> 
>> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
>> [2] https://github.com/jpountz/decode-128-ints-benchmark/
>
> src/hotspot/share/opto/superword.cpp line 2027:
> 
>> 2025:       }
>> 2026:     } else {
>> 2027:       // Vector unsigned right shift for signed subword types behaves differently
> 
> Can you make the difference clearer?

In any Java arithmetic operation, operands of the small integer types (byte, char and short) are promoted to int first. For example, a negative short value, after sign-extension to int, looks like:
![image](https://user-images.githubusercontent.com/39403138/162386713-13c8cc1d-3075-4680-8170-dcbac19abd0a.png)
In the Java spec, an unsigned right shift on the promoted value shifts the data right and fills the vacated upper bits with zeros. When the shift amount is less than 16, the lower 16 bits of the result look as if the short value had been shifted right with ones filled in from the left, like:
![image](https://user-images.githubusercontent.com/39403138/162389373-9b178d03-d259-4cac-8c3a-669892380ca6.png)
As vector elements of small types don't have the upper bits of an int, a vector unsigned right shift on short elements fills the vacated upper bits of each 16-bit element with 0 directly, like:
![image](https://user-images.githubusercontent.com/39403138/162390101-d1b53d2f-54be-48d5-9210-11d71c3f9145.png)
In this way, the result of vector unsigned right shift is different from the result of scalar unsigned right shift for signed subword types.
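
The difference can also be reproduced in plain Java. The sketch below is only an illustration, not code from the patch: the class name `UnsignedShiftDemo` and its three helper methods are hypothetical, and `laneLogicalShift` merely models a 16-bit vector lane by masking off the upper bits of the promoted int before shifting.

```java
public class UnsignedShiftDemo {
    // Scalar Java semantics: the short is sign-extended to int,
    // shifted right with zero fill, then narrowed back to 16 bits.
    public static short scalarUrShift(short s, int shift) {
        return (short) (s >>> shift);
    }

    // Signed (arithmetic) shift on the promoted value, narrowed to 16 bits.
    public static short scalarSignedShift(short s, int shift) {
        return (short) (s >> shift);
    }

    // Model of a logical shift on a 16-bit vector lane (an assumption made
    // for illustration): the lane has no upper int bits, so zeros are
    // shifted into bit 15 directly.
    public static short laneLogicalShift(short s, int shift) {
        return (short) ((s & 0xFFFF) >>> shift);
    }

    private static String hex16(short v) {
        return String.format("0x%04X", v & 0xFFFF);
    }

    public static void main(String[] args) {
        short s = (short) 0x8010; // a negative short value
        System.out.println("scalar >>> 3        : " + hex16(scalarUrShift(s, 3)));
        System.out.println("scalar >>  3        : " + hex16(scalarSignedShift(s, 3)));
        System.out.println("16-bit lane logical : " + hex16(laneLogicalShift(s, 3)));

        // For every short value and every shift amount from 0 to 16, the
        // scalar unsigned shift and the signed shift agree after narrowing.
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            for (int shift = 0; shift <= 16; shift++) {
                if (scalarUrShift((short) v, shift) != scalarSignedShift((short) v, shift)) {
                    throw new AssertionError("mismatch at value " + v + ", shift " + shift);
                }
            }
        }
        System.out.println("scalar >>> and >> agree for shift amounts 0..16");
    }
}
```

For the negative input, the first two results are the same, while the modelled lane-wise logical shift gives a different value. The exhaustive loop checks that the scalar unsigned shift and the signed shift agree for shift amounts up to 16, which is the property the transformation relies on.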

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

