RFR: 8283307: Vectorize unsigned shift right on signed subword types [v2]

Fri Apr 22 11:15:18 UTC 2022

On Wed, 20 Apr 2022 04:04:34 GMT, Jie Fu <jiefu at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>> 
>>  - Remove related comments in some test files
>>    
>>    Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8
>>  - Merge branch 'master' into fg8283307
>>    
>>    Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a
>>  - 8283307: Vectorize unsigned shift right on signed subword types
>>    
>>    ```
>>    public short[] vectorUnsignedShiftRight(short[] shorts) {
>>        short[] res = new short[SIZE];
>>        for (int i = 0; i < SIZE; i++) {
>>            res[i] = (short) (shorts[i] >>> 3);
>>        }
>>        return res;
>>    }
>>    ```
>>    In C2's SLP, vectorization of unsigned shift right on signed
>>    subword types (byte/short) like the case above is intentionally
>>    disabled[1], because the vector unsigned shift on signed
>>    subword types behaves differently from the Java spec. It is
>>    worthwhile to vectorize more cases at quite low cost. Also,
>>    unsigned shift right on signed subword types is not uncommon, and
>>    similar cases can be found in the Lucene benchmark[2].
>>    
>>    Taking unsigned right shift on short type as an example,
>>    
>>    Short:
>>        | <- 16 bits  -> |  <- 16 bits ->  |
>>        | 1 1 1 ... 1  1 |      data       |
>>    
>>    when the shift amount is a constant not greater than the number
>>    of sign-extended bits (the 16 higher bits for the short type, as
>>    shown above), the unsigned shift on signed subword types can be
>>    transformed into a signed shift and hence becomes vectorizable.
>>    Here is the transformation:
>>    
>>    For T_SHORT (shift <= 16):
>>      src    RShiftCntV shift          src    RShiftCntV shift
>>       \      /                  ==>    \       /
>>       URShiftVS                         RShiftVS
>>    
>>    This patch does the transformation in SuperWord::implemented() and
>>    SuperWord::output(). It helps vectorize the short cases above. We
>>    can handle unsigned right shift on byte type in a similar way. The
>>    generated assembly code for one iteration on AArch64 looks like:
>>    ```
>>    ...
>>    sbfiz   x13, x10, #1, #32
>>    add     x15, x11, x13
>>    ldr     q16, [x15, #16]
>>    sshr    v16.8h, v16.8h, #3
>>    add     x13, x17, x13
>>    str     q16, [x13, #16]
>>    ...
>>    ```
>>    
>>    Here is the performance data for the micro-benchmark before and
>>    after this patch on both AArch64 and x64 machines. We observe
>>    about an 80% improvement with this patch.
>>    
>>    The perf data on AArch64:
>>    Before the patch:
>>    Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>>    urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
>>    urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op
>>    
>>    After the patch:
>>    Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
>>    urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
>>    urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op
>>    
>>    The perf data on X86:
>>    Before the patch:
>>    Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
>>    urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
>>    urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op
>>    
>>    After the patch:
>>    Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
>>    urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
>>    urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op
>>    
>>    [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
>>    [2] https://github.com/jpountz/decode-128-ints-benchmark/
>>    
>>    Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
>
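
To make the equivalence described in the commit message concrete, here is a minimal standalone sketch (not part of the patch). The class and method names are hypothetical. It compares the Java-level unsigned shift on a short, the signed shift, and a model of a logical shift applied directly to a 16-bit lane, which is my assumption of how a vector unsigned shift on short lanes behaves.

```java
final class SubwordShiftDemo {
    // Java semantics: the short is sign-extended to int, shifted
    // logically, then narrowed back to its low 16 bits.
    static short javaUnsignedShift(short s, int shift) {
        return (short) (s >>> shift);
    }

    // Arithmetic shift on the same sign-extended value.
    static short javaSignedShift(short s, int shift) {
        return (short) (s >> shift);
    }

    // Model (assumption) of a logical right shift performed directly on a
    // 16-bit lane: the upper 16 bits are zero instead of sign copies.
    static short laneLogicalShift(short s, int shift) {
        return (short) ((s & 0xFFFF) >>> shift);
    }

    public static void main(String[] args) {
        short s = (short) -8; // bit pattern 0xFFF8
        System.out.println(javaUnsignedShift(s, 3)); // -1
        System.out.println(javaSignedShift(s, 3));   // -1
        System.out.println(laneLogicalShift(s, 3));  // 8191
    }
}
```

For shift amounts from 0 to 16 the first two methods agree for every short value, since only sign-extension bits are shifted into the low 16 bits. With a shift of 17 they already differ for negative inputs, which is why the transformation is limited to shift <= 16 for T_SHORT.
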
> test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 114:
> 
>> 112:         testByte0();
>> 113:         for (int i = 0; i < bytea.length; i++) {
>> 114:             Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3));
> 
> I'm still a bit worried about the test.
> 
> Suggestion:
> Rewrite
> 
> Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3));
> 
> 
> to
> 
> Asserts.assertEquals(byteb[i], urshift(bytea[i], 3));
> 
> And disable inlining during the testing.

Done. Thanks very much @DamonFool.
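
For readers of the archive, a rough sketch of the shape of such a reference helper follows. It is standalone and simplified, not the code in the test file: the class name is hypothetical, the array contents are made up, and the loop that fills `byteb` only stands in for the compiled `testByte0()`. In the real test the helper would also need to be kept out of line, for example with the IR framework's `@DontInline` annotation or a `-XX:CompileCommand=dontinline,...` option; which of these the test uses is not shown here.

```java
final class URShiftReference {
    // Hypothetical reference helper, named after the suggestion above.
    // It computes the expected value with plain scalar Java semantics.
    static byte urshift(byte input, int amount) {
        return (byte) (input >>> amount);
    }

    public static void main(String[] args) {
        byte[] bytea = {-128, -1, 0, 1, 127};
        byte[] byteb = new byte[bytea.length];

        // Stand-in for the loop under test (testByte0() in the review).
        for (int i = 0; i < bytea.length; i++) {
            byteb[i] = (byte) (bytea[i] >>> 3);
        }

        // Verification goes through the helper rather than repeating the
        // same expression inline next to the loop under test.
        for (int i = 0; i < bytea.length; i++) {
            if (byteb[i] != urshift(bytea[i], 3)) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("ok");
    }
}
```
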

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979