RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]

Pengfei Li pli at openjdk.java.net
Wed Jun 1 02:16:40 UTC 2022


On Tue, 31 May 2022 09:12:10 GMT, Pengfei Li <pli at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>> 
>>  - Rewrite the scalar calculation to avoid inline
>>    
>>    Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723
>>  - Merge branch 'master' into fg8283307
>>    
>>    Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303
>>  - Remove related comments in some test files
>>    
>>    Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8
>>  - Merge branch 'master' into fg8283307
>>    
>>    Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a
>>  - 8283307: Vectorize unsigned shift right on signed subword types
>>    
>>    ```
>>    public short[] vectorUnsignedShiftRight(short[] shorts) {
>>        short[] res = new short[SIZE];
>>        for (int i = 0; i < SIZE; i++) {
>>            res[i] = (short) (shorts[i] >>> 3);
>>        }
>>        return res;
>>    }
>>    ```
>>    In C2's SLP, vectorization of unsigned right shift on signed
>>    subword types (byte/short), as in the case above, is intentionally
>>    disabled[1], because a vector unsigned shift on signed subword
>>    lanes behaves differently from what the Java spec requires. It is
>>    worth vectorizing more of these cases at quite low cost: unsigned
>>    right shift on signed subwords is not uncommon, and similar cases
>>    appear in the Lucene benchmark[2].
>>    
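>>    As a concrete illustration of the mismatch (a minimal sketch, not
>>    part of the patch), consider a negative short value: the scalar
>>    Java code first sign-extends it to 32 bits, so ones are shifted
>>    into the low 16 bits, while a 16-bit-lane vector unsigned shift
>>    would shift in zeros:
>>    ```
>>    public class UrShiftMismatch {
>>        public static void main(String[] args) {
>>            short x = -4;                               // 0xFFFC
>>            // Scalar Java semantics: sign-extend to int, shift, narrow back.
>>            short scalar = (short) (x >>> 3);           // 0xFFFFFFFC >>> 3 = 0x1FFFFFFF -> (short) 0xFFFF = -1
>>            // Simulated 16-bit lane unsigned shift: zeros come in from the top.
>>            short lane = (short) ((x & 0xFFFF) >>> 3);  // 0x0000FFFC >>> 3 = 0x1FFF = 8191
>>            System.out.println(scalar + " vs " + lane); // prints "-1 vs 8191"
>>        }
>>    }
>>    ```
>>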
>>    Taking unsigned right shift on the short type as an example:
>>    
>>    Short (a negative value, sign-extended to 32 bits):
>>        | <- 16 bits  -> |  <- 16 bits ->  |
>>        | 1 1 1 ... 1  1 |      data       |
>>    
>>    when the shift amount is a constant no greater than the number of
>>    sign-extension bits (the 16 higher bits for short, as shown above),
>>    the unsigned shift on a signed subword type can be transformed into
>>    a signed shift and hence becomes vectorizable. Here is the
>>    transformation:
>>    
>>    For T_SHORT (shift <= 16):
>>      src    RShiftCntV shift          src    RShiftCntV shift
>>       \      /                  ==>    \       /
>>       URShiftVS                         RShiftVS
>>    
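>>    The rewrite is safe because, for any short value and any constant
>>    shift of at most 16, the low 16 bits produced by `>>>` and `>>` on
>>    the sign-extended 32-bit value are identical. A small exhaustive
>>    check of this equivalence (a sketch, not part of the patch):
>>    ```
>>    public class UrShiftEquivalence {
>>        public static void main(String[] args) {
>>            for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
>>                short x = (short) v;
>>                for (int s = 0; s <= 16; s++) {
>>                    // Both shifts pick the same source bits for the low 16 result bits.
>>                    if ((short) (x >>> s) != (short) (x >> s)) {
>>                        throw new AssertionError("mismatch: x=" + x + " s=" + s);
>>                    }
>>                }
>>            }
>>            System.out.println("(short)(x >>> s) == (short)(x >> s) for all shorts, 0 <= s <= 16");
>>        }
>>    }
>>    ```
>>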
>>    This patch does the transformation in SuperWord::implemented() and
>>    SuperWord::output(), which makes the short case above vectorizable.
>>    Unsigned right shift on the byte type is handled in the same way.
>>    The generated assembly code for one iteration on AArch64 looks like:
>>    ```
>>    ...
>>    sbfiz   x13, x10, #1, #32
>>    add     x15, x11, x13
>>    ldr     q16, [x15, #16]
>>    sshr    v16.8h, v16.8h, #3
>>    add     x13, x17, x13
>>    str     q16, [x13, #16]
>>    ...
>>    ```
>>    
>>    Here is the performance data for the micro-benchmark before and
>>    after this patch on both AArch64 and x64 machines; a sketch of the
>>    benchmark's shape follows the tables. We observe roughly an 80%
>>    improvement with this patch.
>>    
>>    The perf data on AArch64:
>>    Before the patch:
>>    Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>>    urShiftImmByte     1024             3  avgt    5  295.711 ± 0.117  ns/op
>>    urShiftImmShort    1024             3  avgt    5  284.559 ± 0.148  ns/op
>>    
>>    After the patch:
>>    Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>>    urShiftImmByte     1024             3  avgt    5   45.111 ± 0.047  ns/op
>>    urShiftImmShort    1024             3  avgt    5   55.294 ± 0.072  ns/op
>>    
>>    The perf data on x86:
>>    Before the patch:
>>    Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>>    urShiftImmByte     1024             3  avgt    5  361.374 ± 4.621  ns/op
>>    urShiftImmShort    1024             3  avgt    5  365.390 ± 3.595  ns/op
>>    
>>    After the patch:
>>    Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
>>    urShiftImmByte     1024             3  avgt    5  105.489 ± 0.488  ns/op
>>    urShiftImmShort    1024             3  avgt    5   43.400 ± 0.394  ns/op
>>    
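>>    The micro-benchmark itself is not reproduced here; a minimal JMH
>>    sketch of its likely shape is shown below. The method names match
>>    the JMH output above, but the class name, setup code and parameter
>>    handling are assumptions, not the actual benchmark in the patch.
>>    ```
>>    import java.util.concurrent.TimeUnit;
>>    import org.openjdk.jmh.annotations.*;
>>    
>>    @BenchmarkMode(Mode.AverageTime)
>>    @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>    @State(Scope.Thread)
>>    public class UrShiftImmBenchmark {
>>        @Param("1024")
>>        int SIZE;
>>    
>>        @Param("3")
>>        int shiftCount;  // only mirrors the parameter column in the output above;
>>                         // the loops below use a literal constant so SLP can vectorize
>>    
>>        short[] shorts;
>>        byte[] bytes;
>>    
>>        @Setup
>>        public void setup() {
>>            shorts = new short[SIZE];
>>            bytes = new byte[SIZE];
>>            for (int i = 0; i < SIZE; i++) {
>>                shorts[i] = (short) (i - SIZE / 2);
>>                bytes[i] = (byte) (i - SIZE / 2);
>>            }
>>        }
>>    
>>        @Benchmark
>>        public short[] urShiftImmShort() {
>>            short[] res = new short[SIZE];
>>            for (int i = 0; i < SIZE; i++) {
>>                res[i] = (short) (shorts[i] >>> 3);
>>            }
>>            return res;
>>        }
>>    
>>        @Benchmark
>>        public byte[] urShiftImmByte() {
>>            byte[] res = new byte[SIZE];
>>            for (int i = 0; i < SIZE; i++) {
>>                res[i] = (byte) (bytes[i] >>> 3);
>>            }
>>            return res;
>>        }
>>    }
>>    ```
>>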
>>    [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
>>    [2] https://github.com/jpountz/decode-128-ints-benchmark/
>>    
>>    Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
>
> test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 36:
> 
>> 34:  * @key randomness
>> 35:  * @summary Auto-vectorization enhancement for unsigned shift right on signed subword types
>> 36:  * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64"
> 
> This IR test for the vectorization check looks good on AArch64. But AFAIK, some operations cannot be vectorized on old x86 CPUs with AVX=1. Could you add something like `(os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*")` to check the CPU feature?

@merykitty @sviswa7 Could you help confirm whether byte/short shift operations are vectorizable with all AVX versions of x86?
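
If such a restriction does turn out to be needed, the tightened `@requires` line might look roughly like the following (just a sketch combining the existing aarch64 clause with the suggested x64 one; not what the test currently contains):
```
/*
 * @requires os.arch=="aarch64" |
 *           (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*")
 */
```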

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

