RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3]

Wed Jun 1 09:53:36 UTC 2022

On Thu, 26 May 2022 06:18:33 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example:
>> 
>> short[] addShort(short[] a, short[] b, short[] c) {
>>   for (int i = 0; i < SIZE; i++) {
>>     b[i] = (short) (a[i] + 8); //  line A
>>     sres[i] = (short) (b[i] + c[i]); // line B
>>   }
>> }
>> 
>> However, similar cases of int/float/double/long/char type can be vectorized successfully.
>> 
>> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below.
>> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png)
>> 
>> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`.
>> 
>> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable.
>> 
>> What we do in the mid-end is eliminating the sign extension before some subword integer operations like:
>> 
>> 
>> int x, y;
>> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>> 
>> to
>> 
>> short s = (short) (x OP y);
>> 
>> 
>> In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`,
>> `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short
>> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte.
>> 
>> After the patch, the short loop case above can be vectorized as:
>> 
>> movi    v18.8h, #0x8
>> ...
>> ldr     q16, [x14, #32] // vector load a[i]
>> // vector add, a[i] + 8, no promotion or demotion
>> add     v17.8h, v16.8h, v18.8h
>> str     q17, [x6, #32]  // vector store a[i] + 8, b[i]
>> ldr     q17, [x0, #32]  // vector load c[i]
>> // vector add, a[i] + c[i], no promotion or demotion
>> add     v16.8h, v17.8h, v16.8h
>> // vector add, a[i] + c[i] + 8, no promotion or demotion
>> add     v16.8h, v16.8h, v18.8h
>> str     q16, [x11, #32]  //vector store sres[i]
>> ...
>> 
>> 
>> The patch works for byte cases as well.
>> 
>> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch.
>> 
>> on AArch64:
>> Before the patch:
>> Benchmark (length)  Mode  Cnt    Score    Error  Units
>> addB       523      avgt   15  401.521 ±  0.033  ns/op
>> addS       523      avgt   15  401.512 ±  0.021  ns/op
>> 
>> After the patch:
>> Benchmark (length)  Mode  Cnt    Score   Error  Units
>> addB       523      avgt   15   68.444 ± 0.318  ns/op
>> addS       523      avgt   15   69.847 ± 0.043  ns/op
>> 
>> on x86:
>> Before the patch:
>> Benchmark (length)  Mode  Cnt    Score    Error  Units
>> addB       523      avgt   15  454.102 ± 36.180  ns/op
>> addS       523      avgt   15  432.245 ± 22.640  ns/op
>> 
>> After the patch:
>> Benchmark (length)  Mode  Cnt    Score    Error  Units
>> addB       523      avgt   15   75.812 ±  5.063  ns/op
>> addS       523      avgt   15   72.839 ± 10.109  ns/op
>> 
>> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
>> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
>> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
>> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Merge branch 'master' into fg8282470
>    
>    Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b
>  - Merge branch 'master' into fg8282470
>    
>    Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7
>  - 8282470: Eliminate useless sign extension before some subword integer operations
>    
>    Some loop cases of subword types, including byte and
>    short, can't be vectorized by C2's SLP. Here is an example:
>    ```
>    short[] addShort(short[] a, short[] b, short[] c) {
>      for (int i = 0; i < SIZE; i++) {
>        b[i] = (short) (a[i] + 8); //  *line A*
>        sres[i] = (short) (b[i] + c[i]); // *line B*
>      }
>    }
>    ```
>    However, similar cases of int/float/double/long/char type can
>    be vectorized successfully.
>    
>    The reason why SLP can't vectorize the short case above is
>    that, as illustrated here[1], the result of the scalar add
>    operation on *line A* has been promoted to int type. It needs
>    to be narrowed to short type first before it can work as one
>    of source operands of addition on *line B*. The demotion is
>    done by left-shifting 16 bits then right-shifting 16 bits.
>    The ideal graph for the process is showed like below.
>    
>     LoadS a[i]  8
>           \   /
>            AddI (line A)
>           /    \
>    StoreC b[i]  Lshift 16bits
>                     \
>                   RShiftI 16 bits    LoadS c[i]
>                           \           /
>                           AddI (line B)
>                               \
>                             StoreC sres[i]
>    
>    In SLP, for most short-type cases, we can determine the precise
>    type of the scalar int-type operation and finally execute it
>    with short-type vector operations[2], except rshift opcode and
>    abs in some situations[3]. But in this case, the source operand
>    of RShiftI is from LShiftI rather than from any LoadS[4], so we
>    can't determine its real type and conservatively assign it with
>    int type rather than real short type. The int-type opearation
>    RShiftI here can't be vectorized together with other short-type
>    operations, like AddI(line B). The reason for byte loop cases
>    is the same. Similar loop cases of char type could be
>    vectorized because its demotion from int to char is done by
>    `and` with mask rather than `lshift_rshift`.
>    
>    Therefore, we try to remove the patterns like
>    `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
>    cases, to vectorize more scenarios. Optimizing it in the
>    mid-end by i-GVN is more reasonable.
>    
>    What we do in the mid-end is eliminating the sign extension
>    before some subword integer operations like:
>    
>    ```
>    int x, y;
>    short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>    ```
>    to
>    ```
>    short s = (short) (x OP y);
>    ```
>    
>    In the patch, assuming that `x` can be any int number, we need
>    guarantee that the optimization doesn't have any impact on
>    result. Not all arithmetic logic OPs meet the requirements. For
>    example, assuming that `Imm` equals `16`, `x` equals `131068`,
>    `y` equals `50` and `OP` is division`/`,
>    `short s = (short) (((131068 << 16) >> 16) / 50)` is not
>    equal to `short s = (short) (131068 / 50)`. When OP is division,
>    we may get different result with or without demotion
>    before OP, because the upper 16 bits of division may have
>    influence on the lower 16 bits of result, which can't be
>    optimized. All optimizable opcodes are listed in
>    StoreNode::no_need_sign_extension(), whose upper 16 bits of src
>    operands don't influence the lower 16 bits of result for short
>    type and upper 24 bits of src operand don't influence the lower
>    8 bits of dst operand for byte.
>    
>    After the patch, the short loop case above can be vectorized as:
>    ```
>    movi    v18.8h, #0x8
>    ...
>    ldr     q16, [x14, #32] // vector load a[i]
>    // vector add, a[i] + 8, no promotion or demotion
>    add     v17.8h, v16.8h, v18.8h
>    str     q17, [x6, #32]  // vector store a[i] + 8, b[i]
>    ldr     q17, [x0, #32]  // vector load c[i]
>    // vector add, a[i] + c[i], no promotion or demotion
>    add     v16.8h, v17.8h, v16.8h
>    // vector add, a[i] + c[i] + 8, no promotion or demotion
>    add     v16.8h, v16.8h, v18.8h
>    str     q16, [x11, #32]  //vector store sres[i]
>    ...
>    ```
>    
>    The patch works for byte cases as well.
>    
>    Here is the performance data for micro-benchmark before
>    and after this patch on both AArch64 and x64 machines.
>    We can observe about ~83% improvement with this patch.
>    
>    on AArch64:
>    Before the patch:
>    Benchmark (length)  Mode  Cnt    Score    Error  Units
>    addB       523      avgt   15  401.521 ±  0.033  ns/op
>    addS       523      avgt   15  401.512 ±  0.021  ns/op
>    
>    After the patch:
>    Benchmark (length)  Mode  Cnt    Score   Error  Units
>    addB       523      avgt   15   68.444 ± 0.318  ns/op
>    addS       523      avgt   15   69.847 ± 0.043  ns/op
>    
>    on x86:
>    Before the patch:
>    Benchmark (length)  Mode  Cnt    Score    Error  Units
>    addB       523      avgt   15  454.102 ± 36.180  ns/op
>    addS       523      avgt   15  432.245 ± 22.640  ns/op
>    
>    After the patch:
>    Benchmark (length)  Mode  Cnt    Score    Error  Units
>    addB       523      avgt   15   75.812 ±  5.063  ns/op
>    addS       523      avgt   15   72.839 ± 10.109  ns/op
>    
>    [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
>    [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
>    [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
>    [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>    
>    Change-Id: I92ce42b550ef057964a3b58716436735275d8d31

So the current implementation only works for limited scenarios, right?

I'm not sure if there is a good way to eliminate the useless sign extension in `testShort2`.
But I really hope this opt can be used for more situations.

Let's see if someone has a good idea.
Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954