RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3]

Thu Jun 2 06:21:42 UTC 2022

On Wed, 1 Jun 2022 20:55:16 GMT, John R Rose <jrose at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>> 
>>  - Merge branch 'master' into fg8282470
>>    
>>    Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b
>>  - Merge branch 'master' into fg8282470
>>    
>>    Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7
>>  - 8282470: Eliminate useless sign extension before some subword integer operations
>>    
>>    Some loop cases of subword types, including byte and
>>    short, can't be vectorized by C2's SLP. Here is an example:
>>    ```
>>    short[] addShort(short[] a, short[] b, short[] c) {
>>      for (int i = 0; i < SIZE; i++) {
>>        b[i] = (short) (a[i] + 8); //  *line A*
>>        sres[i] = (short) (b[i] + c[i]); // *line B*
>>      }
>>    }
>>    ```
>>    However, similar cases of int/float/double/long/char type can
>>    be vectorized successfully.
>>    
>>    The reason why SLP can't vectorize the short case above is
>>    that, as illustrated here[1], the result of the scalar add
>>    operation on *line A* has been promoted to int type. It needs
>>    to be narrowed to short type first before it can work as one
>>    of source operands of addition on *line B*. The demotion is
>>    done by left-shifting 16 bits then right-shifting 16 bits.
>>    The ideal graph for the process is showed like below.
>>    
>>     LoadS a[i]  8
>>           \   /
>>            AddI (line A)
>>           /    \
>>    StoreC b[i]  Lshift 16bits
>>                     \
>>                   RShiftI 16 bits    LoadS c[i]
>>                           \           /
>>                           AddI (line B)
>>                               \
>>                             StoreC sres[i]
>>    
>>    In SLP, for most short-type cases, we can determine the precise
>>    type of the scalar int-type operation and finally execute it
>>    with short-type vector operations[2], except rshift opcode and
>>    abs in some situations[3]. But in this case, the source operand
>>    of RShiftI is from LShiftI rather than from any LoadS[4], so we
>>    can't determine its real type and conservatively assign it with
>>    int type rather than real short type. The int-type opearation
>>    RShiftI here can't be vectorized together with other short-type
>>    operations, like AddI(line B). The reason for byte loop cases
>>    is the same. Similar loop cases of char type could be
>>    vectorized because its demotion from int to char is done by
>>    `and` with mask rather than `lshift_rshift`.
>>    
>>    Therefore, we try to remove the patterns like
>>    `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
>>    cases, to vectorize more scenarios. Optimizing it in the
>>    mid-end by i-GVN is more reasonable.
>>    
>>    What we do in the mid-end is eliminating the sign extension
>>    before some subword integer operations like:
>>    
>>    ```
>>    int x, y;
>>    short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>>    ```
>>    to
>>    ```
>>    short s = (short) (x OP y);
>>    ```
>>    
>>    In the patch, assuming that `x` can be any int number, we need
>>    guarantee that the optimization doesn't have any impact on
>>    result. Not all arithmetic logic OPs meet the requirements. For
>>    example, assuming that `Imm` equals `16`, `x` equals `131068`,
>>    `y` equals `50` and `OP` is division`/`,
>>    `short s = (short) (((131068 << 16) >> 16) / 50)` is not
>>    equal to `short s = (short) (131068 / 50)`. When OP is division,
>>    we may get different result with or without demotion
>>    before OP, because the upper 16 bits of division may have
>>    influence on the lower 16 bits of result, which can't be
>>    optimized. All optimizable opcodes are listed in
>>    StoreNode::no_need_sign_extension(), whose upper 16 bits of src
>>    operands don't influence the lower 16 bits of result for short
>>    type and upper 24 bits of src operand don't influence the lower
>>    8 bits of dst operand for byte.
>>    
>>    After the patch, the short loop case above can be vectorized as:
>>    ```
>>    movi    v18.8h, #0x8
>>    ...
>>    ldr     q16, [x14, #32] // vector load a[i]
>>    // vector add, a[i] + 8, no promotion or demotion
>>    add     v17.8h, v16.8h, v18.8h
>>    str     q17, [x6, #32]  // vector store a[i] + 8, b[i]
>>    ldr     q17, [x0, #32]  // vector load c[i]
>>    // vector add, a[i] + c[i], no promotion or demotion
>>    add     v16.8h, v17.8h, v16.8h
>>    // vector add, a[i] + c[i] + 8, no promotion or demotion
>>    add     v16.8h, v16.8h, v18.8h
>>    str     q16, [x11, #32]  //vector store sres[i]
>>    ...
>>    ```
>>    
>>    The patch works for byte cases as well.
>>    
>>    Here is the performance data for micro-benchmark before
>>    and after this patch on both AArch64 and x64 machines.
>>    We can observe about ~83% improvement with this patch.
>>    
>>    on AArch64:
>>    Before the patch:
>>    Benchmark (length)  Mode  Cnt    Score    Error  Units
>>    addB       523      avgt   15  401.521 ±  0.033  ns/op
>>    addS       523      avgt   15  401.512 ±  0.021  ns/op
>>    
>>    After the patch:
>>    Benchmark (length)  Mode  Cnt    Score   Error  Units
>>    addB       523      avgt   15   68.444 ± 0.318  ns/op
>>    addS       523      avgt   15   69.847 ± 0.043  ns/op
>>    
>>    on x86:
>>    Before the patch:
>>    Benchmark (length)  Mode  Cnt    Score    Error  Units
>>    addB       523      avgt   15  454.102 ± 36.180  ns/op
>>    addS       523      avgt   15  432.245 ± 22.640  ns/op
>>    
>>    After the patch:
>>    Benchmark (length)  Mode  Cnt    Score    Error  Units
>>    addB       523      avgt   15   75.812 ±  5.063  ns/op
>>    addS       523      avgt   15   72.839 ± 10.109  ns/op
>>    
>>    [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
>>    [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
>>    [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
>>    [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>>    
>>    Change-Id: I92ce42b550ef057964a3b58716436735275d8d31
>
> I will believe more in this transformation if it covers a wide range of expressions, not just a few specially-matched patterns `r = (short)(x + (int)y)`.
> 
> Searching to an unbounded depth in one transform is usually a sign something is formulated wrong.  Often it's possible to break the search up into a series of incremental transformations that, taken together, get to the desired end.
> 
> In this case, I guess you want to be able to transform `r = (short)XYZ` such that all needless short-to-int conversions inside of `XYZ` are removed.  A short-to-int conversion `S2I(x)` is needless if (a) it is an operand of a carryless or leftward-carrying ALU operation `S2I(x) op y`, and (b) all consumers of the operation only care about the low 16 bits of the result.
> 
> I think that, whether or not this patch is accepted, and certainly before the next patch which handles ternary adds or SubI or MulI in a similar manner is proposed, we should tackle that more general problem.  (And not just for 16-bit ints.  If I do a mask `(x + y) & 0x7F` or sign extension `(x + y) <<7 >>7` I should be able to benefit from the same kinds of reasoning.

Thanks for your review and kind suggestions, @rose00 @DamonFool @JohnTortugo 

I really understand your concern and agree. Frankly, I attempted to seek for a common solution to cover a wider range but failed. But I will give a more try to figure it out if we can make this transformation not that specific and more general, before continuing proceeding with this patch. Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954