RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3]
Fei Gao
fgao at openjdk.java.net
Thu Jun 2 06:21:42 UTC 2022
On Wed, 1 Jun 2022 20:55:16 GMT, John R Rose <jrose at openjdk.org> wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>>
>> - Merge branch 'master' into fg8282470
>>
>> Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b
>> - Merge branch 'master' into fg8282470
>>
>> Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7
>> - 8282470: Eliminate useless sign extension before some subword integer operations
>>
>> Some loop cases of subword types, including byte and
>> short, can't be vectorized by C2's SLP. Here is an example:
>> ```
>> short[] addShort(short[] a, short[] b, short[] c) {
>> for (int i = 0; i < SIZE; i++) {
>> b[i] = (short) (a[i] + 8); // *line A*
>> sres[i] = (short) (b[i] + c[i]); // *line B*
>> }
>> }
>> ```
>> However, similar cases of int/float/double/long/char type can
>> be vectorized successfully.
>>
>> The reason why SLP can't vectorize the short case above is
>> that, as illustrated here[1], the result of the scalar add
>> operation on *line A* has been promoted to int type. It needs
>> to be narrowed to short type first before it can work as one
>> of source operands of addition on *line B*. The demotion is
>> done by left-shifting 16 bits then right-shifting 16 bits.
>> The ideal graph for the process is showed like below.
>>
>> LoadS a[i] 8
>> \ /
>> AddI (line A)
>> / \
>> StoreC b[i] Lshift 16bits
>> \
>> RShiftI 16 bits LoadS c[i]
>> \ /
>> AddI (line B)
>> \
>> StoreC sres[i]
>>
>> In SLP, for most short-type cases, we can determine the precise
>> type of the scalar int-type operation and finally execute it
>> with short-type vector operations[2], except rshift opcode and
>> abs in some situations[3]. But in this case, the source operand
>> of RShiftI is from LShiftI rather than from any LoadS[4], so we
>> can't determine its real type and conservatively assign it with
>> int type rather than real short type. The int-type opearation
>> RShiftI here can't be vectorized together with other short-type
>> operations, like AddI(line B). The reason for byte loop cases
>> is the same. Similar loop cases of char type could be
>> vectorized because its demotion from int to char is done by
>> `and` with mask rather than `lshift_rshift`.
>>
>> Therefore, we try to remove the patterns like
>> `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
>> cases, to vectorize more scenarios. Optimizing it in the
>> mid-end by i-GVN is more reasonable.
>>
>> What we do in the mid-end is eliminating the sign extension
>> before some subword integer operations like:
>>
>> ```
>> int x, y;
>> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>> ```
>> to
>> ```
>> short s = (short) (x OP y);
>> ```
>>
>> In the patch, assuming that `x` can be any int number, we need
>> guarantee that the optimization doesn't have any impact on
>> result. Not all arithmetic logic OPs meet the requirements. For
>> example, assuming that `Imm` equals `16`, `x` equals `131068`,
>> `y` equals `50` and `OP` is division`/`,
>> `short s = (short) (((131068 << 16) >> 16) / 50)` is not
>> equal to `short s = (short) (131068 / 50)`. When OP is division,
>> we may get different result with or without demotion
>> before OP, because the upper 16 bits of division may have
>> influence on the lower 16 bits of result, which can't be
>> optimized. All optimizable opcodes are listed in
>> StoreNode::no_need_sign_extension(), whose upper 16 bits of src
>> operands don't influence the lower 16 bits of result for short
>> type and upper 24 bits of src operand don't influence the lower
>> 8 bits of dst operand for byte.
>>
>> After the patch, the short loop case above can be vectorized as:
>> ```
>> movi v18.8h, #0x8
>> ...
>> ldr q16, [x14, #32] // vector load a[i]
>> // vector add, a[i] + 8, no promotion or demotion
>> add v17.8h, v16.8h, v18.8h
>> str q17, [x6, #32] // vector store a[i] + 8, b[i]
>> ldr q17, [x0, #32] // vector load c[i]
>> // vector add, a[i] + c[i], no promotion or demotion
>> add v16.8h, v17.8h, v16.8h
>> // vector add, a[i] + c[i] + 8, no promotion or demotion
>> add v16.8h, v16.8h, v18.8h
>> str q16, [x11, #32] //vector store sres[i]
>> ...
>> ```
>>
>> The patch works for byte cases as well.
>>
>> Here is the performance data for micro-benchmark before
>> and after this patch on both AArch64 and x64 machines.
>> We can observe about ~83% improvement with this patch.
>>
>> on AArch64:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 401.521 ± 0.033 ns/op
>> addS 523 avgt 15 401.512 ± 0.021 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 68.444 ± 0.318 ns/op
>> addS 523 avgt 15 69.847 ± 0.043 ns/op
>>
>> on x86:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 454.102 ± 36.180 ns/op
>> addS 523 avgt 15 432.245 ± 22.640 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 75.812 ± 5.063 ns/op
>> addS 523 avgt 15 72.839 ± 10.109 ns/op
>>
>> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
>> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
>> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
>> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>>
>> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31
>
> I will believe more in this transformation if it covers a wide range of expressions, not just a few specially-matched patterns `r = (short)(x + (int)y)`.
>
> Searching to an unbounded depth in one transform is usually a sign something is formulated wrong. Often it's possible to break the search up into a series of incremental transformations that, taken together, get to the desired end.
>
> In this case, I guess you want to be able to transform `r = (short)XYZ` such that all needless short-to-int conversions inside of `XYZ` are removed. A short-to-int conversion `S2I(x)` is needless if (a) it is an operand of a carryless or leftward-carrying ALU operation `S2I(x) op y`, and (b) all consumers of the operation only care about the low 16 bits of the result.
>
> I think that, whether or not this patch is accepted, and certainly before the next patch which handles ternary adds or SubI or MulI in a similar manner is proposed, we should tackle that more general problem. (And not just for 16-bit ints. If I do a mask `(x + y) & 0x7F` or sign extension `(x + y) <<7 >>7` I should be able to benefit from the same kinds of reasoning.
Thanks for your review and kind suggestions, @rose00 @DamonFool @JohnTortugo
I really understand your concern and agree. Frankly, I attempted to seek for a common solution to cover a wider range but failed. But I will give a more try to figure it out if we can make this transformation not that specific and more general, before continuing proceeding with this patch. Thanks.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7954
More information about the hotspot-compiler-dev
mailing list