RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3]
Fei Gao
fgao at openjdk.java.net
Mon May 30 11:44:39 UTC 2022
On Fri, 27 May 2022 15:32:17 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>>
>> - Merge branch 'master' into fg8282470
>>
>> Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b
>> - Merge branch 'master' into fg8282470
>>
>> Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7
>> - 8282470: Eliminate useless sign extension before some subword integer operations
>>
>> Some loop cases of subword types, including byte and
>> short, can't be vectorized by C2's SLP. Here is an example:
>> ```
>> short[] addShort(short[] a, short[] b, short[] c) {
>> for (int i = 0; i < SIZE; i++) {
>> b[i] = (short) (a[i] + 8); // *line A*
>> sres[i] = (short) (b[i] + c[i]); // *line B*
>> }
>> }
>> ```
>> However, similar cases of int/float/double/long/char type can
>> be vectorized successfully.
>>
>> The reason why SLP can't vectorize the short case above is
>> that, as illustrated here[1], the result of the scalar add
>> operation on *line A* has been promoted to int type. It needs
>> to be narrowed to short type first before it can work as one
>> of source operands of addition on *line B*. The demotion is
>> done by left-shifting 16 bits then right-shifting 16 bits.
>> The ideal graph for the process is showed like below.
>>
>> LoadS a[i] 8
>> \ /
>> AddI (line A)
>> / \
>> StoreC b[i] Lshift 16bits
>> \
>> RShiftI 16 bits LoadS c[i]
>> \ /
>> AddI (line B)
>> \
>> StoreC sres[i]
>>
>> In SLP, for most short-type cases, we can determine the precise
>> type of the scalar int-type operation and finally execute it
>> with short-type vector operations[2], except rshift opcode and
>> abs in some situations[3]. But in this case, the source operand
>> of RShiftI is from LShiftI rather than from any LoadS[4], so we
>> can't determine its real type and conservatively assign it with
>> int type rather than real short type. The int-type opearation
>> RShiftI here can't be vectorized together with other short-type
>> operations, like AddI(line B). The reason for byte loop cases
>> is the same. Similar loop cases of char type could be
>> vectorized because its demotion from int to char is done by
>> `and` with mask rather than `lshift_rshift`.
>>
>> Therefore, we try to remove the patterns like
>> `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
>> cases, to vectorize more scenarios. Optimizing it in the
>> mid-end by i-GVN is more reasonable.
>>
>> What we do in the mid-end is eliminating the sign extension
>> before some subword integer operations like:
>>
>> ```
>> int x, y;
>> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>> ```
>> to
>> ```
>> short s = (short) (x OP y);
>> ```
>>
>> In the patch, assuming that `x` can be any int number, we need
>> guarantee that the optimization doesn't have any impact on
>> result. Not all arithmetic logic OPs meet the requirements. For
>> example, assuming that `Imm` equals `16`, `x` equals `131068`,
>> `y` equals `50` and `OP` is division`/`,
>> `short s = (short) (((131068 << 16) >> 16) / 50)` is not
>> equal to `short s = (short) (131068 / 50)`. When OP is division,
>> we may get different result with or without demotion
>> before OP, because the upper 16 bits of division may have
>> influence on the lower 16 bits of result, which can't be
>> optimized. All optimizable opcodes are listed in
>> StoreNode::no_need_sign_extension(), whose upper 16 bits of src
>> operands don't influence the lower 16 bits of result for short
>> type and upper 24 bits of src operand don't influence the lower
>> 8 bits of dst operand for byte.
>>
>> After the patch, the short loop case above can be vectorized as:
>> ```
>> movi v18.8h, #0x8
>> ...
>> ldr q16, [x14, #32] // vector load a[i]
>> // vector add, a[i] + 8, no promotion or demotion
>> add v17.8h, v16.8h, v18.8h
>> str q17, [x6, #32] // vector store a[i] + 8, b[i]
>> ldr q17, [x0, #32] // vector load c[i]
>> // vector add, a[i] + c[i], no promotion or demotion
>> add v16.8h, v17.8h, v16.8h
>> // vector add, a[i] + c[i] + 8, no promotion or demotion
>> add v16.8h, v16.8h, v18.8h
>> str q16, [x11, #32] //vector store sres[i]
>> ...
>> ```
>>
>> The patch works for byte cases as well.
>>
>> Here is the performance data for micro-benchmark before
>> and after this patch on both AArch64 and x64 machines.
>> We can observe about ~83% improvement with this patch.
>>
>> on AArch64:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 401.521 ± 0.033 ns/op
>> addS 523 avgt 15 401.512 ± 0.021 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 68.444 ± 0.318 ns/op
>> addS 523 avgt 15 69.847 ± 0.043 ns/op
>>
>> on x86:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 454.102 ± 36.180 ns/op
>> addS 523 avgt 15 432.245 ± 22.640 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> addB 523 avgt 15 75.812 ± 5.063 ns/op
>> addS 523 avgt 15 72.839 ± 10.109 ns/op
>>
>> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
>> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
>> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
>> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>>
>> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31
>
> Testing tier1-4 passed.
Thanks a lot for your kind review and also for your test work, @vnkozlov .
-------------
PR: https://git.openjdk.java.net/jdk/pull/7954
More information about the hotspot-compiler-dev
mailing list