RFR: 8283091: Support type conversion between different data sizes in SLP [v7]
Fei Gao
fgao at openjdk.java.net
Mon Jun 6 14:02:49 UTC 2022
On Mon, 6 Jun 2022 13:32:57 GMT, Fei Gao <fgao at openjdk.org> wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits:
>>
>> - Add assertion line for opcode() and extract some common code into a function
>>
>> Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6
>> - Implement an interface for auto-vectorization to consult supported match rules
>>
>> Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
>> - Add micro-benchmark cases
>>
>> Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I674581135fd0844accc65520574fcef161eededa
>> - 8283091: Support type conversion between different data sizes in SLP
>>
>> After JDK-8275317, C2's SLP vectorizer supports type conversion
>> between types of the same data size. We can also support conversions
>> between different data sizes, like:
>> int <-> double
>> float <-> long
>> int <-> long
>> float <-> double
>>
>> A typical test case:
>>
>> int[] a;
>> double[] b;
>> for (int i = start; i < limit; i++) {
>>     b[i] = (double) a[i];
>> }
>>
>> The expected OptoAssembly code for one iteration is shown below:
>>
>> add R12, R2, R11, LShiftL #2
>> vector_load V16,[R12, #16]
>> vectorcast_i2d V16, V16 # convert I to D vector
>> add R11, R1, R11, LShiftL #3 # ptr
>> add R13, R11, #16 # ptr
>> vector_store [R13], V16
>>
>> To enable this vectorization, the patch solves the following problems
>> in the SLP vectorizer.
>>
>> There are three main operations in the case above: LoadI, ConvI2D and
>> StoreD. Assuming that the vector length is 128 bits, how many scalar
>> nodes should be packed together into a vector? If we decide separately
>> for each operation node, as SuperWord::combine_packs() did before the
>> patch, a 128-bit vector can hold 4 LoadI, 2 ConvI2D or 2 StoreD nodes.
>> However, if we chain these packed nodes into one vector node sequence,
>> like loading 4 elements into a vector, then typecasting 2 elements and
>> finally storing those 2 elements, the sequence becomes invalid. As a
>> result, we should look through the whole def-use chain and pick the
>> minimum of these element counts, as the function
>> SuperWord::max_vector_size_in_ud_chain() in superword.cpp does. In
>> this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
>> generate a valid vector node sequence: loading 2 elements, converting
>> the 2 elements to another type, and storing the 2 elements with the
>> new type.
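>>
>> As a rough sketch of that idea (illustrative only: the helper below is
>> invented for this note, not the actual superword.cpp code):
>>
>> // Illustrative sketch only.
>> class PackSizeSketch {
>>     // The lane count every node in the def-use chain can live with is
>>     // the vector width divided by the widest element type in the chain,
>>     // which equals the minimum of the per-node element counts.
>>     static int maxLanesForChain(int vectorWidthBytes, int[] elemSizes) {
>>         int widest = 1;
>>         for (int s : elemSizes) widest = Math.max(widest, s);
>>         return vectorWidthBytes / widest;
>>     }
>>
>>     public static void main(String[] args) {
>>         // LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes) with
>>         // a 128-bit (16-byte) vector: 16 / 8 = 2 lanes, so pack 2 each.
>>         System.out.println(maxLanesForChain(16, new int[] {4, 8, 8}));
>>     }
>> }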
>>
>> After this, the LoadI nodes no longer make full use of the vector and
>> occupy only part of it, so we adapt the code in
>> SuperWord::get_vw_bytes_special() to this situation.
>>
>> In SLP, we calculate an alignment for each scalar node as a trace of
>> its position in the whole vector. In this case, the alignments of the
>> 2 LoadI nodes are 0 and 4, while the alignments of the 2 ConvI2D nodes
>> are 0 and 8. Here, 4 for LoadI and 8 for ConvI2D mean the same thing:
>> both mark the node as the second one in the whole vector, and the
>> difference between 4 and 8 comes only from their different data sizes.
>> In this situation, SLP should remove the impact of the differing data
>> sizes. For example, in the SuperWord::extend_packlist() stage, when
>> SuperWord::follow_use_defs() determines whether a pair of def nodes
>> can be packed, we remove the side effect of the different data sizes
>> by scaling the target alignment taken from the use node. The reasoning
>> is that, assuming the vector length is 512 bits, if the ConvI2D use
>> nodes have alignments of 16 and 24 and their def nodes, LoadI, have
>> alignments of 8 and 12, then these two LoadI nodes should be packed as
>> a pair as well.
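>>
>> Concretely, the scaling can be pictured like this (a hypothetical
>> helper for illustration, not the actual HotSpot code):
>>
>> // Illustrative sketch only.
>> class AlignmentSketch {
>>     // Map a use node's alignment back to the matching def alignment:
>>     // recover the lane index, then re-apply the def's element size.
>>     static int defAlignment(int useAlign, int useElemSize, int defElemSize) {
>>         return (useAlign / useElemSize) * defElemSize;
>>     }
>>
>>     public static void main(String[] args) {
>>         // ConvI2D alignments 16 and 24 (8-byte lanes) are lanes 2 and 3,
>>         // so their LoadI defs get alignments 8 and 12, as above.
>>         System.out.println(defAlignment(16, 8, 4)); // 8
>>         System.out.println(defAlignment(24, 8, 4)); // 12
>>     }
>> }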
>>
>> Similarly, when determining whether the vectorization is profitable,
>> note that a type conversion between different data sizes takes a type
>> of one size and produces a type of another size, so special checks on
>> alignment and size must be applied, as in SuperWord::is_vector_use().
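>>
>> A hedged sketch of what such a check verifies (the real logic lives in
>> SuperWord::is_vector_use(); the names below are invented):
>>
>> // Illustrative sketch only.
>> class VectorUseSketch {
>>     // The def pack and use pack must have the same lane count, and each
>>     // def must sit in the same lane as its use despite the size change.
>>     static boolean lanesMatch(int[] useAligns, int useElemSize,
>>                               int[] defAligns, int defElemSize) {
>>         if (useAligns.length != defAligns.length) return false;
>>         for (int i = 0; i < useAligns.length; i++) {
>>             if (useAligns[i] / useElemSize != defAligns[i] / defElemSize) {
>>                 return false;
>>             }
>>         }
>>         return true;
>>     }
>> }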
>>
>> After solving these problems, we can successfully vectorize type
>> conversions between different data sizes.
>>
>> Here is the test data on NEON:
>>
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
>> VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
>> VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
>> VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
>> VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
>> VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
>> VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
>> VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
>> VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
>> VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
>> VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
>> VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
>> VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
>> VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
>> VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
>> VectorLoop.convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
>> VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
>> VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
>> VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
>> VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
>> VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
>> VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
>> VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
>> VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
>> VectorLoop.convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
>> VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
>> VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
>> VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
>> VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
>> VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
>> VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
>> VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
>> VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
>> VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
>> VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
>> VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
>> VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
>> VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
>> VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
>> VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
>> VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
>> VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
>> VectorLoop.convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
>>
>> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
>
>> // And exclude cases which are not profitable to auto-vectorize.
>
> Done.
>
>> Put it into a separate function because this code pattern is used twice.
>
> Done.
>
>> Maybe we should have an assert here to make sure that in all places we call `VectorCastNode::opcode()` for `Conv*` nodes
>
> Done.
>
> Addressed the comments above and rebased onto the latest JDK. All jtreg tests passed.
>
> Thanks.
> @fg1417 Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See [OpenJDK Developers’ Guide](https://openjdk.java.net/guide/#working-with-pull-requests) for more information.
May I ask if I did anything wrong? I just rebased onto master, resolved the conflicts and pushed a new commit as the guide suggests... and did not do any force-push... Why did I get the notification this time?
-------------
PR: https://git.openjdk.java.net/jdk/pull/7806