RFR: 8283091: Support type conversion between different data sizes in SLP [v5]
Fei Gao
fgao at openjdk.java.net
Thu May 12 07:11:20 UTC 2022
> After JDK-8275317, C2's SLP vectorizer supports type conversion between types of the same data size. We can also support conversions between different data sizes, such as:
> int <-> double
> float <-> long
> int <-> long
> float <-> double
>
> A typical test case:
>
> int[] a;
> double[] b;
> for (int i = start; i < limit; i++) {
> b[i] = (double) a[i];
> }
>
> Our expected OptoAssembly code for one iteration is shown below:
>
> add R12, R2, R11, LShiftL #2
> vector_load V16,[R12, #16]
> vectorcast_i2d V16, V16 # convert I to D vector
> add R11, R1, R11, LShiftL #3 # ptr
> add R13, R11, #16 # ptr
> vector_store [R13], V16
>
> To enable the vectorization, the patch solves the following problems in SLP.
>
> There are three main operations in the case above: LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together into a vector? If we decide this separately for each operation node, as SuperWord::combine_packs() did before the patch, a 128-bit vector will hold 4 LoadI, 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes into one vector node sequence, such as loading 4 elements into a vector, then converting 2 elements and finally storing those 2 elements, the sequence becomes invalid. As a result, we should look through the whole def-use chain
> and pick the minimum element count among these operations, as the function SuperWord::max_vector_size_in_ud_chain() in superword.cpp does. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence: loading 2 elements, converting the 2 elements to another type, and storing the 2 elements with the new type.
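The lane arithmetic described above can be sketched as follows. This is a minimal, hypothetical illustration (the class and helper names are mine, not HotSpot's): it computes how many lanes a 128-bit vector holds for each operation's element size and takes the minimum across the chain.

```java
// Hypothetical sketch of the pack-size decision for LoadI -> ConvI2D -> StoreD.
public class PackSize {
    static final int VECTOR_BITS = 128;

    // Number of lanes a 128-bit vector can hold for a given element size in bytes.
    static int lanes(int elemBytes) {
        return VECTOR_BITS / 8 / elemBytes;
    }

    public static void main(String[] args) {
        int loadI  = lanes(4); // LoadI: 4-byte ints    -> 4 lanes
        int conv   = lanes(8); // ConvI2D: 8-byte doubles -> 2 lanes
        int storeD = lanes(8); // StoreD: 8-byte doubles  -> 2 lanes
        // The whole def-use chain is limited by the widest type.
        int packSize = Math.min(loadI, Math.min(conv, storeD));
        System.out.println(packSize); // 2
    }
}
```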
>
> After this change, LoadI nodes no longer make full use of the whole vector and occupy only part of it, so we adapt the code in SuperWord::get_vw_bytes_special() to this situation.
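To make the partial occupancy concrete, here is a small hypothetical sketch (not the actual HotSpot code) of the byte counts involved once the pack size is capped at 2:

```java
// With a pack size of 2 (limited by the 8-byte doubles), a LoadI pack
// fills only half of a 128-bit vector register.
public class VectorWidth {
    public static void main(String[] args) {
        int packSize    = 2;             // capped by the widest type in the chain
        int loadBytes   = packSize * 4;  // 2 ints -> 8 bytes actually loaded
        int vectorBytes = 128 / 8;       // 16-byte vector register
        System.out.println(loadBytes + " of " + vectorBytes + " bytes used");
    }
}
```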
>
> In SLP, we calculate a kind of alignment as a position trace for each scalar node within the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8. The alignment 4 for LoadI and 8 for ConvI2D mean the same thing: each marks the second node in the whole vector, and the difference between 4 and 8 comes only from their different data sizes. In this situation, we should remove the impact caused by the different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether a pair of def nodes can be packed in the function SuperWord::follow_use_defs(), we remove the side effect of the different data sizes by rescaling the target alignment taken from the use node. We believe that, assuming the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16 and 24 and their def nodes, LoadI, have alignments of 8 and 12, these two LoadI nodes should be packed as a pair as well.
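The rescaling idea can be illustrated with a tiny hypothetical helper (the name and signature are mine, mirroring the idea rather than the exact HotSpot code): divide the use node's alignment by its element size to get the lane index, then multiply by the def node's element size.

```java
public class AlignDemo {
    // Rescale a use node's byte alignment to the def node's data size.
    static int defAlignment(int useAlign, int useSize, int defSize) {
        return useAlign / useSize * defSize;
    }

    public static void main(String[] args) {
        // 512-bit vector: ConvI2D (8-byte elements) uses at alignments 16 and 24
        // map to LoadI (4-byte elements) alignments 8 and 12.
        System.out.println(defAlignment(16, 8, 4)); // 8
        System.out.println(defAlignment(24, 8, 4)); // 12
    }
}
```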
>
> Similarly, when determining whether the vectorization is profitable, a type conversion between different data sizes takes a type of one size and produces a type of another size, so special checks on alignment and size should be applied, as we do in SuperWord::is_vector_use().
>
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>
> Here is the test data (-XX:+UseSuperWord) on NEON:
>
> Before the patch:
> Benchmark (length) Mode Cnt Score Error Units
> convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt Score Error Units
> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark (length) Mode Cnt Score Error Units
> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
> convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt Score Error Units
> convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
> convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
> convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
> convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
> convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
> convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
> convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
> convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark (length) Mode Cnt Score Error Units
> convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
> convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
> convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
> convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
> convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
> convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
> convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
> convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt Score Error Units
> convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
> convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
> convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
> convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
> convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
> convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
> convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
> convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
- Merge branch 'master' into fg8283091
Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
- Merge branch 'master' into fg8283091
Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
- Add micro-benchmark cases
Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
- Merge branch 'master' into fg8283091
Change-Id: I674581135fd0844accc65520574fcef161eededa
- 8283091: Support type conversion between different data sizes in SLP
Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
-------------
Changes: https://git.openjdk.java.net/jdk/pull/7806/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=04
Stats: 1140 lines in 15 files changed: 1092 ins; 13 del; 35 mod
Patch: https://git.openjdk.java.net/jdk/pull/7806.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806
PR: https://git.openjdk.java.net/jdk/pull/7806
More information about the hotspot-compiler-dev
mailing list