RFR: 8283091: Support type conversion between different data sizes in SLP [v6]

Sandhya Viswanathan sviswanathan at openjdk.java.net
Fri Jun 3 00:44:32 UTC 2022


On Thu, 2 Jun 2022 23:59:21 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits:
>> 
>>  - Implement an interface for auto-vectorization to consult supported match rules
>>    
>>    Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701
>>  - Merge branch 'master' into fg8283091
>>    
>>    Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd
>>  - Merge branch 'master' into fg8283091
>>    
>>    Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
>>  - Merge branch 'master' into fg8283091
>>    
>>    Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
>>  - Add micro-benchmark cases
>>    
>>    Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
>>  - Merge branch 'master' into fg8283091
>>    
>>    Change-Id: I674581135fd0844accc65520574fcef161eededa
>>  - 8283091: Support type conversion between different data sizes in SLP
>>    
>>    After JDK-8275317, C2's SLP vectorizer supports type conversion
>>    between types of the same data size. This patch extends it to
>>    conversions between different data sizes, such as:
>>    int <-> double
>>    float <-> long
>>    int <-> long
>>    float <-> double
>>    
>>    A typical test case:
>>    
>>    int[] a;
>>    double[] b;
>>    for (int i = start; i < limit; i++) {
>>        b[i] = (double) a[i];
>>    }
>>    
>>    The expected OptoAssembly code for one iteration is shown below:
>>    
>>    add R12, R2, R11, LShiftL #2
>>    vector_load   V16,[R12, #16]
>>    vectorcast_i2d  V16, V16  # convert I to D vector
>>    add R11, R1, R11, LShiftL #3	# ptr
>>    add R13, R11, #16	# ptr
>>    vector_store [R13], V16
>>    
>>    To enable the vectorization, the patch solves the following problems
>>    in SLP.
>>    
>>    There are three main operations in the case above: LoadI, ConvI2D
>>    and StoreD. Assuming the vector length is 128 bits, how many scalar
>>    nodes should be packed together into a vector? If we decide this
>>    separately for each operation node, as SuperWord::combine_packs()
>>    did before the patch, a 128-bit vector holds 4 LoadI, 2 ConvI2D or
>>    2 StoreD nodes. However, chaining such packs into a vector node
>>    sequence, loading 4 elements, then typecasting 2 elements and
>>    finally storing those 2 elements, produces an invalid sequence.
>>    Instead, we should look through the whole def-use chain and pick
>>    the minimum element count along it, as the function
>>    SuperWord::max_vector_size_in_ud_chain() in superword.cpp does. In
>>    this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, which
>>    yields a valid vector node sequence: load 2 elements, convert the
>>    2 elements to the other type, and store the 2 elements with the
>>    new type.
>>    
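>>    As a standalone illustration of picking the lane count (a
>>    simplified sketch with a hypothetical helper name, not the actual
>>    HotSpot code), the count is limited by the largest element size
>>    on the chain:
>>    
>>    #include <algorithm>
>>    #include <cstdio>
>>    #include <vector>
>>    
>>    // Hypothetical sketch: scalars packed per vector = vector width
>>    // divided by the largest element size on the def-use chain.
>>    static int max_elements_in_chain(int vector_width_bytes,
>>                                     const std::vector<int>& elem_sizes) {
>>      int max_size = 1;
>>      for (int s : elem_sizes) {
>>        max_size = std::max(max_size, s);
>>      }
>>      return vector_width_bytes / max_size;
>>    }
>>    
>>    int main() {
>>      // LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes) with
>>      // a 128-bit (16-byte) vector: only 2 lanes are usable.
>>      std::printf("%d\n", max_elements_in_chain(16, {4, 8, 8})); // 2
>>      return 0;
>>    }
>>    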
>>    After this, the LoadI nodes no longer make full use of the whole
>>    vector and only occupy part of it, so we adapt the code in
>>    SuperWord::get_vw_bytes_special() to this situation.
>>    
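>>    A minimal sketch of that occupancy calculation (hypothetical
>>    helper, not the real get_vw_bytes_special() code):
>>    
>>    // Bytes occupied by an operation that shares lanes with a wider
>>    // type: used_bytes(2, 4) == 8, i.e. the two LoadI lanes fill only
>>    // 8 of the 16 vector bytes.
>>    static int used_bytes(int lanes, int elem_size_bytes) {
>>      return lanes * elem_size_bytes;
>>    }
>>    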
>>    In SLP, we calculate a kind of alignment as a position trace for
>>    each scalar node within the whole vector. In this case, the
>>    alignments for the two LoadI nodes are 0 and 4, while the
>>    alignments for the two ConvI2D nodes are 0 and 8. Here, 4 for
>>    LoadI and 8 for ConvI2D mean the same thing: each marks the second
>>    lane of the vector, and the numeric difference comes only from the
>>    nodes' own data sizes. In such situations, SLP should remove the
>>    impact of the differing data sizes. For example, in the stage of
>>    SuperWord::extend_packlist(), when deciding in
>>    SuperWord::follow_use_defs() whether a pair of def nodes can be
>>    packed, we remove the effect of the differing data sizes by
>>    rescaling the target alignment derived from the use node. The
>>    rationale: assuming the vector length is 512 bits, if the ConvI2D
>>    use nodes have alignments of 16 and 24 and their LoadI def nodes
>>    have alignments of 8 and 12, the two LoadI nodes should be packed
>>    as a pair as well.
>>    
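>>    A minimal sketch of that rescaling (hypothetical helper, not the
>>    actual follow_use_defs() code): the lane index is the alignment
>>    divided by the element size, re-expressed in the def type's units:
>>    
>>    // Hypothetical sketch: translate a use-side alignment into the
>>    // def type's units while preserving the lane index.
>>    static int def_alignment(int use_alignment, int use_elem_size,
>>                             int def_elem_size) {
>>      return use_alignment / use_elem_size * def_elem_size;
>>    }
>>    
>>    // def_alignment(16, 8, 4) == 8 and def_alignment(24, 8, 4) == 12,
>>    // matching the 512-bit ConvI2D/LoadI example above.
>>    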
>>    Similarly, when determining whether the vectorization is
>>    profitable, a type conversion between different data sizes takes a
>>    type of one size and produces a type of another size, so special
>>    checks on alignment and size must be applied, as done in
>>    SuperWord::is_vector_use().
>>    
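>>    As a hedged sketch of such a check (hypothetical helper, not the
>>    real is_vector_use() logic): the def pack feeding a converting use
>>    pack must have the same number of lanes, and corresponding lanes
>>    must line up once alignments are normalized by element size:
>>    
>>    #include <cstddef>
>>    #include <vector>
>>    
>>    // Hypothetical sketch: lane i of the def pack must correspond to
>>    // lane i of the use pack even though byte alignments differ.
>>    static bool lanes_match(const std::vector<int>& use_aligns, int use_size,
>>                            const std::vector<int>& def_aligns, int def_size) {
>>      if (use_aligns.size() != def_aligns.size()) return false;
>>      for (std::size_t i = 0; i < use_aligns.size(); i++) {
>>        if (use_aligns[i] / use_size != def_aligns[i] / def_size) return false;
>>      }
>>      return true;
>>    }
>>    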
>>    After solving these problems, we successfully implemented the
>>    vectorization of type conversion between different data sizes.
>>    
>>    Here is the test data on NEON:
>>    
>>    Before the patch:
>>    Benchmark              (length)  Mode  Cnt    Score   Error  Units
>>      VectorLoop.convertD2F       523  avgt   15  216.431 ± 0.131  ns/op
>>      VectorLoop.convertD2I       523  avgt   15  220.522 ± 0.311  ns/op
>>      VectorLoop.convertF2D       523  avgt   15  217.034 ± 0.292  ns/op
>>      VectorLoop.convertF2L       523  avgt   15  231.634 ± 1.881  ns/op
>>      VectorLoop.convertI2D       523  avgt   15  229.538 ± 0.095  ns/op
>>      VectorLoop.convertI2L       523  avgt   15  214.822 ± 0.131  ns/op
>>      VectorLoop.convertL2F       523  avgt   15  230.188 ± 0.217  ns/op
>>      VectorLoop.convertL2I       523  avgt   15  162.234 ± 0.235  ns/op
>>    
>>    After the patch:
>>    Benchmark              (length)  Mode  Cnt    Score    Error  Units
>>      VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>>      VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>>      VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>>      VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>>      VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>>      VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>>      VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>>      VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>>    
>>    perf data on X86:
>>    Before the patch:
>>    Benchmark              (length)  Mode  Cnt    Score   Error  Units
>>      VectorLoop.convertD2F       523  avgt   15  279.466 ± 0.069  ns/op
>>      VectorLoop.convertD2I       523  avgt   15  551.009 ± 7.459  ns/op
>>      VectorLoop.convertF2D       523  avgt   15  276.066 ± 0.117  ns/op
>>      VectorLoop.convertF2L       523  avgt   15  545.108 ± 5.697  ns/op
>>      VectorLoop.convertI2D       523  avgt   15  745.303 ± 0.185  ns/op
>>      VectorLoop.convertI2L       523  avgt   15  260.878 ± 0.044  ns/op
>>      VectorLoop.convertL2F       523  avgt   15  502.016 ± 0.172  ns/op
>>      VectorLoop.convertL2I       523  avgt   15  261.654 ± 3.326  ns/op
>>    
>>    After the patch:
>>    Benchmark              (length)  Mode  Cnt    Score   Error  Units
>>      VectorLoop.convertD2F       523  avgt   15  106.975 ± 0.045  ns/op
>>      VectorLoop.convertD2I       523  avgt   15  546.866 ± 9.287  ns/op
>>      VectorLoop.convertF2D       523  avgt   15   82.414 ± 0.340  ns/op
>>      VectorLoop.convertF2L       523  avgt   15  542.235 ± 2.785  ns/op
>>      VectorLoop.convertI2D       523  avgt   15   92.966 ± 1.400  ns/op
>>      VectorLoop.convertI2L       523  avgt   15   79.960 ± 0.528  ns/op
>>      VectorLoop.convertL2F       523  avgt   15  504.712 ± 4.794  ns/op
>>      VectorLoop.convertL2I       523  avgt   15  129.753 ± 0.094  ns/op
>>    
>>    perf data on AVX512:
>>    Before the patch:
>>    Benchmark              (length)  Mode  Cnt    Score   Error  Units
>>      VectorLoop.convertD2F       523  avgt   15  282.984 ± 4.022  ns/op
>>      VectorLoop.convertD2I       523  avgt   15  543.080 ± 3.873  ns/op
>>      VectorLoop.convertF2D       523  avgt   15  273.950 ± 0.131  ns/op
>>      VectorLoop.convertF2L       523  avgt   15  539.568 ± 2.747  ns/op
>>      VectorLoop.convertI2D       523  avgt   15  745.238 ± 0.069  ns/op
>>      VectorLoop.convertI2L       523  avgt   15  260.935 ± 0.169  ns/op
>>      VectorLoop.convertL2F       523  avgt   15  501.870 ± 0.359  ns/op
>>      VectorLoop.convertL2I       523  avgt   15  257.508 ± 0.174  ns/op
>>    
>>    After the patch:
>>    Benchmark              (length)  Mode  Cnt    Score   Error  Units
>>      VectorLoop.convertD2F       523  avgt   15   76.687 ± 0.530  ns/op
>>      VectorLoop.convertD2I       523  avgt   15  545.408 ± 4.657  ns/op
>>      VectorLoop.convertF2D       523  avgt   15  273.935 ± 0.099  ns/op
>>      VectorLoop.convertF2L       523  avgt   15  540.534 ± 3.032  ns/op
>>      VectorLoop.convertI2D       523  avgt   15  745.234 ± 0.053  ns/op
>>      VectorLoop.convertI2L       523  avgt   15  260.865 ± 0.104  ns/op
>>      VectorLoop.convertL2F       523  avgt   15   63.834 ± 4.777  ns/op
>>      VectorLoop.convertL2I       523  avgt   15   48.183 ± 0.990  ns/op
>>    
>>    Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
>
> src/hotspot/share/opto/vectornode.cpp line 258:
> 
>> 256:     return Op_VectorCastF2X;
>> 257:   case Op_ConvD2L:
>> 258:     return Op_VectorCastD2X;
> 
> Why you removed these lines?

Yes, removing these seems like a wrong step. For x86, we do generate code for the VectorCastI2X, VectorCastL2X, VectorCastF2X and VectorCastD2X nodes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

