RFR: 8283091: Support type conversion between different data sizes in SLP [v2]

Fei Gao fgao at openjdk.java.net
Mon Mar 14 08:33:26 UTC 2022


> Since JDK-8275317, C2's SLP vectorizer has supported type conversions between types of the same data size. We can also support conversions between types of different data sizes, such as:
> int <-> double
> float <-> long
> int <-> long
> float <-> double
> 
> A typical test case:
> 
> int[] a;
> double[] b;
> for (int i = start; i < limit; i++) {
>     b[i] = (double) a[i];
> }
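
For anyone who wants to try the case locally, the loop above could be wrapped in a small standalone class like the one below. This is only a sketch: the class name `ConvI2DLoop`, the method name `convert`, the fill values and the iteration count are placeholders chosen here and are not taken from the patch or its JMH benchmark. Whether C2 actually vectorizes the loop depends on the JVM build, the CPU and the flags in use.

```java
// Hypothetical standalone wrapper around the test loop; all names are illustrative.
class ConvI2DLoop {
    // Widens each int in a[start, limit) to a double and stores it in b.
    static void convert(int[] a, double[] b, int start, int limit) {
        for (int i = start; i < limit; i++) {
            b[i] = (double) a[i];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[523];
        double[] b = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            a[i] = i - 100;
        }
        // Repeat the call so the JIT has a chance to compile the loop with C2.
        for (int iter = 0; iter < 20_000; iter++) {
            convert(a, b, 0, a.length);
        }
        System.out.println(b[0] + " " + b[a.length - 1]);
    }
}
```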
> 
> The expected OptoAssembly code for one iteration looks like this:
> 
> add R12, R2, R11, LShiftL #2
> vector_load   V16,[R12, #16]
> vectorcast_i2d  V16, V16  # convert I to D vector
> add R11, R1, R11, LShiftL #3	# ptr
> add R13, R11, #16	# ptr
> vector_store [R13], V16
> 
> To enable this vectorization, the patch solves the following problems in SLP.
> 
> There are three main operations in the case above: LoadI, ConvI2D and StoreD. Assuming a vector length of 128 bits, how many scalar nodes should be packed together into a vector? If we decide this separately for each operation node, as SuperWord::combine_packs() did before the patch, a 128-bit vector holds 4 LoadI, 2 ConvI2D or 2 StoreD nodes. However, a vector node sequence built from these packs, loading 4 elements into a vector, then converting 2 elements and finally storing those 2 elements, is invalid. Instead, we should look through the whole def-use chain and pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() in superword.cpp does. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes and generate a valid vector node sequence: load 2 elements, convert the 2 elements to another type, and store the 2 elements with the new type.
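
The pack-size arithmetic described above can be modelled roughly as follows. This is an illustrative model in Java, not the HotSpot C++ code: the class `PackSizeSketch` and the methods `lanes` and `packSize` are made-up names, and the model assumes each node is characterised only by its element size in bytes.

```java
// Illustrative model only; this is not the HotSpot implementation.
class PackSizeSketch {
    // Number of scalar elements of the given byte size that fit in one vector.
    static int lanes(int vectorBytes, int elemBytes) {
        return vectorBytes / elemBytes;
    }

    // Common pack size for a def-use chain: the smallest lane count among its nodes.
    static int packSize(int vectorBytes, int... elemBytesInChain) {
        int min = Integer.MAX_VALUE;
        for (int elemBytes : elemBytesInChain) {
            min = Math.min(min, lanes(vectorBytes, elemBytes));
        }
        return min;
    }

    public static void main(String[] args) {
        int vectorBytes = 16;                    // 128-bit vector
        int n = packSize(vectorBytes, 4, 8, 8);  // LoadI, ConvI2D, StoreD
        System.out.println("pack size = " + n);
        // The LoadI pack then covers only part of the vector.
        System.out.println("bytes used by LoadI pack = " + n * 4);
    }
}
```

With a 128-bit vector the model yields a pack size of 2, so the two packed LoadI nodes cover 8 of the 16 bytes.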
> 
> After this, the LoadI nodes no longer fill the whole vector and occupy only part of it, so we adapt the code in SuperWord::get_vw_bytes_special() to handle this situation.
> 
> In SLP, we calculate an alignment for each scalar node that records its position within the whole vector. In this case, the alignments of the 2 LoadI nodes are 0 and 4, while the alignments of the 2 ConvI2D nodes are 0 and 8. In such cases, 4 for LoadI and 8 for ConvI2D mean the same thing: both mark the node as the second node in the vector, and the difference between 4 and 8 comes only from the nodes' own data sizes. In this situation, we should remove the impact of the different data sizes in SLP. For example, in SuperWord::extend_packlist(), when the function SuperWord::follow_use_defs() determines whether a pair of def nodes can be packed, we remove the effect of the different data sizes by transforming the target alignment taken from the use node. The reasoning is that, assuming a vector length of 512 bits, if the ConvI2D use nodes have alignments of 16 and 24 and their def nodes, LoadI, have alignments of 8 and 12, these two LoadI nodes should be packed as a pair as well.
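
One way to picture the alignment transformation is as a rescaling from the use node's element size to the def node's element size. The sketch below is a hypothetical model in Java, not the code in superword.cpp; `AlignmentScaleSketch` and `toDefAlignment` are names invented for this illustration, and the model assumes alignments are byte offsets that are multiples of the element size.

```java
// Illustrative model only; names and structure are assumptions, not HotSpot code.
class AlignmentScaleSketch {
    // Converts a use node's byte alignment into the byte alignment expected
    // of its def node, by going through the element position in the vector.
    static int toDefAlignment(int useAlignment, int useBytes, int defBytes) {
        int position = useAlignment / useBytes;  // index of the element in the vector
        return position * defBytes;
    }

    public static void main(String[] args) {
        // ConvI2D uses (8-byte results) at alignments 16 and 24,
        // fed by LoadI defs (4-byte results).
        System.out.println(toDefAlignment(16, 8, 4));
        System.out.println(toDefAlignment(24, 8, 4));
    }
}
```

Under these assumptions the use alignments 16 and 24 map to def alignments 8 and 12, which matches the 512-bit example in the paragraph above.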
> 
> Similarly, when determining whether the vectorization is profitable, a type conversion between different data sizes takes a type of one size and produces a type of another size, so special checks on alignment and size are needed, as done in SuperWord::is_vector_use().
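
A size-aware check of this kind might compare element positions rather than raw byte alignments. The following Java sketch is a guess at the shape of such a check and is not the logic of SuperWord::is_vector_use(); `VectorUseSketch` and `lanesMatch` are hypothetical names, and packs are represented simply as arrays of byte alignments.

```java
// Illustrative model only; this does not reproduce the HotSpot check.
class VectorUseSketch {
    // Returns true if the def pack and the use pack have the same number of
    // elements and each def/use pair sits at the same position in its vector.
    static boolean lanesMatch(int[] defAlign, int defBytes, int[] useAlign, int useBytes) {
        if (defAlign.length != useAlign.length) {
            return false;
        }
        for (int i = 0; i < defAlign.length; i++) {
            if (defAlign[i] / defBytes != useAlign[i] / useBytes) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 2 LoadI (4 bytes each) feeding 2 ConvI2D (8 bytes each).
        System.out.println(lanesMatch(new int[] {0, 4}, 4, new int[] {0, 8}, 8));
        // 4 LoadI feeding 2 ConvI2D: the pack sizes differ.
        System.out.println(lanesMatch(new int[] {0, 4, 8, 12}, 4, new int[] {0, 8}, 8));
    }
}
```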
> 
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
> 
> Here is the test data on NEON:
> 
> Before the patch:
> Benchmark              (length)  Mode  Cnt    Score   Error  Units
>   VectorLoop.convertD2F       523  avgt   15  216.431 ± 0.131  ns/op
>   VectorLoop.convertD2I       523  avgt   15  220.522 ± 0.311  ns/op
>   VectorLoop.convertF2D       523  avgt   15  217.034 ± 0.292  ns/op
>   VectorLoop.convertF2L       523  avgt   15  231.634 ± 1.881  ns/op
>   VectorLoop.convertI2D       523  avgt   15  229.538 ± 0.095  ns/op
>   VectorLoop.convertI2L       523  avgt   15  214.822 ± 0.131  ns/op
>   VectorLoop.convertL2F       523  avgt   15  230.188 ± 0.217  ns/op
>   VectorLoop.convertL2I       523  avgt   15  162.234 ± 0.235  ns/op
> 
> After the patch:
> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>   VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>   VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>   VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>   VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>   VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>   VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>   VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>   VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
> 
> perf data on X86:
> Before the patch:
> Benchmark              (length)  Mode  Cnt    Score   Error  Units
>   VectorLoop.convertD2F       523  avgt   15  279.466 ± 0.069  ns/op
>   VectorLoop.convertD2I       523  avgt   15  551.009 ± 7.459  ns/op
>   VectorLoop.convertF2D       523  avgt   15  276.066 ± 0.117  ns/op
>   VectorLoop.convertF2L       523  avgt   15  545.108 ± 5.697  ns/op
>   VectorLoop.convertI2D       523  avgt   15  745.303 ± 0.185  ns/op
>   VectorLoop.convertI2L       523  avgt   15  260.878 ± 0.044  ns/op
>   VectorLoop.convertL2F       523  avgt   15  502.016 ± 0.172  ns/op
>   VectorLoop.convertL2I       523  avgt   15  261.654 ± 3.326  ns/op
> 
> After the patch:
> Benchmark              (length)  Mode  Cnt    Score   Error  Units
>   VectorLoop.convertD2F       523  avgt   15  106.975 ± 0.045  ns/op
>   VectorLoop.convertD2I       523  avgt   15  546.866 ± 9.287  ns/op
>   VectorLoop.convertF2D       523  avgt   15   82.414 ± 0.340  ns/op
>   VectorLoop.convertF2L       523  avgt   15  542.235 ± 2.785  ns/op
>   VectorLoop.convertI2D       523  avgt   15   92.966 ± 1.400  ns/op
>   VectorLoop.convertI2L       523  avgt   15   79.960 ± 0.528  ns/op
>   VectorLoop.convertL2F       523  avgt   15  504.712 ± 4.794  ns/op
>   VectorLoop.convertL2I       523  avgt   15  129.753 ± 0.094  ns/op
> 
> perf data on AVX512:
> Before the patch:
> Benchmark              (length)  Mode  Cnt    Score   Error  Units
>   VectorLoop.convertD2F       523  avgt   15  282.984 ± 4.022  ns/op
>   VectorLoop.convertD2I       523  avgt   15  543.080 ± 3.873  ns/op
>   VectorLoop.convertF2D       523  avgt   15  273.950 ± 0.131  ns/op
>   VectorLoop.convertF2L       523  avgt   15  539.568 ± 2.747  ns/op
>   VectorLoop.convertI2D       523  avgt   15  745.238 ± 0.069  ns/op
>   VectorLoop.convertI2L       523  avgt   15  260.935 ± 0.169  ns/op
>   VectorLoop.convertL2F       523  avgt   15  501.870 ± 0.359  ns/op
>   VectorLoop.convertL2I       523  avgt   15  257.508 ± 0.174  ns/op
> 
> After the patch:
> Benchmark              (length)  Mode  Cnt    Score   Error  Units
>   VectorLoop.convertD2F       523  avgt   15   76.687 ± 0.530  ns/op
>   VectorLoop.convertD2I       523  avgt   15  545.408 ± 4.657  ns/op
>   VectorLoop.convertF2D       523  avgt   15  273.935 ± 0.099  ns/op
>   VectorLoop.convertF2L       523  avgt   15  540.534 ± 3.032  ns/op
>   VectorLoop.convertI2D       523  avgt   15  745.234 ± 0.053  ns/op
>   VectorLoop.convertI2L       523  avgt   15  260.865 ± 0.104  ns/op
>   VectorLoop.convertL2F       523  avgt   15   63.834 ± 4.777  ns/op
>   VectorLoop.convertL2I       523  avgt   15   48.183 ± 0.990  ns/op

Fei Gao has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:

  8283091: Support type conversion between different data sizes in SLP
  
  
  Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7806/files
  - new: https://git.openjdk.java.net/jdk/pull/7806/files/c6d0716e..c2c13739

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=00-01

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7806.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806

PR: https://git.openjdk.java.net/jdk/pull/7806

