RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors [v6]

Xiaohong Gong xgong at openjdk.org
Thu Mar 20 07:13:43 UTC 2025


> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
> 
> Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as the payload. Since the minimum vector lane count for the byte type on AArch64 is 4, rearrange operations inherited that limit. However, the vector shuffle payload has since been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]), which allows us to remove the lane count restriction for vector rearrange operations.
> 
> This patch adds rearrange support for vector types with small lane counts. The main changes are:
>  - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>  - Relocated the NEON implementation from the ad file to the C2 macro assembler file to better handle the more complex code
>  - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
> 
> The following shows the performance improvement of several Vector API JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE. Performance of the same benchmarks with other vector types remains unchanged.
> 
> 1) NEON
> 
> JMH on panama-vector:vectorIntrinsics:
> 
> Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
> Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.060   578.859  7.42x
> Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.332  1811.664  25.05x
> Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.256  1812.344  25.08x
> Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.879   558.797  7.18x
> Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.528  1981.304  28.09x
> Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.735  1994.168  27.79x
> Int64Vector.rearrange         1024  thrpt  30  ops/ms  76.374   562.106  7.36x
> Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  71.680  1190.127  16.60x
> Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  71.895  1185.094  16.48x
> Long128Vector.rearrange       1024  thrpt  30  ops/ms  78.902   579.250  7.34x
> Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  72.389   747.794  10.33x
> Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.999   747.848  10.38x
> 
> 
> JMH on jdk mainline:
> 
> Benchmark ...

Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:

 - Merge branch 'master' into JDK-8350463
 - Use a smaller warmup and array length in IR test
 - Update IR test based on the review comment
 - Merge branch 'jdk:master' into JDK-8350463
 - Add the IR test
 - 8350463: AArch64: Add vector rearrange support for small lane count vectors
   
   The AArch64 vector rearrange implementation currently lacks support for
   vector types with lane counts < 4 (see [1]). This limitation results in
   significant performance gaps when running Long/Double vector benchmarks
   on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to
   other SVE and x86 platforms.
   
   Vector rearrange operations depend on vector shuffle inputs, which
   previously used a byte array as the payload. Since the minimum vector
   lane count for the byte type on AArch64 is 4, rearrange operations
   inherited that limit. However, the vector shuffle payload has since
   been updated to use vector-specific data types (e.g., `int` for
   `IntVector`) (see [2]), which allows us to remove the lane count
   restriction for vector rearrange operations.
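
   As an illustration (not part of the patch), here is the kind of
   Java-level two-lane rearrange that this change allows C2 to compile to
   vector code on AArch64; the species, values and class name are chosen
   purely for the example:
   ```
   // Requires --add-modules jdk.incubator.vector
   import jdk.incubator.vector.DoubleVector;
   import jdk.incubator.vector.VectorShuffle;
   import jdk.incubator.vector.VectorSpecies;

   public class TwoLaneRearrangeExample {
       // A 128-bit species holds 2 double lanes, previously below the
       // minimum lane count accepted by the AArch64 rearrange rules.
       static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;

       public static void main(String[] args) {
           double[] src = {1.0, 2.0};
           DoubleVector v = DoubleVector.fromArray(SPECIES, src, 0);
           // The shuffle {1, 0} swaps the two lanes.
           VectorShuffle<Double> swap = VectorShuffle.fromValues(SPECIES, 1, 0);
           DoubleVector r = v.rearrange(swap);
           System.out.println(java.util.Arrays.toString(r.toArray())); // [2.0, 1.0]
       }
   }
   ```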
   
   This patch adds rearrange support for vector types with small lane
   counts. The main changes are:
    - Added AArch64 match rule support for `VectorRearrange` with smaller
      lane counts (e.g., `2D/2S`)
    - Relocated the NEON implementation from the ad file to the C2 macro
      assembler file to better handle the more complex code (a conceptual
      sketch follows this list)
    - Reduced temporary register usage in the NEON implementation for
      short/int/float types from two registers to one
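
   Conceptually (my reading of the approach, not the actual C2 macro
   assembler code), a NEON rearrange for wide lanes expands each per-lane
   shuffle index into per-byte table indices for a TBL-style byte lookup.
   A rough Java model of that expansion, for illustration only:
   ```
   // Hypothetical helper, not from the patch: expand per-lane shuffle
   // indices (e.g. {1, 0} for a 2D rearrange) into per-byte indices
   // suitable for a byte-granular table lookup such as NEON TBL.
   static byte[] expandToByteIndices(int[] laneIndices, int laneBytes) {
       byte[] byteIndices = new byte[laneIndices.length * laneBytes];
       for (int lane = 0; lane < laneIndices.length; lane++) {
           for (int b = 0; b < laneBytes; b++) {
               // Lane index i selects source bytes [i*laneBytes, i*laneBytes + laneBytes).
               byteIndices[lane * laneBytes + b] =
                       (byte) (laneIndices[lane] * laneBytes + b);
           }
       }
       return byteIndices;
   }
   ```
   For a 2D rearrange, the lane indices {1, 0} expand to the byte indices
   {8..15, 0..7}.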
   
   The following shows the performance improvement of several Vector API
   JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE. Performance of
   the same benchmarks with other vector types remains unchanged.
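
   For reference, a minimal JMH kernel in the spirit of the rearrange
   benchmarks below; the class name, parameters and setup are illustrative
   rather than the actual benchmark sources:
   ```
   import java.util.concurrent.ThreadLocalRandom;
   import jdk.incubator.vector.DoubleVector;
   import jdk.incubator.vector.VectorShuffle;
   import jdk.incubator.vector.VectorSpecies;
   import org.openjdk.jmh.annotations.*;

   @State(Scope.Thread)
   public class RearrangeBenchSketch {
       static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;

       @Param("1024")
       int size;
       double[] a, r;
       int[] idx;

       @Setup
       public void setup() {
           a = new double[size];
           r = new double[size];
           idx = new int[SPECIES.length()];
           for (int i = 0; i < size; i++) a[i] = ThreadLocalRandom.current().nextDouble();
           for (int i = 0; i < idx.length; i++) idx[i] = idx.length - 1 - i; // reverse lanes
       }

       @Benchmark
       public void rearrange() {
           VectorShuffle<Double> shuffle = VectorShuffle.fromArray(SPECIES, idx, 0);
           for (int i = 0; i < size; i += SPECIES.length()) {
               DoubleVector.fromArray(SPECIES, a, i).rearrange(shuffle).intoArray(r, i);
           }
       }
   }
   ```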
   
   1) NEON
   
   JMH on panama-vector:vectorIntrinsics:
   ```
   Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
   Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.060   578.859  7.42x
   Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.332  1811.664  25.05x
   Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.256  1812.344  25.08x
   Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.879   558.797  7.18x
   Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.528  1981.304  28.09x
   Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.735  1994.168  27.79x
   Int64Vector.rearrange         1024  thrpt  30  ops/ms  76.374   562.106  7.36x
   Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  71.680  1190.127  16.60x
   Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  71.895  1185.094  16.48x
   Long128Vector.rearrange       1024  thrpt  30  ops/ms  78.902   579.250  7.34x
   Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  72.389   747.794  10.33x
   Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.999   747.848  10.38x
   ```
   
   JMH on jdk mainline:
   ```
   Benchmark                                     (SIZE) Mode  Cnt  Units   Before   After    Gain
   SelectFromBenchmark.rearrangeFromDoubleVector  1024  thrpt  30  ops/ms  44.593  1319.977  29.63x
   SelectFromBenchmark.rearrangeFromDoubleVector  2048  thrpt  30  ops/ms  22.318   660.061  29.58x
   SelectFromBenchmark.rearrangeFromLongVector    1024  thrpt  30  ops/ms  45.823  1458.144  31.82x
   SelectFromBenchmark.rearrangeFromLongVector    2048  thrpt  30  ops/ms  23.050   729.881  31.67x
   VectorXXH3HashingBenchmark.hashingKernel       1024  thrpt  30  ops/ms  97.210  1082.884  11.14x
   VectorXXH3HashingBenchmark.hashingKernel       2048  thrpt  30  ops/ms  48.642   541.341  11.13x
   VectorXXH3HashingBenchmark.hashingKernel       4096  thrpt  30  ops/ms  24.285   270.419  11.14x
   VectorXXH3HashingBenchmark.hashingKernel       8192  thrpt  30  ops/ms  12.421   135.115  10.88x
   ```
   
   2) SVE
   
   JMH on panama-vector:vectorIntrinsics:
   ```
   Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
   Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.396   577.744  7.37x
   Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.119  2538.261  35.19x
   Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.992  2536.972  34.75x
   Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.400   561.934  7.26x
   Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.858  2949.076  41.61x
   Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  70.654  2954.273  41.81x
   Int64Vector.rearrange         1024  thrpt  30  ops/ms  77.851   563.969  7.24x
   Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  67.433  1510.484  22.39x
   Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  66.614  1511.617  22.69x
   Long128Vector.rearrange       1024  thrpt  30  ops/ms  77.637   579.021  7.46x
   Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  69.886  1274.331  18.23x
   Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  70.069  1273.787  18.17x
   ```
   
   JMH on jdk mainline:
   ```
   Benchmark                                     (SIZE)  Mode  Cnt Units   Before    After   Gain
   SelectFromBenchmark.rearrangeFromDoubleVector  1024  thrpt  30  ops/ms  44.612  1351.850  30.30x
   SelectFromBenchmark.rearrangeFromDoubleVector  2048  thrpt  30  ops/ms  22.315   676.314  30.31x
   SelectFromBenchmark.rearrangeFromLongVector    1024  thrpt  30  ops/ms  46.372  1502.036  32.39x
   SelectFromBenchmark.rearrangeFromLongVector    2048  thrpt  30  ops/ms  23.361   749.133  32.07x
   VectorXXH3HashingBenchmark.hashingKernel       1024  thrpt  30  ops/ms  97.780  1759.061  17.99x
   VectorXXH3HashingBenchmark.hashingKernel       2048  thrpt  30  ops/ms  48.923   879.584  17.98x
   VectorXXH3HashingBenchmark.hashingKernel       4096  thrpt  30  ops/ms  24.219   439.588  18.15x
   VectorXXH3HashingBenchmark.hashingKernel       8192  thrpt  30  ops/ms  12.416   219.603  17.69x
   ```
   
   [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L209
   [2] https://bugs.openjdk.org/browse/JDK-8310691

-------------

Changes: https://git.openjdk.org/jdk/pull/23790/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23790&range=05
  Stats: 510 lines in 6 files changed: 401 ins; 86 del; 23 mod
  Patch: https://git.openjdk.org/jdk/pull/23790.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23790/head:pull/23790

PR: https://git.openjdk.org/jdk/pull/23790

