RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors

Xiaohong Gong xgong at openjdk.org
Wed Feb 26 01:23:50 UTC 2025


The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.

Vector rearrange operations take a vector shuffle as input, and shuffles previously used a byte array as their payload. Because byte vectors on AArch64 require a minimum lane count of 4, rearrange operations inherited the same restriction. The vector shuffle payload has since been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.
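
For readers less familiar with the operation, here is a minimal scalar sketch of what a rearrange computes on a 2-lane vector. Plain arrays stand in for the vector and the shuffle, and the class and method names (`RearrangeModel`, `rearrange`) are hypothetical, chosen only for this illustration. It does not use the Vector API and does not model how out-of-range shuffle indices are handled there; it simply rejects them.

```java
import java.util.Arrays;

// Scalar model of a vector rearrange. Hypothetical names, illustration only.
final class RearrangeModel {
    // Lane i of the result is taken from lane shuffle[i] of the source.
    // Simplification: an out-of-range index is rejected with an exception.
    static int[] rearrange(int[] src, int[] shuffle) {
        if (src.length != shuffle.length) {
            throw new IllegalArgumentException("lane count mismatch");
        }
        int[] result = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            int lane = shuffle[i];
            if (lane < 0 || lane >= src.length) {
                throw new IndexOutOfBoundsException("lane " + lane);
            }
            result[i] = src[lane];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v = {10, 20};   // two int lanes, the shape of a 64-bit int vector
        int[] swap = {1, 0};  // shuffle payload held as int, one entry per lane
        System.out.println(Arrays.toString(rearrange(v, swap))); // prints [20, 10]
    }
}
```

With only two lanes, a byte-based shuffle would have been a 2-lane byte vector, which is below the minimum of 4 described above; an `int` payload has the same lane count and element size as the vector being rearranged.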

This patch adds rearrange support for vector types with small lane counts. The main changes are:
 - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
 - Relocated the NEON implementation from the ad file to the C2 macro assembler file, which is better suited to the more complex code sequence
 - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
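
As background on why the NEON path needs a more involved sequence than a single instruction: the NEON table lookup (`TBL`) selects individual bytes, so a lane-level shuffle for wider elements has to be turned into byte indices first. The sketch below is a scalar model of that idea only, under the assumption that each lane index `k` of an element `w` bytes wide expands to byte indices `k*w .. k*w + w - 1`. It is not the instruction sequence used by this patch, and the names (`ByteTableLookupModel`, `toByteIndices`, `lookup`) are hypothetical.

```java
import java.util.Arrays;

// Scalar model of rearranging wide lanes through a byte-granular table lookup.
// Hypothetical names; an illustration under stated assumptions, not the patch's code.
final class ByteTableLookupModel {
    // Expand each lane index into the indices of the bytes that lane occupies.
    static int[] toByteIndices(int[] laneShuffle, int elemBytes) {
        int[] byteIdx = new int[laneShuffle.length * elemBytes];
        for (int i = 0; i < laneShuffle.length; i++) {
            for (int b = 0; b < elemBytes; b++) {
                byteIdx[i * elemBytes + b] = laneShuffle[i] * elemBytes + b;
            }
        }
        return byteIdx;
    }

    // Byte-wise lookup: out[i] = table[byteIdx[i]].
    // Simplification: an out-of-range index throws here instead of producing zero.
    static byte[] lookup(byte[] table, int[] byteIdx) {
        byte[] out = new byte[byteIdx.length];
        for (int i = 0; i < byteIdx.length; i++) {
            out[i] = table[byteIdx[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        // Swap the two 8-byte lanes of a 16-byte vector.
        int[] byteIdx = toByteIndices(new int[] {1, 0}, 8);
        System.out.println(Arrays.toString(byteIdx));
        // prints [8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```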

The following tables show the performance improvement for several Vector API JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE. Performance of the same benchmarks with other vector types remains unchanged.

1) NEON

JMH on panama-vector:vectorIntrinsics:

Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.060   578.859  7.42x
Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.332  1811.664  25.05x
Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.256  1812.344  25.08x
Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.879   558.797  7.18x
Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.528  1981.304  28.09x
Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.735  1994.168  27.79x
Int64Vector.rearrange         1024  thrpt  30  ops/ms  76.374   562.106  7.36x
Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  71.680  1190.127  16.60x
Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  71.895  1185.094  16.48x
Long128Vector.rearrange       1024  thrpt  30  ops/ms  78.902   579.250  7.34x
Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  72.389   747.794  10.33x
Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.999   747.848  10.38x


JMH on jdk mainline:

Benchmark                                     (SIZE) Mode  Cnt  Units   Before   After    Gain
SelectFromBenchmark.rearrangeFromDoubleVector  1024  thrpt  30  ops/ms  44.593  1319.977  29.63x
SelectFromBenchmark.rearrangeFromDoubleVector  2048  thrpt  30  ops/ms  22.318   660.061  29.58x
SelectFromBenchmark.rearrangeFromLongVector    1024  thrpt  30  ops/ms  45.823  1458.144  31.82x
SelectFromBenchmark.rearrangeFromLongVector    2048  thrpt  30  ops/ms  23.050   729.881  31.67x
VectorXXH3HashingBenchmark.hashingKernel       1024  thrpt  30  ops/ms  97.210  1082.884  11.14x
VectorXXH3HashingBenchmark.hashingKernel       2048  thrpt  30  ops/ms  48.642   541.341  11.13x
VectorXXH3HashingBenchmark.hashingKernel       4096  thrpt  30  ops/ms  24.285   270.419  11.14x
VectorXXH3HashingBenchmark.hashingKernel       8192  thrpt  30  ops/ms  12.421   135.115  10.88x


2) SVE

JMH on panama-vector:vectorIntrinsics:

Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.396   577.744  7.37x
Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.119  2538.261  35.19x
Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.992  2536.972  34.75x
Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.400   561.934  7.26x
Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.858  2949.076  41.61x
Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  70.654  2954.273  41.81x
Int64Vector.rearrange         1024  thrpt  30  ops/ms  77.851   563.969  7.24x
Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  67.433  1510.484  22.39x
Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  66.614  1511.617  22.69x
Long128Vector.rearrange       1024  thrpt  30  ops/ms  77.637   579.021  7.46x
Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  69.886  1274.331  18.23x
Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  70.069  1273.787  18.17x


JMH on jdk mainline:

Benchmark                                     (SIZE)  Mode  Cnt Units   Before    After   Gain
SelectFromBenchmark.rearrangeFromDoubleVector  1024  thrpt  30  ops/ms  44.612  1351.850  30.30x
SelectFromBenchmark.rearrangeFromDoubleVector  2048  thrpt  30  ops/ms  22.315   676.314  30.31x
SelectFromBenchmark.rearrangeFromLongVector    1024  thrpt  30  ops/ms  46.372  1502.036  32.39x
SelectFromBenchmark.rearrangeFromLongVector    2048  thrpt  30  ops/ms  23.361   749.133  32.07x
VectorXXH3HashingBenchmark.hashingKernel       1024  thrpt  30  ops/ms  97.780  1759.061  17.99x
VectorXXH3HashingBenchmark.hashingKernel       2048  thrpt  30  ops/ms  48.923   879.584  17.98x
VectorXXH3HashingBenchmark.hashingKernel       4096  thrpt  30  ops/ms  24.219   439.588  18.15x
VectorXXH3HashingBenchmark.hashingKernel       8192  thrpt  30  ops/ms  12.416   219.603  17.69x


[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L209
[2] https://bugs.openjdk.org/browse/JDK-8310691

-------------

Commit messages:
 - 8350463: AArch64: Add vector rearrange support for small lane count vectors

Changes: https://git.openjdk.org/jdk/pull/23790/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23790&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8350463
  Stats: 169 lines in 4 files changed: 60 ins; 86 del; 23 mod
  Patch: https://git.openjdk.org/jdk/pull/23790.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23790/head:pull/23790

PR: https://git.openjdk.org/jdk/pull/23790

