RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors

Bhavana Kilambi bkilambi at openjdk.org
Mon Mar 3 09:46:51 UTC 2025


On Wed, 26 Feb 2025 01:18:57 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
> 
> Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as their payload. The minimum vector lane count of 4 for the byte type on AArch64 imposed this limitation on rearrange operations. However, the vector shuffle payload has been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.
> 
> This patch adds rearrange support for vector types with small lane counts. Here are the main changes:
>  - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>  - Relocated the NEON implementation from the ad file to the C2 macro assembler file for better handling of the complex implementation
>  - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
> 
> The following is performance data for several Vector API JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE, showing the improvement. Performance of the same benchmarks with other vector types remains unchanged.
> 
> 1) NEON
> 
> JMH on panama-vector:vectorIntrinsics:
> 
> Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
> Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.060   578.859  7.42x
> Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.332  1811.664  25.05x
> Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.256  1812.344  25.08x
> Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.879   558.797  7.18x
> Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.528  1981.304  28.09x
> Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.735  1994.168  27.79x
> Int64Vector.rearrange         1024  thrpt  30  ops/ms  76.374   562.106  7.36x
> Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  71.680  1190.127  16.60x
> Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  71.895  1185.094  16.48x
> Long128Vector.rearrange       1024  thrpt  30  ops/ms  78.902   579.250  7.34x
> Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  72.389   747.794  10.33x
> Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.999   747.848  10.38x
> 
> 
> JMH on jdk mainline:
> 
> Benchmark ...

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2595:

> 2593:       // type B/S/I/L/F/D, and the offset between two types is 16; Hence
> 2594:       // the offset for L is 48.
> 2595:       lea(rscratch1,

Hi @XiaohongGong, thanks for adding support for 2D/2L as well. I was trying to implement the same for the two-vector table case, and I am wondering what you think of this implementation:

negr(dst, shuffle);         // create a mask: a shuffle lane of 1 becomes all 1s, a lane of 0 becomes all 0s
dup(tmp1, src1, 0);         // broadcast element 0 of src1
dup(tmp2, src2, 1);         // broadcast element 1 of src2
bsl(dst, T16B, tmp2, tmp1); // select bits from tmp2 where dst is 1 and from tmp1 where dst is 0



I am really not sure which implementation would be faster, though. This one might take around 8 cycles.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1977187103


More information about the hotspot-compiler-dev mailing list