RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors
Xiaohong Gong
xgong at openjdk.org
Tue Mar 4 08:02:59 UTC 2025
On Mon, 3 Mar 2025 09:44:37 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
>>
>> Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as their payload. The minimum vector lane count of 4 for the byte type on AArch64 imposed this limitation on rearrange operations. However, the vector shuffle payload has since been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.
>>
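>> For illustration only (a minimal sketch using the public Vector API; the class name and values are made up), a 2-lane rearrange looks like this at the Java level:
>>
>> import jdk.incubator.vector.DoubleVector;
>> import jdk.incubator.vector.VectorShuffle;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> public class RearrangeExample {
>>     // SPECIES_128 has two double lanes, i.e. the small-lane-count case this patch targets.
>>     static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;
>>
>>     public static void main(String[] args) {
>>         double[] a = {1.0, 2.0};
>>         // The shuffle {1, 0} swaps the two lanes.
>>         VectorShuffle<Double> swap = VectorShuffle.fromValues(SPECIES, 1, 0);
>>         DoubleVector v = DoubleVector.fromArray(SPECIES, a, 0);
>>         DoubleVector r = v.rearrange(swap);   // lanes become {2.0, 1.0}
>>         System.out.println(r);
>>     }
>> }
>>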
>> This patch adds rearrange support for vector types with small lane counts. Here are the main changes:
>> - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>> - Relocated the NEON implementation from the ad file to the C2 macro assembler file to better handle the more complex implementation
>> - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
>>
>> Below is the performance improvement data for several Vector API JMH benchmarks, on an NVIDIA Grace CPU with NEON and SVE. Performance of the same benchmarks with other vector types remains unchanged.
>>
>> 1) NEON
>>
>> JMH on panama-vector:vectorIntrinsics:
>>
>> Benchmark                     (size)   Mode  Cnt   Units    Before     After    Gain
>> Double128Vector.rearrange       1024  thrpt   30  ops/ms    78.060   578.859   7.42x
>> Double128Vector.sliceUnary      1024  thrpt   30  ops/ms    72.332  1811.664  25.05x
>> Double128Vector.unsliceUnary    1024  thrpt   30  ops/ms    72.256  1812.344  25.08x
>> Float64Vector.rearrange         1024  thrpt   30  ops/ms    77.879   558.797   7.18x
>> Float64Vector.sliceUnary        1024  thrpt   30  ops/ms    70.528  1981.304  28.09x
>> Float64Vector.unsliceUnary      1024  thrpt   30  ops/ms    71.735  1994.168  27.79x
>> Int64Vector.rearrange           1024  thrpt   30  ops/ms    76.374   562.106   7.36x
>> Int64Vector.sliceUnary          1024  thrpt   30  ops/ms    71.680  1190.127  16.60x
>> Int64Vector.unsliceUnary        1024  thrpt   30  ops/ms    71.895  1185.094  16.48x
>> Long128Vector.rearrange         1024  thrpt   30  ops/ms    78.902   579.250   7.34x
>> Long128Vector.sliceUnary        1024  thrpt   30  ops/ms    72.389   747.794  10.33x
>> Long128Vector.unsliceUnary      1024  thrpt   30  ops/ms    71....
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2595:
>
>> 2593: // type B/S/I/L/F/D, and the offset between two types is 16; Hence
>> 2594: // the offset for L is 48.
>> 2595: lea(rscratch1,
>
> Hi @XiaohongGong, thanks for adding support for 2D/2L as well. I was trying to implement the same for the two vector table, and I am wondering what you think of this implementation:
>
> negr(dst, shuffle);         // create a mask: a lane holding index 1 becomes all 1s, a lane holding 0 stays all 0s
> dup(tmp1, src1, 0);         // duplicate the first element of src1 into all lanes
> dup(tmp2, src1, 1);         // duplicate the second element of src1 into all lanes
> bsl(dst, T16B, tmp2, tmp1); // select from tmp2 where dst is 1 and from tmp1 where dst is 0
>
>
>
> I am really not sure which implementation would be faster though. This implementation might take around 8 cycles.
Hi @Bhavana-Kilambi, I've finished testing your suggestion on my Grace CPU. All the vectorapi jtreg tests pass, so this solution works well. However, contrary to what I expected, the performance shows no obvious change compared with the current PR's codegen.
Here is the performance data:
Benchmark                                       (size)   Mode  Cnt    Current  Bhavana's   Units   Gain
Double128Vector.rearrange                         1024  thrpt   30    591.504    588.616  ops/ms  0.995
Long128Vector.rearrange                           1024  thrpt   30    593.348    590.802  ops/ms  0.995
SelectFromBenchmark.rearrangeFromByteVector       1024  thrpt   30  16576.713  16664.580  ops/ms  1.005
SelectFromBenchmark.rearrangeFromByteVector       2048  thrpt   30   8358.694   8392.733  ops/ms  1.004
SelectFromBenchmark.rearrangeFromDoubleVector     1024  thrpt   30   1312.752   1213.538  ops/ms  0.924
SelectFromBenchmark.rearrangeFromDoubleVector     2048  thrpt   30    657.365    607.060  ops/ms  0.923
SelectFromBenchmark.rearrangeFromFloatVector      1024  thrpt   30   1905.595   1911.831  ops/ms  1.003
SelectFromBenchmark.rearrangeFromFloatVector      2048  thrpt   30    952.205    957.160  ops/ms  1.005
SelectFromBenchmark.rearrangeFromIntVector        1024  thrpt   30   2106.763   2107.238  ops/ms  1.000
SelectFromBenchmark.rearrangeFromIntVector        2048  thrpt   30   1056.299   1056.769  ops/ms  1.000
SelectFromBenchmark.rearrangeFromLongVector       1024  thrpt   30   1462.355   1247.853  ops/ms  0.853
SelectFromBenchmark.rearrangeFromLongVector       2048  thrpt   30    732.559    616.753  ops/ms  0.841
SelectFromBenchmark.rearrangeFromShortVector      1024  thrpt   30   4560.253   4559.861  ops/ms  0.999
SelectFromBenchmark.rearrangeFromShortVector      2048  thrpt   30   2279.058   2279.693  ops/ms  1.000
VectorXXH3HashingBenchmark.hashingKernel          1024  thrpt   30   1080.589   1073.883  ops/ms  0.993
VectorXXH3HashingBenchmark.hashingKernel          2048  thrpt   30    541.629    537.288  ops/ms  0.991
VectorXXH3HashingBenchmark.hashingKernel          4096  thrpt   30    269.886    268.460  ops/ms  0.994
VectorXXH3HashingBenchmark.hashingKernel          8192  thrpt   30    135.193    134.175  ops/ms  0.992

I expected it would show an obvious improvement, since we no longer need the heavy `ldr` instruction, but I got similar performance data on an AArch64 N1 machine as well. One drawback of your suggestion that I can see is that it needs one more temporary vector register. To be honest, I'm not sure which one is better; maybe we need more performance data on different kinds of AArch64 machines. So, would you mind testing the performance on other AArch64 machines with NEON? Thanks a lot!
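
For reference, here is a scalar Java model of what the suggested negr/dup/bsl sequence computes for the 2-lane case (just a sketch of the semantics, not the generated code; the method name is made up):

static double[] rearrange2(double[] src, int[] shuffle) {
    // dup(tmp1, src1, 0) / dup(tmp2, src1, 1): broadcast each source lane.
    double lane0 = src[0];
    double lane1 = src[1];
    double[] dst = new double[2];
    for (int i = 0; i < 2; i++) {
        // negr turns shuffle index 1 into an all-ones mask and index 0 into all zeros;
        // bsl then picks lane1 where the mask bits are set and lane0 elsewhere.
        dst[i] = (shuffle[i] == 1) ? lane1 : lane0;
    }
    return dst;
}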
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1978832690