RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors
Xiaohong Gong
xgong at openjdk.org
Tue Mar 4 08:02:59 UTC 2025
On Mon, 3 Mar 2025 09:44:37 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
>>
>> Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as their payload. The minimum vector lane count of 4 for the byte type on AArch64 imposed this limitation on rearrange operations. However, the vector shuffle payload has since been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.
>>
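>> For illustration only (a minimal sketch using the public Vector API; the class name and values are made up), a 2-lane rearrange looks like this at the Java level:
>>
>> import jdk.incubator.vector.DoubleVector;
>> import jdk.incubator.vector.VectorShuffle;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> public class RearrangeExample {
>>     // SPECIES_128 has two double lanes, i.e. the small-lane-count case this patch targets.
>>     static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;
>>
>>     public static void main(String[] args) {
>>         double[] a = {1.0, 2.0};
>>         // The shuffle {1, 0} swaps the two lanes.
>>         VectorShuffle<Double> swap = VectorShuffle.fromValues(SPECIES, 1, 0);
>>         DoubleVector v = DoubleVector.fromArray(SPECIES, a, 0);
>>         DoubleVector r = v.rearrange(swap);   // lanes become {2.0, 1.0}
>>         System.out.println(r);
>>     }
>> }
>>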
>> This patch adds rearrange support for vector types with small lane counts. Here are the main changes:
>> - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>> - Relocated the NEON implementation from the ad file to the C2 macro assembler file to better handle the more complex implementation
>> - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
>>
>> Below is the performance improvement data for several Vector API JMH benchmarks, on an NVIDIA Grace CPU with NEON and SVE. Performance of the same benchmarks with other vector types remains unchanged.
>>
>> 1) NEON
>>
>> JMH on panama-vector:vectorIntrinsics:
>>
>> Benchmark                     (size)   Mode  Cnt   Units    Before     After    Gain
>> Double128Vector.rearrange       1024  thrpt   30  ops/ms    78.060   578.859   7.42x
>> Double128Vector.sliceUnary      1024  thrpt   30  ops/ms    72.332  1811.664  25.05x
>> Double128Vector.unsliceUnary    1024  thrpt   30  ops/ms    72.256  1812.344  25.08x
>> Float64Vector.rearrange         1024  thrpt   30  ops/ms    77.879   558.797   7.18x
>> Float64Vector.sliceUnary        1024  thrpt   30  ops/ms    70.528  1981.304  28.09x
>> Float64Vector.unsliceUnary      1024  thrpt   30  ops/ms    71.735  1994.168  27.79x
>> Int64Vector.rearrange           1024  thrpt   30  ops/ms    76.374   562.106   7.36x
>> Int64Vector.sliceUnary          1024  thrpt   30  ops/ms    71.680  1190.127  16.60x
>> Int64Vector.unsliceUnary        1024  thrpt   30  ops/ms    71.895  1185.094  16.48x
>> Long128Vector.rearrange         1024  thrpt   30  ops/ms    78.902   579.250   7.34x
>> Long128Vector.sliceUnary        1024  thrpt   30  ops/ms    72.389   747.794  10.33x
>> Long128Vector.unsliceUnary      1024  thrpt   30  ops/ms    71....
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2595:
>
>> 2593: // type B/S/I/L/F/D, and the offset between two types is 16; Hence
>> 2594: // the offset for L is 48.
>> 2595: lea(rscratch1,
>
> Hi @XiaohongGong, thanks for adding support for 2D/2L as well. I was trying to implement the same for the two vector table, and I am wondering what you think of this implementation:
>
> negr(dst, shuffle);         // create a mask: a lane holding index 1 becomes all 1s, a lane holding 0 stays all 0s
> dup(tmp1, src1, 0);         // duplicate the first element of src1 into all lanes
> dup(tmp2, src1, 1);         // duplicate the second element of src1 into all lanes
> bsl(dst, T16B, tmp2, tmp1); // select from tmp2 where dst is 1 and from tmp1 where dst is 0
>
>
>
> I am really not sure which implementation would be faster though. This implementation might take around 8 cycles.
Hi @Bhavana-Kilambi, I've finished testing your suggestion on my Grace CPU. All the vectorapi jtreg tests pass, so this solution works well. However, contrary to what I expected, the performance shows no obvious change compared with the current PR's codegen.
Here is the performance data:
Benchmark                                       (size)   Mode  Cnt    Current  Bhavana's   Units   Gain
Double128Vector.rearrange                         1024  thrpt   30    591.504    588.616  ops/ms  0.995
Long128Vector.rearrange                           1024  thrpt   30    593.348    590.802  ops/ms  0.995
SelectFromBenchmark.rearrangeFromByteVector       1024  thrpt   30  16576.713  16664.580  ops/ms  1.005
SelectFromBenchmark.rearrangeFromByteVector       2048  thrpt   30   8358.694   8392.733  ops/ms  1.004
SelectFromBenchmark.rearrangeFromDoubleVector     1024  thrpt   30   1312.752   1213.538  ops/ms  0.924
SelectFromBenchmark.rearrangeFromDoubleVector     2048  thrpt   30    657.365    607.060  ops/ms  0.923
SelectFromBenchmark.rearrangeFromFloatVector      1024  thrpt   30   1905.595   1911.831  ops/ms  1.003
SelectFromBenchmark.rearrangeFromFloatVector      2048  thrpt   30    952.205    957.160  ops/ms  1.005
SelectFromBenchmark.rearrangeFromIntVector        1024  thrpt   30   2106.763   2107.238  ops/ms  1.000
SelectFromBenchmark.rearrangeFromIntVector        2048  thrpt   30   1056.299   1056.769  ops/ms  1.000
SelectFromBenchmark.rearrangeFromLongVector       1024  thrpt   30   1462.355   1247.853  ops/ms  0.853
SelectFromBenchmark.rearrangeFromLongVector       2048  thrpt   30    732.559    616.753  ops/ms  0.841
SelectFromBenchmark.rearrangeFromShortVector      1024  thrpt   30   4560.253   4559.861  ops/ms  0.999
SelectFromBenchmark.rearrangeFromShortVector      2048  thrpt   30   2279.058   2279.693  ops/ms  1.000
VectorXXH3HashingBenchmark.hashingKernel          1024  thrpt   30   1080.589   1073.883  ops/ms  0.993
VectorXXH3HashingBenchmark.hashingKernel          2048  thrpt   30    541.629    537.288  ops/ms  0.991
VectorXXH3HashingBenchmark.hashingKernel          4096  thrpt   30    269.886    268.460  ops/ms  0.994
VectorXXH3HashingBenchmark.hashingKernel          8192  thrpt   30    135.193    134.175  ops/ms  0.992

I expected it would show an obvious improvement, since we no longer need the heavy `ldr` instruction, but I got similar performance data on an AArch64 N1 machine as well. One drawback of your suggestion that I can see is that it needs one more temporary vector register. To be honest, I'm not sure which one is better; maybe we need more performance data on different kinds of AArch64 machines. So, would you mind testing the performance on other AArch64 machines with NEON? Thanks a lot!
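
For reference, here is a scalar Java model of what the suggested negr/dup/bsl sequence computes for the 2-lane case (just a sketch of the semantics, not the generated code; the method name is made up):

static double[] rearrange2(double[] src, int[] shuffle) {
    // dup(tmp1, src1, 0) / dup(tmp2, src1, 1): broadcast each source lane.
    double lane0 = src[0];
    double lane1 = src[1];
    double[] dst = new double[2];
    for (int i = 0; i < 2; i++) {
        // negr turns shuffle index 1 into an all-ones mask and index 0 into all zeros;
        // bsl then picks lane1 where the mask bits are set and lane0 elsewhere.
        dst[i] = (shuffle[i] == 1) ? lane1 : lane0;
    }
    return dst;
}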
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1978832690