RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors [v6]
Emanuel Peter
epeter at openjdk.org
Mon Mar 24 12:16:12 UTC 2025
On Thu, 20 Mar 2025 07:13:43 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
>>
>> Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as payload. The minimum vector lane count of 4 for the byte type on AArch64 imposed this limitation on rearrange operations. However, the vector shuffle payload has since been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction on vector rearrange operations.
>>
>> This patch adds rearrange support for vector types with small lane counts. The main changes are:
>> - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>> - Relocated the NEON implementation from the ad file to the C2 macro assembler file to better handle the more complex implementation
>> - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
>>
>> Following is the performance improvement data for several Vector API JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE. Performance of the same benchmarks with other vector types is unchanged.
>>
>> 1) NEON
>>
>> JMH on panama-vector:vectorIntrinsics:
>>
>> Benchmark                    (size)  Mode   Cnt  Units    Before     After    Gain
>> Double128Vector.rearrange      1024  thrpt   30  ops/ms   78.060   578.859   7.42x
>> Double128Vector.sliceUnary     1024  thrpt   30  ops/ms   72.332  1811.664  25.05x
>> Double128Vector.unsliceUnary   1024  thrpt   30  ops/ms   72.256  1812.344  25.08x
>> Float64Vector.rearrange        1024  thrpt   30  ops/ms   77.879   558.797   7.18x
>> Float64Vector.sliceUnary       1024  thrpt   30  ops/ms   70.528  1981.304  28.09x
>> Float64Vector.unsliceUnary     1024  thrpt   30  ops/ms   71.735  1994.168  27.79x
>> Int64Vector.rearrange          1024  thrpt   30  ops/ms   76.374   562.106   7.36x
>> Int64Vector.sliceUnary         1024  thrpt   30  ops/ms   71.680  1190.127  16.60x
>> Int64Vector.unsliceUnary       1024  thrpt   30  ops/ms   71.895  1185.094  16.48x
>> Long128Vector.rearrange        1024  thrpt   30  ops/ms   78.902   579.250   7.34x
>> Long128Vector.sliceUnary       1024  thrpt   30  ops/ms   72.389   747.794  10.33x
>> Long128Vector.unsliceUnary     1024  thrpt   30  ops/ms   71....
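For anyone not steeped in the Vector API, the operation these match rules implement is a lane permutation driven by a shuffle index vector. A scalar sketch in plain Java (illustrative only; the class and method names below are hypothetical, not JDK code):

```java
// Scalar sketch of VectorRearrange semantics for a small-lane vector:
// result[i] = src[shuffle[i]] for each lane i.
public class RearrangeSketch {
    static double[] rearrange(double[] src, int[] shuffle) {
        double[] dst = new double[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[shuffle[i]];
        }
        return dst;
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0};   // a 2-lane (128-bit) double vector
        int[] swap = {1, 0};       // shuffle that swaps the two lanes
        double[] r = rearrange(v, swap);
        System.out.println(r[0] + " " + r[1]); // prints "2.0 1.0"
    }
}
```

The 2-lane case in the example is exactly the `2D` shape this patch newly supports on AArch64.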
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
>
> - Merge branch 'master' into JDK-8350463
> - Use a smaller warmup and array length in IR test
> - Update IR test based on the review comment
> - Merge branch 'jdk:master' into JDK-8350463
> - Add the IR test
> - 8350463: AArch64: Add vector rearrange support for small lane count vectors
>
I don't know the AArch64 instructions well enough to review this in depth, but the change looks reasonable. Testing looks good too.
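On the sliceUnary/unsliceUnary numbers in the benchmarks: those presumably improve together with rearrange because slice is lowered through the same permute path for these small-lane shapes. If I recall the Vector API semantics correctly, `v1.slice(origin, v2)` concatenates the two vectors and extracts a lane window starting at `origin`; a scalar sketch (plain Java, hypothetical names, not JDK code):

```java
// Scalar sketch of Vector API slice semantics: v1.slice(origin, v2)
// conceptually concatenates v1 and v2, then takes VLENGTH lanes
// starting at lane `origin` of the concatenation.
public class SliceSketch {
    static double[] slice(double[] v1, double[] v2, int origin) {
        int n = v1.length;
        double[] dst = new double[n];
        for (int i = 0; i < n; i++) {
            int j = i + origin;
            dst[i] = (j < n) ? v1[j] : v2[j - n];
        }
        return dst;
    }

    public static void main(String[] args) {
        // 2-lane vectors {1.0, 2.0} and {3.0, 4.0}, origin 1
        double[] r = slice(new double[]{1.0, 2.0}, new double[]{3.0, 4.0}, 1);
        System.out.println(r[0] + " " + r[1]); // prints "2.0 3.0"
    }
}
```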
-------------
Marked as reviewed by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23790#pullrequestreview-2710173259
More information about the hotspot-compiler-dev mailing list