RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors [v6]
Emanuel Peter
epeter at openjdk.org
Mon Mar 24 12:16:12 UTC 2025
On Thu, 20 Mar 2025 07:13:43 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
>>
>> Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as payload. The minimum vector lane count of 4 for the byte type on AArch64 imposed this limitation on rearrange operations. However, the vector shuffle payload has since been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction on vector rearrange operations.
>>
>> This patch adds rearrange support for vector types with small lane counts. The main changes are:
>> - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>> - Relocated the NEON implementation from the ad file to the C2 macro assembler file to better handle the more complex implementation
>> - Reduced temporary register usage in the NEON implementation for short/int/float types from two registers to one
>>
>> Following is the performance improvement data for several Vector API JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE. Performance of the same benchmarks with other vector types is unchanged.
>>
>> 1) NEON
>>
>> JMH on panama-vector:vectorIntrinsics:
>>
>> Benchmark                    (size)  Mode   Cnt  Units    Before     After    Gain
>> Double128Vector.rearrange      1024  thrpt   30  ops/ms   78.060   578.859   7.42x
>> Double128Vector.sliceUnary     1024  thrpt   30  ops/ms   72.332  1811.664  25.05x
>> Double128Vector.unsliceUnary   1024  thrpt   30  ops/ms   72.256  1812.344  25.08x
>> Float64Vector.rearrange        1024  thrpt   30  ops/ms   77.879   558.797   7.18x
>> Float64Vector.sliceUnary       1024  thrpt   30  ops/ms   70.528  1981.304  28.09x
>> Float64Vector.unsliceUnary     1024  thrpt   30  ops/ms   71.735  1994.168  27.79x
>> Int64Vector.rearrange          1024  thrpt   30  ops/ms   76.374   562.106   7.36x
>> Int64Vector.sliceUnary         1024  thrpt   30  ops/ms   71.680  1190.127  16.60x
>> Int64Vector.unsliceUnary       1024  thrpt   30  ops/ms   71.895  1185.094  16.48x
>> Long128Vector.rearrange        1024  thrpt   30  ops/ms   78.902   579.250   7.34x
>> Long128Vector.sliceUnary       1024  thrpt   30  ops/ms   72.389   747.794  10.33x
>> Long128Vector.unsliceUnary     1024  thrpt   30  ops/ms   71....
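For anyone not steeped in the Vector API, the operation these match rules implement is a lane permutation driven by a shuffle index vector. A scalar sketch in plain Java (illustrative only; the class and method names below are hypothetical, not JDK code):

```java
// Scalar sketch of VectorRearrange semantics for a small-lane vector:
// result[i] = src[shuffle[i]] for each lane i.
public class RearrangeSketch {
    static double[] rearrange(double[] src, int[] shuffle) {
        double[] dst = new double[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[shuffle[i]];
        }
        return dst;
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0};   // a 2-lane (128-bit) double vector
        int[] swap = {1, 0};       // shuffle that swaps the two lanes
        double[] r = rearrange(v, swap);
        System.out.println(r[0] + " " + r[1]); // prints "2.0 1.0"
    }
}
```

The 2-lane case in the example is exactly the `2D` shape this patch newly supports on AArch64.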
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
>
> - Merge branch 'master' into JDK-8350463
> - Use a smaller warmup and array length in IR test
> - Update IR test based on the review comment
> - Merge branch 'jdk:master' into JDK-8350463
> - Add the IR test
> - 8350463: AArch64: Add vector rearrange support for small lane count vectors
>
I don't know the AArch64 instructions well enough to review this in depth, but the change looks reasonable. Testing looks good too.
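On the sliceUnary/unsliceUnary numbers in the benchmarks: those presumably improve together with rearrange because slice is lowered through the same permute path for these small-lane shapes. If I recall the Vector API semantics correctly, `v1.slice(origin, v2)` concatenates the two vectors and extracts a lane window starting at `origin`; a scalar sketch (plain Java, hypothetical names, not JDK code):

```java
// Scalar sketch of Vector API slice semantics: v1.slice(origin, v2)
// conceptually concatenates v1 and v2, then takes VLENGTH lanes
// starting at lane `origin` of the concatenation.
public class SliceSketch {
    static double[] slice(double[] v1, double[] v2, int origin) {
        int n = v1.length;
        double[] dst = new double[n];
        for (int i = 0; i < n; i++) {
            int j = i + origin;
            dst[i] = (j < n) ? v1[j] : v2[j - n];
        }
        return dst;
    }

    public static void main(String[] args) {
        // 2-lane vectors {1.0, 2.0} and {3.0, 4.0}, origin 1
        double[] r = slice(new double[]{1.0, 2.0}, new double[]{3.0, 4.0}, 1);
        System.out.println(r[0] + " " + r[1]); // prints "2.0 3.0"
    }
}
```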
-------------
Marked as reviewed by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23790#pullrequestreview-2710173259
More information about the hotspot-compiler-dev mailing list