RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors [v3]

Mon Mar 17 07:21:01 UTC 2025

On Fri, 14 Mar 2025 10:04:50 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
>> 
>> Vector rearrange operations depend on vector shuffle inputs, which used byte array as payload previously. The minimum vector lane count of 4 for byte type on AArch64 imposed this limitation on rearrange operations. However, vector shuffle payload has been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.
>> 
>> This patch added the rearrange support for vector types with small lane count. Here are the main changes:
>>  - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>>  - Relocated NEON implementation from ad file to c2 macro assembler file for better handling of complex implementation
>>  - Optimized temporary register usage in NEON implementation for short/int/float types from two registers to one
>> 
>> Following is the performance improvement data of several Vector API JMH benchmarks, on a NVIDIA Grace CPU with NEON and SVE. Performance of the same JMH with other vector types remains unchanged.
>> 
>> 1) NEON
>> 
>> JMH on panama-vector:vectorIntrinsics:
>> 
>> Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
>> Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.060   578.859  7.42x
>> Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.332  1811.664  25.05x
>> Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.256  1812.344  25.08x
>> Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.879   558.797  7.18x
>> Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.528  1981.304  28.09x
>> Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.735  1994.168  27.79x
>> Int64Vector.rearrange         1024  thrpt  30  ops/ms  76.374   562.106  7.36x
>> Int64Vector.sliceUnary        1024  thrpt  30  ops/ms  71.680  1190.127  16.60x
>> Int64Vector.unsliceUnary      1024  thrpt  30  ops/ms  71.895  1185.094  16.48x
>> Long128Vector.rearrange       1024  thrpt  30  ops/ms  78.902   579.250  7.34x
>> Long128Vector.sliceUnary      1024  thrpt  30  ops/ms  72.389   747.794  10.33x
>> Long128Vector.unsliceUnary    1024  thrpt  30  ops/ms  71....
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
> 
>  - Merge branch 'jdk:master' into JDK-8350463
>  - Add the IR test
>  - 8350463: AArch64: Add vector rearrange support for small lane count vectors
>    
>    The AArch64 vector rearrange implementation currently lacks support for
>    vector types with lane counts < 4 (see [1]). This limitation results in
>    significant performance gaps when running Long/Double vector benchmarks
>    on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to
>    other SVE and x86 platforms.
>    
>    Vector rearrange operations depend on vector shuffle inputs, which used
>    byte array as payload previously. The minimum vector lane count of 4 for
>    byte type on AArch64 imposed this limitation on rearrange operations.
>    However, vector shuffle payload has been updated to use vector-specific
>    data types (e.g., `int` for `IntVector`) (see [2]). This change enables
>    us to remove the lane count restriction for vector rearrange operations.
>    
>    This patch added the rearrange support for vector types with small lane
>    count. Here are the main changes:
>     - Added AArch64 match rule support for `VectorRearrange` with smaller
>       lane counts (e.g., `2D/2S`)
>     - Relocated NEON implementation from ad file to c2 macro assembler file
>       for better handling of complex implementation
>     - Optimized temporary register usage in NEON implementation for
>       short/int/float types from two registers to one
>    
>    Following is the performance improvement data of several Vector API JMH
>    benchmarks, on a NVIDIA Grace CPU with NEON and SVE. Performance of the
>    same JMH with other vector types remains unchanged.
>    
>    1) NEON
>    
>    JMH on panama-vector:vectorIntrinsics:
>    ```
>    Benchmark                    (size) Mode   Cnt Units   Before    After   Gain
>    Double128Vector.rearrange     1024  thrpt  30  ops/ms  78.060   578.859  7.42x
>    Double128Vector.sliceUnary    1024  thrpt  30  ops/ms  72.332  1811.664  25.05x
>    Double128Vector.unsliceUnary  1024  thrpt  30  ops/ms  72.256  1812.344  25.08x
>    Float64Vector.rearrange       1024  thrpt  30  ops/ms  77.879   558.797  7.18x
>    Float64Vector.sliceUnary      1024  thrpt  30  ops/ms  70.528  1981.304  28.09x
>    Float64Vector.unsliceUnary    1024  thrpt  30  ops/ms  71.735  1994.168  27.79x
>    Int64Vector.rearrange         1024  thrpt  30  ops/ms  76.374   562.106  7.36x
>    Int64Vector.slic...

Thanks for adding the IR test!

Actually I just checked, and it seems we don't really have many tests for `rearrange`, and that's not great. The only test I could find was:
`test/hotspot/jtreg/compiler/vectorapi/TestTwoVectorPermute.java`
But the coverage here is not sufficient at all. I think it only covers 256 bit vectors.

Your test now covers 64 bit and 128 bit cases, but not for all types. I guess you are only targetting small types, so I don't want to burden you with writing tests for all sizes now. But we should definitely file an RFE that tests `rearrange` more thoroughly.

Maybe I just didn't find the tests, so please check again yourself first ;)

I have not reviewed the actual aarch64 code, I'll leave that to experts in that field ;)

test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 28:

> 26:  * @bug 8350463
> 27:  * @summary AArch64: Add vector rearrange support for small lane count vectors
> 28:  * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx.*") | os.arch=="aarch64"

Can you please remove the requirement here, and instead move restrictions to just the IR rules? That way the test can run on all platforms, but the IR verification is only executed on the specified platforms.

test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 94:

> 92:             fsrc[i] = random.nextFloat();
> 93:             dsrc[i] = random.nextDouble();
> 94:         }

Could you please use `Generators.java`? This makes sure that we have more "interesting" values in the distribution.

test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 115:

> 113:                       .intoArray(bdst, i);
> 114:         }
> 115:     }

Is there any verification here that the result is correct? I usually do that with a `GOLD` value computed once at the beginning (in interpreter mode because C2 compilation only happens later), and then a `@Check` method where I compare the results.

test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 219:

> 217:         TestFramework testFramework = new TestFramework();
> 218:         testFramework.setDefaultWarmup(10000)
> 219:                      .addFlags("--add-modules=jdk.incubator.vector", "-XX:-TieredCompilation")

Suggestion:

                     .addFlags("--add-modules=jdk.incubator.vector")

I don't think that `TieredCompilation` should be necessary here. Or was it?

-------------

PR Review: https://git.openjdk.org/jdk/pull/23790#pullrequestreview-2689316798
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998060161
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998061285
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998061844
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998062644