RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors [v3]
Emanuel Peter
epeter at openjdk.org
Mon Mar 17 07:21:01 UTC 2025
On Fri, 14 Mar 2025 10:04:50 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.
>>
>> Vector rearrange operations depend on vector shuffle inputs, which used byte array as payload previously. The minimum vector lane count of 4 for byte type on AArch64 imposed this limitation on rearrange operations. However, vector shuffle payload has been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.
>>
>> This patch added the rearrange support for vector types with small lane count. Here are the main changes:
>> - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
>> - Relocated NEON implementation from ad file to c2 macro assembler file for better handling of complex implementation
>> - Optimized temporary register usage in NEON implementation for short/int/float types from two registers to one
>>
>> Following is the performance improvement data of several Vector API JMH benchmarks, on a NVIDIA Grace CPU with NEON and SVE. Performance of the same JMH with other vector types remains unchanged.
>>
>> 1) NEON
>>
>> JMH on panama-vector:vectorIntrinsics:
>>
>> Benchmark (size) Mode Cnt Units Before After Gain
>> Double128Vector.rearrange 1024 thrpt 30 ops/ms 78.060 578.859 7.42x
>> Double128Vector.sliceUnary 1024 thrpt 30 ops/ms 72.332 1811.664 25.05x
>> Double128Vector.unsliceUnary 1024 thrpt 30 ops/ms 72.256 1812.344 25.08x
>> Float64Vector.rearrange 1024 thrpt 30 ops/ms 77.879 558.797 7.18x
>> Float64Vector.sliceUnary 1024 thrpt 30 ops/ms 70.528 1981.304 28.09x
>> Float64Vector.unsliceUnary 1024 thrpt 30 ops/ms 71.735 1994.168 27.79x
>> Int64Vector.rearrange 1024 thrpt 30 ops/ms 76.374 562.106 7.36x
>> Int64Vector.sliceUnary 1024 thrpt 30 ops/ms 71.680 1190.127 16.60x
>> Int64Vector.unsliceUnary 1024 thrpt 30 ops/ms 71.895 1185.094 16.48x
>> Long128Vector.rearrange 1024 thrpt 30 ops/ms 78.902 579.250 7.34x
>> Long128Vector.sliceUnary 1024 thrpt 30 ops/ms 72.389 747.794 10.33x
>> Long128Vector.unsliceUnary 1024 thrpt 30 ops/ms 71....
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
>
> - Merge branch 'jdk:master' into JDK-8350463
> - Add the IR test
> - 8350463: AArch64: Add vector rearrange support for small lane count vectors
>
> The AArch64 vector rearrange implementation currently lacks support for
> vector types with lane counts < 4 (see [1]). This limitation results in
> significant performance gaps when running Long/Double vector benchmarks
> on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to
> other SVE and x86 platforms.
>
> Vector rearrange operations depend on vector shuffle inputs, which used
> byte array as payload previously. The minimum vector lane count of 4 for
> byte type on AArch64 imposed this limitation on rearrange operations.
> However, vector shuffle payload has been updated to use vector-specific
> data types (e.g., `int` for `IntVector`) (see [2]). This change enables
> us to remove the lane count restriction for vector rearrange operations.
>
> This patch added the rearrange support for vector types with small lane
> count. Here are the main changes:
> - Added AArch64 match rule support for `VectorRearrange` with smaller
> lane counts (e.g., `2D/2S`)
> - Relocated NEON implementation from ad file to c2 macro assembler file
> for better handling of complex implementation
> - Optimized temporary register usage in NEON implementation for
> short/int/float types from two registers to one
>
> Following is the performance improvement data of several Vector API JMH
> benchmarks, on a NVIDIA Grace CPU with NEON and SVE. Performance of the
> same JMH with other vector types remains unchanged.
>
> 1) NEON
>
> JMH on panama-vector:vectorIntrinsics:
> ```
> Benchmark (size) Mode Cnt Units Before After Gain
> Double128Vector.rearrange 1024 thrpt 30 ops/ms 78.060 578.859 7.42x
> Double128Vector.sliceUnary 1024 thrpt 30 ops/ms 72.332 1811.664 25.05x
> Double128Vector.unsliceUnary 1024 thrpt 30 ops/ms 72.256 1812.344 25.08x
> Float64Vector.rearrange 1024 thrpt 30 ops/ms 77.879 558.797 7.18x
> Float64Vector.sliceUnary 1024 thrpt 30 ops/ms 70.528 1981.304 28.09x
> Float64Vector.unsliceUnary 1024 thrpt 30 ops/ms 71.735 1994.168 27.79x
> Int64Vector.rearrange 1024 thrpt 30 ops/ms 76.374 562.106 7.36x
> Int64Vector.slic...
Thanks for adding the IR test!
Actually I just checked, and it seems we don't really have many tests for `rearrange`, and that's not great. The only test I could find was:
`test/hotspot/jtreg/compiler/vectorapi/TestTwoVectorPermute.java`
But the coverage here is not sufficient at all. I think it only covers 256 bit vectors.
Your test now covers 64 bit and 128 bit cases, but not for all types. I guess you are only targetting small types, so I don't want to burden you with writing tests for all sizes now. But we should definitely file an RFE that tests `rearrange` more thoroughly.
Maybe I just didn't find the tests, so please check again yourself first ;)
I have not reviewed the actual aarch64 code, I'll leave that to experts in that field ;)
test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 28:
> 26: * @bug 8350463
> 27: * @summary AArch64: Add vector rearrange support for small lane count vectors
> 28: * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx.*") | os.arch=="aarch64"
Can you please remove the requirement here, and instead move restrictions to just the IR rules? That way the test can run on all platforms, but the IR verification is only executed on the specified platforms.
test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 94:
> 92: fsrc[i] = random.nextFloat();
> 93: dsrc[i] = random.nextDouble();
> 94: }
Could you please use `Generators.java`? This makes sure that we have more "interesting" values in the distribution.
test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 115:
> 113: .intoArray(bdst, i);
> 114: }
> 115: }
Is there any verification here that the result is correct? I usually do that with a `GOLD` value computed once at the beginning (in interpreter mode because C2 compilation only happens later), and then a `@Check` method where I compare the results.
test/hotspot/jtreg/compiler/vectorapi/VectorRearrangeTest.java line 219:
> 217: TestFramework testFramework = new TestFramework();
> 218: testFramework.setDefaultWarmup(10000)
> 219: .addFlags("--add-modules=jdk.incubator.vector", "-XX:-TieredCompilation")
Suggestion:
.addFlags("--add-modules=jdk.incubator.vector")
I don't think that `TieredCompilation` should be necessary here. Or was it?
-------------
PR Review: https://git.openjdk.org/jdk/pull/23790#pullrequestreview-2689316798
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998060161
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998061285
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998061844
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1998062644
More information about the hotspot-compiler-dev
mailing list