RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2]
erifan
duke at openjdk.org
Wed Sep 17 03:16:35 UTC 2025
On Tue, 16 Sep 2025 07:02:23 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>>
>> - Merge branch 'master' into JDK-8366333-compress
>> - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>>
>> The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>> subword-type `compress` operations. Therefore, the current implementation
>> uses the 32-bit SVE `compact` instruction to compress subword types by
>> first widening the high and low parts to 32 bits, compressing them, and
>> then narrowing them back to their original type. Finally, the high and
>> low parts are merged using the `index + tbl` instructions.
>>
>> This approach is significantly slower compared to architectures with native
>> support. After evaluating all available AArch64 SVE instructions and
>> experimenting with various implementations—such as looping over the active
>> elements, extraction, and insertion—I confirmed that the existing algorithm
>> is optimal given the instruction set. However, there is still room for
>> optimization in the following two aspects:
>> 1. Merging with `index + tbl` is suboptimal due to the high latency of
>> the `index` instruction.
>> 2. For partial subword types, operations on the upper half are unnecessary
>> because those bits are invalid.
>>
>> This pull request introduces the following changes:
>> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>> offer lower latency and higher throughput.
>> 2. Eliminates unnecessary compress operations for partial subword type cases.
>> 3. `sve_compress_byte` now uses one fewer temporary register, which alleviates
>> potential register pressure.
>>
>> Benchmark results demonstrate that these changes significantly improve performance.
>>
>> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>> ```
>> Benchmark                Unit     Before    Error    After     Error   Uplift
>> Byte128Vector.compress   ops/ms   4846.97   26.23    6638.56   31.60   1.36
>> Byte64Vector.compress    ops/ms   2447.69   12.95    7167.68   34.49   2.92
>> Short128Vector.compress  ops/ms   7174.88   40.94    8398.45    9.48   1.17
>> Short64Vector.compress   ops/ms   3618.72    3.04    8618.22   10.91   2.38
>> ```
>>
>> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>> and all tests passed.
>
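
For readers unfamiliar with the approach described in the PR text above, here is a minimal scalar sketch in Java of the split, compact, and merge idea. It is not the HotSpot code: the names `CompressSketch`, `compact`, and `compressBytes` are hypothetical, and the single two-half split is a simplification of the actual widening sequence. The final loop models the effect of a `whilelt + splice` style merge, which takes the first `loCount` lanes from the compacted low half and fills the following lanes from the compacted high half.

```java
import java.util.Arrays;

// Hypothetical scalar model; names and structure are assumptions for illustration only.
final class CompressSketch {

    // Models a 32-bit "compact": selected lanes move to the low end in order,
    // and the remaining lanes are zero.
    static int[] compact(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int j = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[j++] = src[i];
            }
        }
        return dst;
    }

    // Models a byte compress: widen each half to ints, compact each half,
    // narrow back to bytes, then merge the two halves.
    // Assumes src.length is even and mask.length == src.length.
    static byte[] compressBytes(byte[] src, boolean[] mask) {
        int half = src.length / 2;
        int[] lo = new int[half];
        int[] hi = new int[half];
        boolean[] maskLo = new boolean[half];
        boolean[] maskHi = new boolean[half];
        int loCount = 0; // number of active lanes in the low half
        for (int i = 0; i < half; i++) {
            lo[i] = src[i];            // widen low half
            hi[i] = src[half + i];     // widen high half
            maskLo[i] = mask[i];
            maskHi[i] = mask[half + i];
            if (maskLo[i]) {
                loCount++;
            }
        }
        int[] compactLo = compact(lo, maskLo);
        int[] compactHi = compact(hi, maskHi);

        byte[] dst = new byte[src.length];
        // Splice-like merge: the first loCount lanes come from the low result ...
        for (int i = 0; i < loCount; i++) {
            dst[i] = (byte) compactLo[i];
        }
        // ... and the lanes after them come from the high result (narrowed back to bytes).
        for (int i = 0; i < half; i++) {
            dst[loCount + i] = (byte) compactHi[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        boolean[] mask = {false, true, false, true, true, false, false, true};
        // prints [2, 4, 5, 8, 0, 0, 0, 0]
        System.out.println(Arrays.toString(compressBytes(src, mask)));
    }
}
```

In this model, a partial vector (for example a 64-bit byte vector held in a 128-bit register) would only have meaningful data in the low half, which is why the work on the high half can be skipped in that case.
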
> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 36:
>
>> 34: * @key randomness
>> 35: * @library /test/lib /
>> 36: * @summary AArch64: Enhance SVE subword type implementation of vector compress
>
> I would change the summary to something a bit more generic, since the test is not only good for aarch64 / SVE.
> Suggestion:
>
> * @summary IR test for VectorAPI compress

It seems that the summary and the PR title are usually consistent. Is there any convention or rule for this?

> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 228:
>
>> 226: .start();
>> 227: }
>> 228: }
>
> Question: is there already another test that checks `compress`?

Yes. As with `expand`, `compress` is already covered here: https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5357
This new test file is mainly for IR testing.

-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354169473
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354167428