RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2]
erifan
duke at openjdk.org
Tue Sep 23 10:01:02 UTC 2025
On Tue, 16 Sep 2025 06:54:06 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>>
>> - Merge branch 'master' into JDK-8366333-compress
>> - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>>
>> The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>> subword-type `compress` operations. Therefore, the current implementation
>> uses the 32-bit SVE `compact` instruction to compress subword types by
>> first widening the high and low parts to 32 bits, compressing them, and
>> then narrowing them back to their original type. Finally, the high and
>> low parts are merged using the `index + tbl` instructions.
>>
>> This approach is significantly slower compared to architectures with native
>> support. After evaluating all available AArch64 SVE instructions and
>> experimenting with various implementations—such as looping over the active
>> elements, extraction, and insertion—I confirmed that the existing algorithm
>> is optimal given the instruction set. However, there is still room for
>> optimization in the following two aspects:
>> 1. Merging with `index + tbl` is suboptimal due to the high latency of
>> the `index` instruction.
>> 2. For partial subword types, operations on the high half are unnecessary
>> because those bits are invalid.
>>
>> This pull request introduces the following changes:
>> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>> offer lower latency and higher throughput.
>> 2. Eliminates unnecessary compress operations for partial subword type cases.
>> 3. For `sve_compress_byte`, one fewer temporary register is used to alleviate
>> potential register pressure.
>>
>> Benchmark results demonstrate that these changes significantly improve performance.
>>
>> Benchmarks on an Nvidia Grace machine with 128-bit SVE:
>> ```
>> Benchmark               Unit    Before   Error  After    Error  Uplift
>> Byte128Vector.compress  ops/ms  4846.97  26.23  6638.56  31.60  1.36
>> Byte64Vector.compress   ops/ms  2447.69  12.95  7167.68  34.49  2.92
>> Short128Vector.compress ops/ms  7174.88  40.94  8398.45   9.48  1.17
>> Short64Vector.compress  ops/ms  3618.72   3.04  8618.22  10.91  2.38
>> ```
>>
>> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>> and all tests passed.
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2287:
>
>> 2285: sve_compress_short(dst, vtmp1, ptmp, vtmp2, vtmp3, pgtmp, extended_size > MaxVectorSize ? MaxVectorSize : extended_size);
>> 2286: // Narrow the result back to type BYTE.
>> 2287: // dst = 0 0 0 0 0 0 0 0 0 0 0 0 0 g c a
>
> Can you make sure that your examples are all nicely aligned?
Done, thanks.
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2315:
>
>> 2313: // Combine the compressed low with the compressed high.
>> 2314: // dst = 0 0 0 0 0 0 0 0 0 0 0 p i g c a
>> 2315: sve_splice(dst, B, ptmp, vtmp1);
>
> Alignment of examples would be nice
Done
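
For anyone following the thread without the patch open, the combine step discussed above can be modelled in scalar Java. This is only a sketch of the semantics, not the HotSpot code: the class name `CompressSketch` and both method names are made up for illustration, and the lane values mirror the `p i g c a` example in the quoted comment, assuming lane 0 holds `a`. It compresses the low and high halves separately and then appends the compressed high half right after the active low lanes, which is roughly what `whilelt + splice` achieves.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CompressSketch {
    // Reference semantics of compress: pack the lanes whose mask bit is set
    // into the low end of the result and leave the remaining lanes zero.
    public static byte[] compress(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int j = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[j++] = src[i];
            }
        }
        return dst;
    }

    // Illustrative model (an assumption, not the actual assembler sequence):
    // compress each half on its own, then splice the two results together.
    public static byte[] compressBySplitting(byte[] src, boolean[] mask) {
        int half = src.length / 2;
        byte[] lo = compress(Arrays.copyOfRange(src, 0, half),
                             Arrays.copyOfRange(mask, 0, half));
        byte[] hi = compress(Arrays.copyOfRange(src, half, src.length),
                             Arrays.copyOfRange(mask, half, src.length));

        // whilelt-like step: count the active lanes in the low half, i.e. a
        // predicate whose first loCount lanes are true.
        int loCount = 0;
        for (int i = 0; i < half; i++) {
            if (mask[i]) {
                loCount++;
            }
        }

        // splice-like step: take the first loCount lanes of lo, then fill the
        // following lanes from hi starting at its lane 0.
        byte[] dst = new byte[src.length];
        System.arraycopy(lo, 0, dst, 0, loCount);
        System.arraycopy(hi, 0, dst, loCount, hi.length);
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        boolean[] mask = new boolean[src.length];
        for (int lane : new int[] {0, 2, 6, 8, 15}) {
            mask[lane] = true; // selects a, c, g, i, p
        }
        byte[] expected = compress(src, mask);
        byte[] actual = compressBySplitting(src, mask);
        System.out.println(Arrays.equals(expected, actual)); // prints "true"
        System.out.println(new String(actual, 0, 5, StandardCharsets.US_ASCII)); // prints "acgip"
    }
}
```

The model ignores the widening to 32 bits and the narrowing back, since those only exist because `compact` has no subword form; the point is just that the merge needs nothing more than the active-lane count of the low half.
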
> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 214:
>
>> 212:
>> 213: @Test
>> 214: @IR(counts = { IRNode.COMPRESS_VD, "= 1" }, applyIfCPUFeature = { "sve", "true" })
>
> Could you please change this so that the `applyIfCPUFeature` is on a new line?
> That would make it easier to add more platforms later :)
Done
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2371774311
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2371775262
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2371776740