RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v3]
erifan
duke at openjdk.org
Mon Sep 29 08:01:49 UTC 2025
On Fri, 26 Sep 2025 21:43:03 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:
>> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
>>
>> - Improve some code style
>> - Merge branch 'master' into JDK-8366333-compress
>> - Merge branch 'master' into JDK-8366333-compress
>> - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>>
>> The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>> subword-type `compress` operations. Therefore, the current implementation
>> uses the 32-bit SVE `compact` instruction to compress subword types by
>> first widening the high and low parts to 32 bits, compressing them, and
>> then narrowing them back to their original type. Finally, the high and
>> low parts are merged using the `index + tbl` instructions.
>>
>> This approach is significantly slower than on architectures with native
>> support. After evaluating all available AArch64 SVE instructions and
>> experimenting with various implementations (such as looping over the active
>> elements, extraction, and insertion), I confirmed that the existing algorithm
>> is optimal given the instruction set. However, there is still room for
>> optimization in the following two aspects:
>> 1. Merging with `index + tbl` is suboptimal due to the high latency of
>> the `index` instruction.
>> 2. For partial subword types, operations on the upper half are unnecessary
>> because those bits are invalid.
>>
>> This pull request introduces the following changes:
>> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>> offer lower latency and higher throughput.
>> 2. Eliminates unnecessary compress operations for partial subword type cases.
>> 3. For `sve_compress_byte`, one fewer temporary register is used, reducing
>> potential register pressure.
>>
>> Benchmark results demonstrate that these changes significantly improve performance.
>>
>> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>> ```
>> Benchmark                Unit    Before   Error  After    Error  Uplift
>> Byte128Vector.compress   ops/ms  4846.97  26.23  6638.56  31.60    1.36
>> Byte64Vector.compress    ops/ms  2447.69  12.95  7167.68  34.49    2.92
>> Short128Vector.compress  ops/ms  7174.88  40.94  8398.45   9.48    1.17
>> Short64Vector.compress   ops/ms  3618.72   3.04  8618.22  10.91    2.38
>> ```
>>
>> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>> and all tests passed.
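
As an illustrative aside, here is a minimal scalar model in plain standalone C++ (not HotSpot code; all helper names are hypothetical) of the algorithm and merge step described above. It pins down the semantics of SVE `compact` on each widened half (active lanes packed toward index 0) and of the `whilelt + splice` merge (the first n lanes taken from the low-half result, the rest from the high-half result):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-half compress with the semantics of SVE `compact` on widened lanes:
// elements whose mask bit is set are packed toward the lowest indices.
static std::vector<int16_t> compact_half(const std::vector<int16_t>& src,
                                         const std::vector<bool>& mask) {
  std::vector<int16_t> dst;
  for (size_t i = 0; i < src.size(); i++) {
    if (mask[i]) dst.push_back(src[i]);
  }
  return dst;
}

// Semantics of the `whilelt + splice` merge: `whilelt` builds a predicate
// whose first lo.size() lanes are active, and `splice` takes those lanes
// from the low-half result, filling the remaining lanes from the high-half
// result (unused tail lanes stay zero here).
static std::vector<int16_t> splice_merge(const std::vector<int16_t>& lo,
                                         const std::vector<int16_t>& hi,
                                         size_t vlen) {
  std::vector<int16_t> dst(vlen, 0);
  for (size_t i = 0; i < lo.size(); i++) {
    dst[i] = lo[i];
  }
  for (size_t i = 0; i < hi.size() && lo.size() + i < vlen; i++) {
    dst[lo.size() + i] = hi[i];
  }
  return dst;
}

// Whole-vector subword compress expressed through the two halves.
std::vector<int16_t> compress(const std::vector<int16_t>& src,
                              const std::vector<bool>& mask) {
  size_t half = src.size() / 2;
  std::vector<int16_t> src_lo(src.begin(), src.begin() + half);
  std::vector<int16_t> src_hi(src.begin() + half, src.end());
  std::vector<bool>    mask_lo(mask.begin(), mask.begin() + half);
  std::vector<bool>    mask_hi(mask.begin() + half, mask.end());
  return splice_merge(compact_half(src_lo, mask_lo),
                      compact_half(src_hi, mask_hi),
                      src.size());
}
```

The real implementation does this per 32-bit widened half in vector registers; the model only fixes the lane ordering the description relies on.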
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2208:
>
>> 2206: // Preserves: mask, vtmp_zr
>> 2207: void C2_MacroAssembler::sve_compress_short(FloatRegister dst, FloatRegister src, PRegister mask,
>> 2208: FloatRegister vtmp, FloatRegister vtmp_zr,
>
> On code style: it's confusing to see a temp register used in a non-destructive way to pass a constant. If you want to save on materializing an all-zero vector constant, I suggest naming it differently (e.g., `zr`) and putting the argument before `vtmp`.
Done.
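
For reference, a sketch of what the revised declaration might look like under that suggestion (the trailing parameters are elided, as in the quoted snippet; the exact final name is an assumption):

```cpp
// Hypothetical shape of the declaration after the rename/reorder: the
// all-zero constant input comes before the scratch register, so it reads
// as a preserved input rather than a temp. Remaining parameters elided.
void C2_MacroAssembler::sve_compress_short(FloatRegister dst, FloatRegister src,
                                           PRegister mask, FloatRegister zr,
                                           FloatRegister vtmp, /* ... */
```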
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2274:
>
>> 2272: // mask = 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1, one character is 1 bit.
>> 2273: // Expected result: dst = 0 0 0 0 0 0 0 0 0 0 0 p i g c a
>> 2274: sve_dup(vtmp3, B, 0);
>
> You could declare a local `FloatRegister vzr = vtmp3` and refer to it at all use sites. That would make things clearer.
Done, thanks!
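
A minimal sketch of what that could look like at the top of the routine, assuming `vtmp3` holds the all-zero vector throughout (per the quoted line):

```cpp
// Hypothetical local alias per the suggestion: name the zero-vector temp
// once and use `vzr` at every use site below.
FloatRegister vzr = vtmp3;
sve_dup(vzr, B, 0);   // materialize the all-zero vector in vzr
```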
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2387039117
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2387033499