RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v3]

erifan duke at openjdk.org
Mon Sep 29 08:01:49 UTC 2025


On Fri, 26 Sep 2025 21:43:03 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
>> 
>>  - Improve some code style
>>  - Merge branch 'master' into JDK-8366333-compress
>>  - Merge branch 'master' into JDK-8366333-compress
>>  - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>>    
>>    The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>>    subword-type `compress` operations. Therefore, the current implementation
>>    uses the 32-bit SVE `compact` instruction to compress subword types by
>>    first widening the high and low parts to 32 bits, compressing them, and
>>    then narrowing them back to their original type. Finally, the high and
>>    low parts are merged using the `index + tbl` instructions.
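>>
>>    For reference, a minimal scalar C++ sketch of what `compress` computes
>>    (illustrative only, not the HotSpot code):
>>    ```
>>    #include <cstddef>
>>    #include <cstdint>
>>    #include <vector>
>>
>>    // Scalar model of vector `compress`: lanes whose mask bit is set are
>>    // packed towards lane 0; the remaining lanes are zeroed.
>>    std::vector<int16_t> compress(const std::vector<int16_t>& src,
>>                                  const std::vector<bool>& mask) {
>>      std::vector<int16_t> dst(src.size(), 0);
>>      std::size_t j = 0;
>>      for (std::size_t i = 0; i < src.size(); i++) {
>>        if (mask[i]) {
>>          dst[j++] = src[i];
>>        }
>>      }
>>      return dst;
>>    }
>>    ```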
>>    
>>    This approach is significantly slower than on architectures with native
>>    support. After evaluating the available AArch64 SVE instructions and
>>    experimenting with various implementations (such as looping over the active
>>    elements, extraction, and insertion), I confirmed that the existing algorithm
>>    is optimal given the instruction set. However, there is still room for
>>    optimization in the following two respects:
>>    1. Merging with `index + tbl` is suboptimal due to the high latency of
>>    the `index` instruction.
>>    2. For partial subword types, operations to the highest half are unnecessary
>>    because those bits are invalid.
>>    
>>    This pull request introduces the following changes:
>>    1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>>    offer lower latency and higher throughput (a scalar sketch of this merge
>>    follows the list below).
>>    2. Eliminates unnecessary compress operations for partial subword type cases.
>>    3. For `sve_compress_byte`, uses one fewer temporary register to alleviate
>>    potential register pressure.
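>>
>>    A scalar model of the `whilelt + splice` merge (an illustration, not the
>>    actual macro-assembler code): with `n` active elements in the compressed
>>    low half, `whilelt` builds a predicate whose first `n` lanes are active,
>>    and `splice` copies those lanes from its first operand and fills the
>>    remaining lanes from its second operand:
>>    ```
>>    #include <cstddef>
>>    #include <cstdint>
>>    #include <vector>
>>
>>    // lo and hi are the independently compressed halves (equal length);
>>    // n is the number of mask-set elements that came from the low half.
>>    std::vector<int16_t> splice_merge(const std::vector<int16_t>& lo,
>>                                      const std::vector<int16_t>& hi,
>>                                      std::size_t n) {
>>      std::vector<int16_t> dst(lo.begin(), lo.begin() + n); // first n lanes of lo
>>      dst.insert(dst.end(), hi.begin(), hi.end());          // then lanes of hi
>>      dst.resize(lo.size());                                // clamp to vector length
>>      return dst;
>>    }
>>    ```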
>>    
>>    Benchmark results demonstrate that these changes significantly improve performance.
>>    
>>    Benchmarks on Nvidia Grace machine with 128-bit SVE:
>>    ```
>>    Benchmark                Unit    Before   Error  After    Error  Uplift
>>    Byte128Vector.compress   ops/ms  4846.97  26.23  6638.56  31.60  1.36
>>    Byte64Vector.compress    ops/ms  2447.69  12.95  7167.68  34.49  2.92
>>    Short128Vector.compress  ops/ms  7174.88  40.94  8398.45   9.48  1.17
>>    Short64Vector.compress   ops/ms  3618.72   3.04  8618.22  10.91  2.38
>>    ```
>>    
>>    This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>>    and all tests passed.
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2208:
> 
>> 2206: // Preserves: mask, vtmp_zr
>> 2207: void C2_MacroAssembler::sve_compress_short(FloatRegister dst, FloatRegister src, PRegister mask,
>> 2208:                                            FloatRegister vtmp, FloatRegister vtmp_zr,
> 
> On code style: it's confusing to see a temp register used in a non-destructive way to pass a constant. If you want to save on materializing an all-zero vector constant, I suggest naming it differently (e.g., `zr`) and putting the argument before `vtmp`.

Done.
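
For reference, the updated declaration would look roughly like this (a sketch of the suggested rename and reordering; the trailing parameters are elided as in the quote above, and the exact shape is in the PR diff):
```
// Hypothetical shape after the suggestion: the all-zero vector is
// named vzr and passed before the scratch register vtmp.
void C2_MacroAssembler::sve_compress_short(FloatRegister dst, FloatRegister src,
                                           PRegister mask, FloatRegister vzr,
                                           FloatRegister vtmp, /* ... */
```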

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2274:
> 
>> 2272:   //                  mask  = 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1, one character is 1 bit.
>> 2273:   // Expected result: dst   = 0 0 0 0 0 0 0 0 0 0 0 p i g c a
>> 2274:   sve_dup(vtmp3, B, 0);
> 
> For clarity, you could declare a local `FloatRegister vzr = vtmp3` and refer to it at all use sites.

Done, thanks!
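
For reference, with the alias in place the zeroing reads along these lines (sketch; see the PR diff for the exact code):
```
// Alias the temp that holds the all-zero vector so the intent is
// visible at every use site.
FloatRegister vzr = vtmp3;
sve_dup(vzr, B, 0);   // set every byte lane of vzr to 0
```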

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2387039117
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2387033499

