RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2]

Tue Sep 23 10:01:02 UTC 2025

On Tue, 16 Sep 2025 06:54:06 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>> 
>>  - Merge branch 'master' into JDK-8366333-compress
>>  - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>>    
>>    The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>>    subword-type `compress` operations. Therefore, the current implementation
>>    uses the 32-bit SVE `compact` instruction to compress subword types by
>>    first widening the high and low parts to 32 bits, compressing them, and
>>    then narrowing them back to their original type. Finally, the high and
>>    low parts are merged using the `index + tbl` instructions.
>>    
>>    This approach is significantly slower compared to architectures with native
>>    support. After evaluating all available AArch64 SVE instructions and
>>    experimenting with various implementations—such as looping over the active
>>    elements, extraction, and insertion—I confirmed that the existing algorithm
>>    is optimal given the instruction set. However, there is still room for
>>    optimization in the following two aspects:
>>    1. Merging with `index + tbl` is suboptimal due to the high latency of
>>    the `index` instruction.
>>    2. For partial subword types, operations to the highest half are unnecessary
>>    because those bits are invalid.
>>    
>>    This pull request introduces the following changes:
>>    1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>>    offer lower latency and higher throughput.
>>    2. Eliminates unnecessary compress operations for partial subword type cases.
>>    3. For `sve_compress_byte`, one less temporary register is used to alleviate
>>    potential register pressure.
>>    
>>    Benchmark results demonstrate that these changes significantly improve performance.
>>    
>>    Benchmarks on Nvidia Grace machine with 128-bit SVE:
>>    ```
>>    Benchmark	        Unit	Before	 Error	After	 Error	Uplift
>>    Byte128Vector.compress	ops/ms	4846.97	 26.23	6638.56	 31.60	1.36
>>    Byte64Vector.compress	ops/ms	2447.69	 12.95	7167.68	 34.49	2.92
>>    Short128Vector.compress	ops/ms	7174.88	 40.94	8398.45	 9.48	1.17
>>    Short64Vector.compress	ops/ms	3618.72	 3.04	8618.22	 10.91	2.38
>>    ```
>>    
>>    This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>>    and all tests passed.
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2287:
> 
>> 2285:   sve_compress_short(dst, vtmp1, ptmp, vtmp2, vtmp3, pgtmp, extended_size > MaxVectorSize ? MaxVectorSize : extended_size);
>> 2286:   // Narrow the result back to type BYTE.
>> 2287:   // dst   = 0 0 0 0 0 0 0 0 0 0 0 0 0 g c a
> 
> Can you make sure that your examples are all nicely aligned?

Done, thanks.

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2315:
> 
>> 2313:   // Combine the compressed low with the compressed high.
>> 2314:   // dst   = 0 0 0 0 0 0 0 0 0 0 0 p i g c a
>> 2315:   sve_splice(dst, B, ptmp, vtmp1);
> 
> Alignment of examples would be nice

Done

> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 214:
> 
>> 212: 
>> 213:     @Test
>> 214:     @IR(counts = { IRNode.COMPRESS_VD, "= 1" }, applyIfCPUFeature = { "sve", "true" })
> 
> Could you please change this so that the `applyIfCPUFeature` is on a new line?
> That would make it easier to add more platforms later :)

Done

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2371774311
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2371775262
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2371776740