RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2]

Tue Sep 16 07:09:58 UTC 2025

On Mon, 15 Sep 2025 09:58:20 GMT, erifan <duke at openjdk.org> wrote:

>> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions.
>> 
>> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations—such as looping over the active elements, extraction, and insertion—I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects:
>> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction.
>> 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid.
>> 
>> This pull request introduces the following changes:
>> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput.
>> 2. Eliminates unnecessary compress operations for partial subword type cases.
>> 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure.
>> 
>> Benchmark results demonstrate that these changes significantly improve performance.
>> 
>> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>> 
>> Benchmark	            Unit	Before	 Error	After	 Error	Uplift
>> Byte128Vector.compress	ops/ms	4846.97	 26.23	6638.56	 31.60	1.36
>> Byte64Vector.compress	ops/ms	2447.69	 12.95	7167.68	 34.49	2.92
>> Short128Vector.compress	ops/ms	7174.88	 40.94	8398.45	 9.48	1.17
>> Short64Vector.compress	ops/ms	3618.72	 3.04	8618.22	 10.91	2.38
>> 
>> 
>> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed.
>
> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
> 
>  - Merge branch 'master' into JDK-8366333-compress
>  - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>    
>    The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>    subword-type `compress` operations. Therefore, the current implementation
>    uses the 32-bit SVE `compact` instruction to compress subword types by
>    first widening the high and low parts to 32 bits, compressing them, and
>    then narrowing them back to their original type. Finally, the high and
>    low parts are merged using the `index + tbl` instructions.
>    
>    This approach is significantly slower compared to architectures with native
>    support. After evaluating all available AArch64 SVE instructions and
>    experimenting with various implementations—such as looping over the active
>    elements, extraction, and insertion—I confirmed that the existing algorithm
>    is optimal given the instruction set. However, there is still room for
>    optimization in the following two aspects:
>    1. Merging with `index + tbl` is suboptimal due to the high latency of
>    the `index` instruction.
>    2. For partial subword types, operations to the highest half are unnecessary
>    because those bits are invalid.
>    
>    This pull request introduces the following changes:
>    1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>    offer lower latency and higher throughput.
>    2. Eliminates unnecessary compress operations for partial subword type cases.
>    3. For `sve_compress_byte`, one less temporary register is used to alleviate
>    potential register pressure.
>    
>    Benchmark results demonstrate that these changes significantly improve performance.
>    
>    Benchmarks on Nvidia Grace machine with 128-bit SVE:
>    ```
>    Benchmark	        Unit	Before	 Error	After	 Error	Uplift
>    Byte128Vector.compress	ops/ms	4846.97	 26.23	6638.56	 31.60	1.36
>    Byte64Vector.compress	ops/ms	2447.69	 12.95	7167.68	 34.49	2.92
>    Short128Vector.compress	ops/ms	7174.88	 40.94	8398.45	 9.48	1.17
>    Short64Vector.compress	ops/ms	3618.72	 3.04	8618.22	 10.91	2.38
>    ```
>    
>    This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>    and all tests passed.

Drive-by comments, going on vacation soon so don't depend on me fully reviewing this any time soon ;)

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2287:

> 2285:   sve_compress_short(dst, vtmp1, ptmp, vtmp2, vtmp3, pgtmp, extended_size > MaxVectorSize ? MaxVectorSize : extended_size);
> 2286:   // Narrow the result back to type BYTE.
> 2287:   // dst   = 0 0 0 0 0 0 0 0 0 0 0 0 0 g c a

Can you make sure that your examples are all nicely aligned?

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2315:

> 2313:   // Combine the compressed low with the compressed high.
> 2314:   // dst   = 0 0 0 0 0 0 0 0 0 0 0 p i g c a
> 2315:   sve_splice(dst, B, ptmp, vtmp1);

Alignment of examples would be nice

test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 36:

> 34:  * @key randomness
> 35:  * @library /test/lib /
> 36:  * @summary AArch64: Enhance SVE subword type implementation of vector compress

I would change the summary to something a bit more generic, since the test is not only good for aarch64 / SVE.
Suggestion:

 * @summary IR test for VectorAPI compress

test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 214:

> 212: 
> 213:     @Test
> 214:     @IR(counts = { IRNode.COMPRESS_VD, "= 1" }, applyIfCPUFeature = { "sve", "true" })

Could you please change this so that the `applyIfCPUFeature` is on a new line?
That would make it easier to add more platforms later :)

test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 228:

> 226:                      .start();
> 227:     }
> 228: }

Question: is there already another test that checks `compress`?

-------------

PR Review: https://git.openjdk.org/jdk/pull/27188#pullrequestreview-3227854355
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351095704
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351097303
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351125031
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351129802
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351138273