RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2]

Wed Sep 17 03:16:35 UTC 2025

On Tue, 16 Sep 2025 07:02:23 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>> 
>>  - Merge branch 'master' into JDK-8366333-compress
>>  - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>>    
>>    The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>>    subword-type `compress` operations. Therefore, the current implementation
>>    uses the 32-bit SVE `compact` instruction to compress subword types by
>>    first widening the high and low parts to 32 bits, compressing them, and
>>    then narrowing them back to their original type. Finally, the high and
>>    low parts are merged using the `index + tbl` instructions.
>>    
>>    This approach is significantly slower compared to architectures with native
>>    support. After evaluating all available AArch64 SVE instructions and
>>    experimenting with various implementations—such as looping over the active
>>    elements, extraction, and insertion—I confirmed that the existing algorithm
>>    is optimal given the instruction set. However, there is still room for
>>    optimization in the following two aspects:
>>    1. Merging with `index + tbl` is suboptimal due to the high latency of
>>    the `index` instruction.
>>    2. For partial subword types, operations on the upper half are unnecessary
>>    because those bits are invalid.
>>    
>>    This pull request introduces the following changes:
>>    1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>>    offer lower latency and higher throughput.
>>    2. Eliminates unnecessary compress operations for partial subword type cases.
>>    3. For `sve_compress_byte`, one fewer temporary register is used to alleviate
>>    potential register pressure.
>>    
>>    Benchmark results demonstrate that these changes significantly improve performance.
>>    
>>    Benchmarks on Nvidia Grace machine with 128-bit SVE:
>>    ```
>>    Benchmark                Unit    Before   Error  After    Error  Uplift
>>    Byte128Vector.compress   ops/ms  4846.97  26.23  6638.56  31.60  1.36
>>    Byte64Vector.compress    ops/ms  2447.69  12.95  7167.68  34.49  2.92
>>    Short128Vector.compress  ops/ms  7174.88  40.94  8398.45   9.48  1.17
>>    Short64Vector.compress   ops/ms  3618.72   3.04  8618.22  10.91  2.38
>>    ```
>>    
>>    This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
>>    and all tests passed.
>
> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 36:
> 
>> 34:  * @key randomness
>> 35:  * @library /test/lib /
>> 36:  * @summary AArch64: Enhance SVE subword type implementation of vector compress
> 
> I would change the summary to something a bit more generic, since the test is not only good for aarch64 / SVE.
> Suggestion:
> 
>  * @summary IR test for VectorAPI compress

It seems that the summary and the PR title are usually consistent. Is there any convention or rule for this?

> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 228:
> 
>> 226:                      .start();
>> 227:     }
>> 228: }
> 
> Question: is there already another test that checks `compress`?

Yes. Like `expand`, `compress` is already covered by the existing Vector API tests, for example here: https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5357
The new test file is mainly for IR testing.
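
For context, here is a minimal scalar sketch of the operation under test. It is not the HotSpot implementation, and the class and method names (`CompressSketch`, `compressReference`, `compressByHalves`) are hypothetical. It assumes the Vector API semantics of `compress`: lanes selected by the mask are packed toward lane 0 and the remaining lanes are zero. The second method models, in plain Java, the idea from the PR description of compressing the low and high halves separately and then joining them the way a `whilelt + splice` pair would.

```java
import java.util.Arrays;

final class CompressSketch {
    // Scalar model of compress: pack the lanes whose mask bit is set
    // toward lane 0 and leave the remaining lanes zero.
    static byte[] compressReference(byte[] lanes, boolean[] mask) {
        byte[] out = new byte[lanes.length];
        int n = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[n++] = lanes[i];
            }
        }
        return out;
    }

    // Illustrative model of the split approach: compress each half on its
    // own, then merge. The merge mimics a splice governed by a predicate
    // covering the first lowCount lanes (what whilelt would produce).
    static byte[] compressByHalves(byte[] lanes, boolean[] mask) {
        int half = lanes.length / 2;
        byte[] low = compressReference(
                Arrays.copyOfRange(lanes, 0, half),
                Arrays.copyOfRange(mask, 0, half));
        byte[] high = compressReference(
                Arrays.copyOfRange(lanes, half, lanes.length),
                Arrays.copyOfRange(mask, half, lanes.length));

        // Number of active lanes in the low half.
        int lowCount = 0;
        for (int i = 0; i < half; i++) {
            if (mask[i]) {
                lowCount++;
            }
        }

        // Take the first lowCount lanes of the compressed low half,
        // then continue with the compressed high half.
        byte[] out = new byte[lanes.length];
        System.arraycopy(low, 0, out, 0, lowCount);
        System.arraycopy(high, 0, out, lowCount, high.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        boolean[] mask = {true, false, true, false, false, true, true, false};
        System.out.println(Arrays.toString(compressReference(lanes, mask)));
        System.out.println(Arrays.toString(compressByHalves(lanes, mask)));
    }
}
```

Both methods should produce the same array for any input. The existing `Byte128VectorTests` check results of this kind, while the new IR test checks which nodes C2 generates.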

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354169473
PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354167428