RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2]
erifan
duke at openjdk.org
Mon Sep 15 09:58:20 UTC 2025
> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions.
>
> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations—such as looping over the active elements, extraction, and insertion—I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects:
> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction.
> 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid.
>
> This pull request introduces the following changes:
> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput.
> 2. Eliminates unnecessary compress operations for partial subword type cases.
> 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure.
>
> Benchmark results demonstrate that these changes significantly improve performance.
>
> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>
> Benchmark Unit Before Error After Error Uplift
> Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36
> Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92
> Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17
> Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38
>
>
> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed.
erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
- Merge branch 'master' into JDK-8366333-compress
- 8366333: AArch64: Enhance SVE subword type implementation of vector compress
The AArch64 SVE and SVE2 architectures lack an instruction suitable for
subword-type `compress` operations. Therefore, the current implementation
uses the 32-bit SVE `compact` instruction to compress subword types by
first widening the high and low parts to 32 bits, compressing them, and
then narrowing them back to their original type. Finally, the high and
low parts are merged using the `index + tbl` instructions.
This approach is significantly slower compared to architectures with native
support. After evaluating all available AArch64 SVE instructions and
experimenting with various implementations—such as looping over the active
elements, extraction, and insertion—I confirmed that the existing algorithm
is optimal given the instruction set. However, there is still room for
optimization in the following two aspects:
1. Merging with `index + tbl` is suboptimal due to the high latency of
the `index` instruction.
2. For partial subword types, operations to the highest half are unnecessary
because those bits are invalid.
This pull request introduces the following changes:
1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
offer lower latency and higher throughput.
2. Eliminates unnecessary compress operations for partial subword type cases.
3. For `sve_compress_byte`, one less temporary register is used to alleviate
potential register pressure.
Benchmark results demonstrate that these changes significantly improve performance.
Benchmarks on Nvidia Grace machine with 128-bit SVE:
```
Benchmark Unit Before Error After Error Uplift
Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36
Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92
Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17
Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38
```
This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments,
and all tests passed.
-------------
Changes: https://git.openjdk.org/jdk/pull/27188/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27188&range=01
Stats: 414 lines in 9 files changed: 297 ins; 24 del; 93 mod
Patch: https://git.openjdk.org/jdk/pull/27188.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/27188/head:pull/27188
PR: https://git.openjdk.org/jdk/pull/27188
More information about the hotspot-compiler-dev
mailing list