RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress

erifan duke at openjdk.org
Mon Sep 15 10:01:18 UTC 2025


On Thu, 11 Sep 2025 06:07:42 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions.
>> 
>> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations—such as looping over the active elements, extraction, and insertion—I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects:
>> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction.
>> 2. For partial subword types, operations on the upper half are unnecessary because those bits are invalid.
>> 
>> This pull request introduces the following changes:
>> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput.
>> 2. Eliminates unnecessary compress operations for partial subword type cases.
>> 3. For `sve_compress_byte`, one fewer temporary register is used, which alleviates potential register pressure.
>> 
>> Benchmark results demonstrate that these changes significantly improve performance.
>> 
>> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>> 
>> Benchmark	            Unit	Before	 Error	After	 Error	Uplift
>> Byte128Vector.compress	ops/ms	4846.97	 26.23	6638.56	 31.60	1.36
>> Byte64Vector.compress	ops/ms	2447.69	 12.95	7167.68	 34.49	2.92
>> Short128Vector.compress	ops/ms	7174.88	 40.94	8398.45	 9.48	1.17
>> Short64Vector.compress	ops/ms	3618.72	 3.04	8618.22	 10.91	2.38
>> 
>> 
>> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed.
>
> Would it make sense to additionally run the relevant benchmarks on other popular aarch64 platforms such as Graviton, to make sure the improvements are seen there as well?

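@galderz Yeah, absolutely. These are the test results on an **AWS Graviton3 V1 machine**. We see a similar performance gain there: the subword (byte and short) cases improve by roughly 2.0x to 4.6x, while the other element types are unchanged within noise.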

Benchmark | Units | Before | Error | After | Error | Uplift
-- | -- | -- | -- | -- | -- | --
Byte128Vector.compress | ops/ms | 2405.511 | 0.763 | 6116.85 | 17.699 | 2.54284848
Byte64Vector.compress | ops/ms | 1151.662 | 11.262 | 5278.924 | 6.74 | 4.58374419
Double128Vector.compress | ops/ms | 4919.017 | 4.909 | 4940.232 | 20.143 | 1.00431285
Double64Vector.compress | ops/ms | 37.071 | 0.778 | 37.109 | 0.945 | 1.00102506
Float128Vector.compress | ops/ms | 9580.312 | 48.341 | 9586.499 | 74.934 | 1.0006458
Float64Vector.compress | ops/ms | 4943.728 | 7.361 | 4941.917 | 5.871 | 0.99963368
Int128Vector.compress | ops/ms | 9496.991 | 34.972 | 9515.122 | 29.204 | 1.00190913
Int64Vector.compress | ops/ms | 4940.23 | 7.141 | 4941.815 | 5.077 | 1.00032084
Long128Vector.compress | ops/ms | 4918.142 | 14.835 | 4917.148 | 9.05 | 0.99979789
Long64Vector.compress | ops/ms | 36.58 | 0.426 | 36.574 | 0.431 | 0.99983598
Short128Vector.compress | ops/ms | 3343.878 | 0.898 | 6813.421 | 4.143 | 2.03758062
Short64Vector.compress | ops/ms | 1595.358 | 3.37 | 3390.959 | 3.55 | 2.12551603
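
For readers who want a mental model of the quoted description, here is a rough scalar sketch in Java of the widen / `compact` / narrow / merge flow for a byte vector. It is only an illustration, not the HotSpot code: the class `CompressSketch` and all of its method names are made up for this example, the real implementation is emitted by the AArch64 macro assembler, and the merge step below only imitates what `whilelt + splice` achieves (keep the first `count` lanes of the low result, then append the high result).

```java
import java.util.Arrays;

public class CompressSketch {

    // Models SVE `compact` on 32-bit lanes: active lanes are packed toward
    // lane 0 and the remaining lanes are zero-filled.
    static int[] compact(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int next = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[next++] = src[i];
            }
        }
        return dst;
    }

    // Widens lanes [from, to) of a byte vector to 32 bits, compacts them,
    // and narrows the packed result back to bytes.
    static byte[] compressHalf(byte[] v, boolean[] mask, int from, int to) {
        int[] wide = new int[to - from];
        for (int i = from; i < to; i++) {
            wide[i - from] = v[i];
        }
        int[] packed = compact(wide, Arrays.copyOfRange(mask, from, to));
        byte[] narrow = new byte[packed.length];
        for (int i = 0; i < packed.length; i++) {
            narrow[i] = (byte) packed[i];
        }
        return narrow;
    }

    // Imitates `whilelt + splice`: the first `count` lanes come from `lo`,
    // the following lanes are filled from the start of `hi`, the rest stay 0.
    static byte[] splice(byte[] lo, int count, byte[] hi, int lanes) {
        byte[] dst = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            if (i < count) {
                dst[i] = lo[i];
            } else if (i - count < hi.length) {
                dst[i] = hi[i - count];
            }
        }
        return dst;
    }

    // Full-vector compress: handle the low and high halves separately,
    // then merge them at the number of active lanes in the low half.
    static byte[] compress(byte[] v, boolean[] mask) {
        int half = v.length / 2;
        byte[] lo = compressHalf(v, mask, 0, half);
        byte[] hi = compressHalf(v, mask, half, v.length);
        int loCount = 0;
        for (int i = 0; i < half; i++) {
            if (mask[i]) {
                loCount++;
            }
        }
        return splice(lo, loCount, hi, v.length);
    }

    public static void main(String[] args) {
        byte[] v = {1, 2, 3, 4, 5, 6, 7, 8};
        boolean[] m = {true, false, true, false, false, true, true, false};
        // prints [1, 3, 6, 7, 0, 0, 0, 0]
        System.out.println(Arrays.toString(compress(v, m)));
    }
}
```

In this model, the partial-vector optimization from the description corresponds to skipping the second `compressHalf` call when the upper half holds no valid lanes.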

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27188#issuecomment-3291355148


More information about the hotspot-compiler-dev mailing list