RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation
erifan
duke at openjdk.org
Tue Aug 12 09:10:01 UTC 2025
Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
1. **Subword types** on SVE2-capable hardware.
2. **All types** on NEON and SVE1 environments.
As a result, the `expand` API performs poorly in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
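For reference, the semantics of `expand` can be modeled in scalar Java as follows (a sketch of the operation's contract, not the VectorAPI implementation; class and method names here are illustrative): the i-th active lane of the result receives the i-th element of the source, counted from lane 0, and inactive lanes are zeroed.

```java
import java.util.Arrays;

// Scalar reference model of the expand operation's semantics.
public class ExpandModel {
    static byte[] expand(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;                        // next source element to place
        for (int lane = 0; lane < src.length; lane++) {
            dst[lane] = mask[lane] ? src[next++] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes();
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) mask[i] = (i % 4) < 2;   // 1 1 0 0 repeated
        // Lanes 0,1 get a,b; lanes 4,5 get c,d; and so on.
        System.out.println(new String(expand(src, mask)).replace('\0', '0'));
        // prints "ab00cd00ef00gh00"
    }
}
```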
Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
To compute: dst = src.expand(mask)
Data direction: high <== low
Input:
src = p o n m l k j i h g f e d c b a
mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
Expected result:
dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
Step 1: calculate the index input of the TBL instruction.
// Set tmp1 as all 0 vector.
tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
// Move the mask bits from the predicate register to a vector register.
// **1-bit** mask lane of P register to **8-bit** mask lane of V register.
tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
// Shift the entire register. Prefix sum algorithm.
dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
// Clear inactive elements.
dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
// Subtract 1: inactive lanes become -1 (out of range for TBL, so they read as zero),
// and active lanes become the target source index.
dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
Step 2: shuffle the source vector elements into the destination vector with TBL.
tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
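The two steps above can be sketched in scalar Java (a model of the algorithm only, not the generated code): the 128-bit vector is a byte[16] with lane 0 lowest, each "shift by 8*k bits" moves lane values up by k lanes, and TBL yields 0 for any out-of-range index, including -1.

```java
import java.util.Arrays;

// Scalar sketch of the TBL-index prefix sum (Step 1) and the table lookup (Step 2).
public class ExpandIndex {
    // Scalar analogue of shifting the whole vector left by 8*k bits:
    // every lane value moves up k lanes, low lanes fill with 0.
    static int[] shiftLanes(int[] v, int k) {
        int[] r = new int[v.length];
        for (int i = k; i < v.length; i++) r[i] = v[i - k];
        return r;
    }

    // Step 1: per-lane inclusive prefix sum of the mask, then select/subtract.
    static int[] tblIndices(boolean[] mask) {
        int n = mask.length;
        int[] tmp2 = new int[n];
        for (int i = 0; i < n; i++) tmp2[i] = mask[i] ? 1 : 0;
        for (int k = 1; k < n; k <<= 1) {    // shifts by 1, 2, 4, 8 lanes
            int[] dst = shiftLanes(tmp2, k);
            for (int i = 0; i < n; i++) tmp2[i] += dst[i];
        }
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            // Active lane: inclusive count minus 1 = target source index.
            // Inactive lane: -1, which TBL treats as out of range.
            idx[i] = mask[i] ? tmp2[i] - 1 : -1;
        }
        return idx;
    }

    // Step 2: scalar analogue of TBL; out-of-range indices produce 0.
    static byte[] tbl(byte[] src, int[] idx) {
        byte[] dst = new byte[idx.length];
        for (int i = 0; i < idx.length; i++)
            dst[i] = (idx[i] >= 0 && idx[i] < src.length) ? src[idx[i]] : 0;
        return dst;
    }

    public static void main(String[] args) {
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) mask[i] = (i % 4) < 2;
        // Lanes 0,1,4,5,8,9,12,13 get indices 0..7; the rest get -1.
        System.out.println(Arrays.toString(tblIndices(mask)));
    }
}
```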
The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.
The following benchmarks are from panama-vector/vectorIntrinsics.
On Nvidia Grace machine with option `-XX:UseSVE=2`:
Benchmark  Unit  Score (before)  Error  Score (after)  Error  Uplift
Byte128Vector.expand ops/ms 1791.022366 5.619883 9633.388683 1.968788 5.37
Double128Vector.expand ops/ms 4489.255846 0.48485 4488.772949 0.491596 0.99
Float128Vector.expand ops/ms 8863.02424 6.888087 8908.352235 51.487453 1
Int128Vector.expand ops/ms 8873.485683 3.275682 8879.635643 1.243863 1
Long128Vector.expand ops/ms 4485.1149 4.458073 4489.365269 0.851093 1
Short128Vector.expand ops/ms 792.068834 2.640398 5880.811288 6.40683 7.42
Byte64Vector.expand ops/ms 854.455002 8.548982 5999.046295 37.209987 7.02
Double64Vector.expand ops/ms 46.49763 0.104773 46.526043 0.102451 1
Float64Vector.expand ops/ms 4510.596811 0.504477 4509.984244 1.519178 0.99
Int64Vector.expand ops/ms 4508.778322 1.664461 4535.216611 26.742484 1
Long64Vector.expand ops/ms 45.665462 0.705485 46.496232 0.075648 1.01
Short64Vector.expand ops/ms 394.527324 1.284691 3860.199621 0.720015 9.78
On Nvidia Grace machine with option `-XX:UseSVE=1`:
Benchmark  Unit  Score (before)  Error  Score (after)  Error  Uplift
Byte128Vector.expand ops/ms 1767.314171 12.431526 9630.892248 1.478813 5.44
Double128Vector.expand ops/ms 197.614381 0.945541 2416.075281 2.664325 12.22
Float128Vector.expand ops/ms 390.878183 2.089234 3844.011978 3.792751 9.83
Int128Vector.expand ops/ms 394.550044 2.025371 3843.280133 3.528017 9.74
Long128Vector.expand ops/ms 198.366863 0.651726 2423.234639 4.911434 12.21
Short128Vector.expand ops/ms 790.044704 3.339363 5885.595035 1.440598 7.44
Byte64Vector.expand ops/ms 853.479119 7.158898 5942.750116 1.054905 6.96
Double64Vector.expand ops/ms 46.550458 0.079191 46.423053 0.057554 0.99
Float64Vector.expand ops/ms 197.977215 1.156535 2445.010767 1.992358 12.34
Int64Vector.expand ops/ms 198.326857 1.02785 2444.211583 2.5432 12.32
Long64Vector.expand ops/ms 46.526513 0.25779 45.984253 0.566691 0.98
Short64Vector.expand ops/ms 398.649412 1.87764 3837.495773 3.528926 9.62
On Nvidia Grace machine with option `-XX:UseSVE=0`:
Benchmark  Unit  Score (before)  Error  Score (after)  Error  Uplift
Byte128Vector.expand ops/ms 1802.98702 6.906394 9427.491602 2.067934 5.22
Double128Vector.expand ops/ms 198.498191 0.429071 1190.476326 0.247358 5.99
Float128Vector.expand ops/ms 392.849005 2.034676 2373.195574 2.006566 6.04
Int128Vector.expand ops/ms 395.69179 2.194773 2372.084745 2.058303 5.99
Long128Vector.expand ops/ms 198.191673 1.476362 1189.712301 1.006821 6
Short128Vector.expand ops/ms 795.785831 5.62611 4731.514053 2.365213 5.94
Byte64Vector.expand ops/ms 843.549268 7.174254 5865.556155 37.639415 6.95
Double64Vector.expand ops/ms 45.943599 0.484743 46.529755 0.111551 1.01
Float64Vector.expand ops/ms 193.945993 0.943338 1463.836772 0.618393 7.54
Int64Vector.expand ops/ms 194.168021 0.492286 1473.004575 8.802656 7.58
Long64Vector.expand ops/ms 46.570488 0.076372 46.696353 0.078649 1
Short64Vector.expand ops/ms 387.973334 2.367312 2920.428114 0.863635 7.52
JTReg test cases have been added for these changes. The patch was tested on both aarch64 and x64; all tier1, tier2, and tier3 tests passed.
-------------
Commit messages:
- 8363989: AArch64: Add missing backend support of VectorAPI expand operation
Changes: https://git.openjdk.org/jdk/pull/26740/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8363989
Stats: 482 lines in 9 files changed: 386 ins; 12 del; 84 mod
Patch: https://git.openjdk.org/jdk/pull/26740.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26740/head:pull/26740
PR: https://git.openjdk.org/jdk/pull/26740