RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2]
erifan
duke at openjdk.org
Thu Aug 21 07:00:35 UTC 2025
> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
> 1. **Subword types** on SVE2-capable hardware.
> 2. **All types** on NEON and SVE1 environments.
>
> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
>
> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
>
> To compute: dst = src.expand(mask)
> Data direction: high <== low
> Input:
> src = p o n m l k j i h g f e d c b a
> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
> Expected result:
> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>
> Step 1: calculate the index input of the TBL instruction.
>
> // Set tmp1 as all 0 vector.
> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> // Move the mask bits from the predicate register to a vector register.
> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>
> // Shift the entire register. Prefix sum algorithm.
> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>
> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>
> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
>
> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
>
> // Clear inactive elements.
> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>
> // Set the inactive lane value to -1 and set the active lane to the target index.
> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
>
> Step 2: shuffle the source vector elements into the destination vector.
>
> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>
>
> The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.
>
> The following benchmarks are from panama-...
erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
- Improve the comment of the vector expand implementation
- Merge branch 'master' into JDK-8363989
- 8363989: AArch64: Add missing backend support of VectorAPI expand operation
Currently, on AArch64, the VectorAPI `expand` operation is intrinsified
for 32-bit and 64-bit types only when SVE2 is available. In the following
cases, `expand` has not yet been intrinsified:
1. **Subword types** on SVE2-capable hardware.
2. **All types** on NEON and SVE1 environments.
As a result, `expand` API performance is very poor in these scenarios.
This patch intrinsifies the `expand` operation in the above environments.
Since there are no native instructions directly corresponding to `expand`
in these cases, this patch mainly leverages the `TBL` instruction to
implement `expand`. To compute the index input for `TBL`, the prefix sum
algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used.
Take a 128-bit byte vector on SVE2 as an example:
```
To compute: dst = src.expand(mask)
Data direction: high <== low
Input:
src = p o n m l k j i h g f e d c b a
mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
Expected result:
dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
```
Step 1: calculate the index input of the TBL instruction.
```
// Set tmp1 as all 0 vector.
tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
// Move the mask bits from the predicate register to a vector register.
// **1-bit** mask lane of P register to **8-bit** mask lane of V register.
tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
// Shift the entire register. Prefix sum algorithm.
dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
// Clear inactive elements.
dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
// Set the inactive lane value to -1 and set the active lane to the target index.
dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
```
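Step 1 can be modeled in scalar Java (a sketch with hypothetical names; lane 0 is the lowest element, so the whole-register left shifts become lane-wise moves toward higher indices):

```java
public class ExpandIndex {
    // Scalar model of Step 1: compute TBL indices from the mask using the
    // prefix-sum doubling scheme (shift by 1, 2, 4, 8 lanes, accumulating).
    static byte[] tblIndices(boolean[] mask) {
        int n = mask.length;                     // 16 lanes for a 128-bit byte vector
        byte[] tmp2 = new byte[n];
        for (int i = 0; i < n; i++) {
            tmp2[i] = (byte) (mask[i] ? 1 : 0);  // mask bits widened to byte lanes
        }
        // "dst = tmp2 << k" on the whole register moves each byte lane k/8
        // positions toward the high end; adding it back yields the inclusive
        // prefix sum of the mask in log2(n) rounds.
        for (int shift = 1; shift < n; shift <<= 1) {
            byte[] dst = new byte[n];
            for (int i = shift; i < n; i++) {
                dst[i] = tmp2[i - shift];
            }
            for (int i = 0; i < n; i++) {
                tmp2[i] += dst[i];
            }
        }
        // sel(mask, tmp2, 0), then subtract 1: active lanes now hold the
        // source index of the element they receive, inactive lanes hold -1.
        byte[] idx = new byte[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (byte) ((mask[i] ? tmp2[i] : 0) - 1);
        }
        return idx;
    }
}
```

For the mask in the example (low-to-high lanes `1 1 0 0 1 1 0 0 ...`) this produces `0 1 -1 -1 2 3 -1 -1 4 5 -1 -1 6 7 -1 -1`, which printed high-to-low matches the `-1 -1 7 6 ...` row above.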
Step 2: shuffle the source vector elements into the destination vector.
```
tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
```
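Put together, each active lane i receives the source element whose index equals the number of active lanes below i, and the -1 indices are out of range for TBL, which yields zero. A scalar model of the end-to-end semantics (hypothetical names, lane 0 lowest):

```java
public class ExpandModel {
    // Scalar model of the whole expand operation: active lane i receives
    // src[count of active lanes below i]; inactive lanes become zero,
    // matching TBL's behavior for an out-of-range (-1) index.
    static char[] expand(char[] src, boolean[] mask) {
        int n = src.length;
        char[] dst = new char[n];
        int next = 0;                            // running count of active lanes
        for (int i = 0; i < n; i++) {
            int idx = mask[i] ? next++ : -1;     // the same values Step 1 computes
            dst[i] = (idx >= 0) ? src[idx] : 0;  // tbl: out-of-range index -> 0
        }
        return dst;
    }
}
```

With `src = a b c ... p` (lane 0 = a) and the example mask, the result read high-to-low is `0 0 h g 0 0 f e 0 0 d c 0 0 b a`, matching the expected output above.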
The same algorithm is used for NEON and SVE1, but with different
instructions where appropriate.
The following benchmarks are from panama-vector/vectorIntrinsics.
On Nvidia Grace machine with option `-XX:UseSVE=2`:
```
Benchmark               Unit    Score(before)  Error(before)  Score(after)  Error(after)  Uplift
Byte128Vector.expand ops/ms 1791.022366 5.619883 9633.388683 1.968788 5.37
Double128Vector.expand ops/ms 4489.255846 0.48485 4488.772949 0.491596 0.99
Float128Vector.expand ops/ms 8863.02424 6.888087 8908.352235 51.487453 1
Int128Vector.expand ops/ms 8873.485683 3.275682 8879.635643 1.243863 1
Long128Vector.expand ops/ms 4485.1149 4.458073 4489.365269 0.851093 1
Short128Vector.expand ops/ms 792.068834 2.640398 5880.811288 6.40683 7.42
Byte64Vector.expand ops/ms 854.455002 8.548982 5999.046295 37.209987 7.02
Double64Vector.expand ops/ms 46.49763 0.104773 46.526043 0.102451 1
Float64Vector.expand ops/ms 4510.596811 0.504477 4509.984244 1.519178 0.99
Int64Vector.expand ops/ms 4508.778322 1.664461 4535.216611 26.742484 1
Long64Vector.expand ops/ms 45.665462 0.705485 46.496232 0.075648 1.01
Short64Vector.expand ops/ms 394.527324 1.284691 3860.199621 0.720015 9.78
```
On Nvidia Grace machine with option `-XX:UseSVE=1`:
```
Benchmark               Unit    Score(before)  Error(before)  Score(after)  Error(after)  Uplift
Byte128Vector.expand ops/ms 1767.314171 12.431526 9630.892248 1.478813 5.44
Double128Vector.expand ops/ms 197.614381 0.945541 2416.075281 2.664325 12.22
Float128Vector.expand ops/ms 390.878183 2.089234 3844.011978 3.792751 9.83
Int128Vector.expand ops/ms 394.550044 2.025371 3843.280133 3.528017 9.74
Long128Vector.expand ops/ms 198.366863 0.651726 2423.234639 4.911434 12.21
Short128Vector.expand ops/ms 790.044704 3.339363 5885.595035 1.440598 7.44
Byte64Vector.expand ops/ms 853.479119 7.158898 5942.750116 1.054905 6.96
Double64Vector.expand ops/ms 46.550458 0.079191 46.423053 0.057554 0.99
Float64Vector.expand ops/ms 197.977215 1.156535 2445.010767 1.992358 12.34
Int64Vector.expand ops/ms 198.326857 1.02785 2444.211583 2.5432 12.32
Long64Vector.expand ops/ms 46.526513 0.25779 45.984253 0.566691 0.98
Short64Vector.expand ops/ms 398.649412 1.87764 3837.495773 3.528926 9.62
```
On Nvidia Grace machine with option `-XX:UseSVE=0`:
```
Benchmark               Unit    Score(before)  Error(before)  Score(after)  Error(after)  Uplift
Byte128Vector.expand ops/ms 1802.98702 6.906394 9427.491602 2.067934 5.22
Double128Vector.expand ops/ms 198.498191 0.429071 1190.476326 0.247358 5.99
Float128Vector.expand ops/ms 392.849005 2.034676 2373.195574 2.006566 6.04
Int128Vector.expand ops/ms 395.69179 2.194773 2372.084745 2.058303 5.99
Long128Vector.expand ops/ms 198.191673 1.476362 1189.712301 1.006821 6
Short128Vector.expand ops/ms 795.785831 5.62611 4731.514053 2.365213 5.94
Byte64Vector.expand ops/ms 843.549268 7.174254 5865.556155 37.639415 6.95
Double64Vector.expand ops/ms 45.943599 0.484743 46.529755 0.111551 1.01
Float64Vector.expand ops/ms 193.945993 0.943338 1463.836772 0.618393 7.54
Int64Vector.expand ops/ms 194.168021 0.492286 1473.004575 8.802656 7.58
Long64Vector.expand ops/ms 46.570488 0.076372 46.696353 0.078649 1
Short64Vector.expand ops/ms 387.973334 2.367312 2920.428114 0.863635 7.52
```
Some JTReg test cases are added for the above changes. The patch was tested
on both aarch64 and x64; all tier1, tier2, and tier3 tests passed.
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/26740/files
- new: https://git.openjdk.org/jdk/pull/26740/files/86d011ac..a1777974
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=00-01
Stats: 30300 lines in 941 files changed: 17180 ins; 9555 del; 3565 mod
Patch: https://git.openjdk.org/jdk/pull/26740.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26740/head:pull/26740
PR: https://git.openjdk.org/jdk/pull/26740
More information about the hotspot-compiler-dev mailing list