RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation
erifan
duke at openjdk.org
Tue Aug 12 09:10:01 UTC 2025
Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
1. **Subword types** on SVE2-capable hardware.
2. **All types** on NEON and SVE1 environments.
As a result, the `expand` API performs poorly in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
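For reference, the semantics of `expand` can be modeled in scalar Java as follows (a sketch of the operation's contract, not the VectorAPI implementation; class and method names here are illustrative): the i-th active lane of the result receives the i-th element of the source, counted from lane 0, and inactive lanes are zeroed.

```java
import java.util.Arrays;

// Scalar reference model of the expand operation's semantics.
public class ExpandModel {
    static byte[] expand(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;                        // next source element to place
        for (int lane = 0; lane < src.length; lane++) {
            dst[lane] = mask[lane] ? src[next++] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes();
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) mask[i] = (i % 4) < 2;   // 1 1 0 0 repeated
        // Lanes 0,1 get a,b; lanes 4,5 get c,d; and so on.
        System.out.println(new String(expand(src, mask)).replace('\0', '0'));
        // prints "ab00cd00ef00gh00"
    }
}
```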
Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
To compute: dst = src.expand(mask)
Data direction: high <== low
Input:
src = p o n m l k j i h g f e d c b a
mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
Expected result:
dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
Step 1: calculate the index input of the TBL instruction.
// Set tmp1 as all 0 vector.
tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
// Move the mask bits from the predicate register to a vector register.
// **1-bit** mask lane of P register to **8-bit** mask lane of V register.
tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
// Shift the entire register. Prefix sum algorithm.
dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
// Clear inactive elements.
dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
// Subtract 1: inactive lanes become -1 (out of range for TBL, so they read as zero),
// and active lanes become the target source index.
dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
Step 2: shuffle the source vector elements into the destination vector with TBL.
tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
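The two steps above can be sketched in scalar Java (a model of the algorithm only, not the generated code): the 128-bit vector is a byte[16] with lane 0 lowest, each "shift by 8*k bits" moves lane values up by k lanes, and TBL yields 0 for any out-of-range index, including -1.

```java
import java.util.Arrays;

// Scalar sketch of the TBL-index prefix sum (Step 1) and the table lookup (Step 2).
public class ExpandIndex {
    // Scalar analogue of shifting the whole vector left by 8*k bits:
    // every lane value moves up k lanes, low lanes fill with 0.
    static int[] shiftLanes(int[] v, int k) {
        int[] r = new int[v.length];
        for (int i = k; i < v.length; i++) r[i] = v[i - k];
        return r;
    }

    // Step 1: per-lane inclusive prefix sum of the mask, then select/subtract.
    static int[] tblIndices(boolean[] mask) {
        int n = mask.length;
        int[] tmp2 = new int[n];
        for (int i = 0; i < n; i++) tmp2[i] = mask[i] ? 1 : 0;
        for (int k = 1; k < n; k <<= 1) {    // shifts by 1, 2, 4, 8 lanes
            int[] dst = shiftLanes(tmp2, k);
            for (int i = 0; i < n; i++) tmp2[i] += dst[i];
        }
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            // Active lane: inclusive count minus 1 = target source index.
            // Inactive lane: -1, which TBL treats as out of range.
            idx[i] = mask[i] ? tmp2[i] - 1 : -1;
        }
        return idx;
    }

    // Step 2: scalar analogue of TBL; out-of-range indices produce 0.
    static byte[] tbl(byte[] src, int[] idx) {
        byte[] dst = new byte[idx.length];
        for (int i = 0; i < idx.length; i++)
            dst[i] = (idx[i] >= 0 && idx[i] < src.length) ? src[idx[i]] : 0;
        return dst;
    }

    public static void main(String[] args) {
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) mask[i] = (i % 4) < 2;
        // Lanes 0,1,4,5,8,9,12,13 get indices 0..7; the rest get -1.
        System.out.println(Arrays.toString(tblIndices(mask)));
    }
}
```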
The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.
The following benchmarks are from panama-vector/vectorIntrinsics.
On Nvidia Grace machine with option `-XX:UseSVE=2`:
Benchmark  Unit  Score (before)  Error  Score (after)  Error  Uplift
Byte128Vector.expand ops/ms 1791.022366 5.619883 9633.388683 1.968788 5.37
Double128Vector.expand ops/ms 4489.255846 0.48485 4488.772949 0.491596 0.99
Float128Vector.expand ops/ms 8863.02424 6.888087 8908.352235 51.487453 1
Int128Vector.expand ops/ms 8873.485683 3.275682 8879.635643 1.243863 1
Long128Vector.expand ops/ms 4485.1149 4.458073 4489.365269 0.851093 1
Short128Vector.expand ops/ms 792.068834 2.640398 5880.811288 6.40683 7.42
Byte64Vector.expand ops/ms 854.455002 8.548982 5999.046295 37.209987 7.02
Double64Vector.expand ops/ms 46.49763 0.104773 46.526043 0.102451 1
Float64Vector.expand ops/ms 4510.596811 0.504477 4509.984244 1.519178 0.99
Int64Vector.expand ops/ms 4508.778322 1.664461 4535.216611 26.742484 1
Long64Vector.expand ops/ms 45.665462 0.705485 46.496232 0.075648 1.01
Short64Vector.expand ops/ms 394.527324 1.284691 3860.199621 0.720015 9.78
On Nvidia Grace machine with option `-XX:UseSVE=1`:
Benchmark  Unit  Score (before)  Error  Score (after)  Error  Uplift
Byte128Vector.expand ops/ms 1767.314171 12.431526 9630.892248 1.478813 5.44
Double128Vector.expand ops/ms 197.614381 0.945541 2416.075281 2.664325 12.22
Float128Vector.expand ops/ms 390.878183 2.089234 3844.011978 3.792751 9.83
Int128Vector.expand ops/ms 394.550044 2.025371 3843.280133 3.528017 9.74
Long128Vector.expand ops/ms 198.366863 0.651726 2423.234639 4.911434 12.21
Short128Vector.expand ops/ms 790.044704 3.339363 5885.595035 1.440598 7.44
Byte64Vector.expand ops/ms 853.479119 7.158898 5942.750116 1.054905 6.96
Double64Vector.expand ops/ms 46.550458 0.079191 46.423053 0.057554 0.99
Float64Vector.expand ops/ms 197.977215 1.156535 2445.010767 1.992358 12.34
Int64Vector.expand ops/ms 198.326857 1.02785 2444.211583 2.5432 12.32
Long64Vector.expand ops/ms 46.526513 0.25779 45.984253 0.566691 0.98
Short64Vector.expand ops/ms 398.649412 1.87764 3837.495773 3.528926 9.62
On Nvidia Grace machine with option `-XX:UseSVE=0`:
Benchmark  Unit  Score (before)  Error  Score (after)  Error  Uplift
Byte128Vector.expand ops/ms 1802.98702 6.906394 9427.491602 2.067934 5.22
Double128Vector.expand ops/ms 198.498191 0.429071 1190.476326 0.247358 5.99
Float128Vector.expand ops/ms 392.849005 2.034676 2373.195574 2.006566 6.04
Int128Vector.expand ops/ms 395.69179 2.194773 2372.084745 2.058303 5.99
Long128Vector.expand ops/ms 198.191673 1.476362 1189.712301 1.006821 6
Short128Vector.expand ops/ms 795.785831 5.62611 4731.514053 2.365213 5.94
Byte64Vector.expand ops/ms 843.549268 7.174254 5865.556155 37.639415 6.95
Double64Vector.expand ops/ms 45.943599 0.484743 46.529755 0.111551 1.01
Float64Vector.expand ops/ms 193.945993 0.943338 1463.836772 0.618393 7.54
Int64Vector.expand ops/ms 194.168021 0.492286 1473.004575 8.802656 7.58
Long64Vector.expand ops/ms 46.570488 0.076372 46.696353 0.078649 1
Short64Vector.expand ops/ms 387.973334 2.367312 2920.428114 0.863635 7.52
JTReg test cases have been added for these changes. The patch was tested on both aarch64 and x64; all tier1, tier2, and tier3 tests passed.
-------------
Commit messages:
- 8363989: AArch64: Add missing backend support of VectorAPI expand operation
Changes: https://git.openjdk.org/jdk/pull/26740/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8363989
Stats: 482 lines in 9 files changed: 386 ins; 12 del; 84 mod
Patch: https://git.openjdk.org/jdk/pull/26740.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26740/head:pull/26740
PR: https://git.openjdk.org/jdk/pull/26740