RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2]
erifan
duke at openjdk.org
Thu Aug 21 07:00:35 UTC 2025
> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
> 1. **Subword types** on SVE2-capable hardware.
> 2. **All types** on NEON and SVE1 environments.
>
> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
>
> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
>
> To compute: dst = src.expand(mask)
> Data direction: high <== low
> Input:
> src = p o n m l k j i h g f e d c b a
> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
> Expected result:
> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>
> Step 1: calculate the index input of the TBL instruction.
>
> // Set tmp1 as all 0 vector.
> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> // Move the mask bits from the predicate register to a vector register.
> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>
> // Shift the entire register. Prefix sum algorithm.
> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>
> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>
> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
>
> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
>
> // Clear inactive elements.
> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>
> // Set the inactive lane value to -1 and set the active lane to the target index.
> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
>
> Step 2: shuffle the source vector elements into the destination vector.
>
> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>
>
> The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.
>
> The following benchmarks are from panama-...
erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
- Improve the comment of the vector expand implementation
- Merge branch 'master' into JDK-8363989
- 8363989: AArch64: Add missing backend support of VectorAPI expand operation
Currently, on AArch64, the VectorAPI `expand` operation is intrinsified
for 32-bit and 64-bit types only when SVE2 is available. In the following
cases, `expand` has not yet been intrinsified:
1. **Subword types** on SVE2-capable hardware.
2. **All types** on NEON and SVE1 environments.
As a result, `expand` API performance is very poor in these scenarios.
This patch intrinsifies the `expand` operation in the above environments.
Since there are no native instructions directly corresponding to `expand`
in these cases, this patch mainly leverages the `TBL` instruction to
implement `expand`. To compute the index input for `TBL`, the prefix sum
algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used.
Take a 128-bit byte vector on SVE2 as an example:
```
To compute: dst = src.expand(mask)
Data direction: high <== low
Input:
src = p o n m l k j i h g f e d c b a
mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
Expected result:
dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
```
Step 1: calculate the index input of the TBL instruction.
```
// Set tmp1 as all 0 vector.
tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
// Move the mask bits from the predicate register to a vector register.
// **1-bit** mask lane of P register to **8-bit** mask lane of V register.
tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
// Shift the entire register. Prefix sum algorithm.
dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
// Clear inactive elements.
dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
// Set the inactive lane value to -1 and set the active lane to the target index.
dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
```
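Step 1 can be modeled in scalar Java (a sketch with hypothetical names; lane 0 is the lowest element, so the whole-register left shifts become lane-wise moves toward higher indices):

```java
public class ExpandIndex {
    // Scalar model of Step 1: compute TBL indices from the mask using the
    // prefix-sum doubling scheme (shift by 1, 2, 4, 8 lanes, accumulating).
    static byte[] tblIndices(boolean[] mask) {
        int n = mask.length;                     // 16 lanes for a 128-bit byte vector
        byte[] tmp2 = new byte[n];
        for (int i = 0; i < n; i++) {
            tmp2[i] = (byte) (mask[i] ? 1 : 0);  // mask bits widened to byte lanes
        }
        // "dst = tmp2 << k" on the whole register moves each byte lane k/8
        // positions toward the high end; adding it back yields the inclusive
        // prefix sum of the mask in log2(n) rounds.
        for (int shift = 1; shift < n; shift <<= 1) {
            byte[] dst = new byte[n];
            for (int i = shift; i < n; i++) {
                dst[i] = tmp2[i - shift];
            }
            for (int i = 0; i < n; i++) {
                tmp2[i] += dst[i];
            }
        }
        // sel(mask, tmp2, 0), then subtract 1: active lanes now hold the
        // source index of the element they receive, inactive lanes hold -1.
        byte[] idx = new byte[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (byte) ((mask[i] ? tmp2[i] : 0) - 1);
        }
        return idx;
    }
}
```

For the mask in the example (low-to-high lanes `1 1 0 0 1 1 0 0 ...`) this produces `0 1 -1 -1 2 3 -1 -1 4 5 -1 -1 6 7 -1 -1`, which printed high-to-low matches the `-1 -1 7 6 ...` row above.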
Step 2: shuffle the source vector elements into the destination vector.
```
tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
```
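Put together, each active lane i receives the source element whose index equals the number of active lanes below i, and the -1 indices are out of range for TBL, which yields zero. A scalar model of the end-to-end semantics (hypothetical names, lane 0 lowest):

```java
public class ExpandModel {
    // Scalar model of the whole expand operation: active lane i receives
    // src[count of active lanes below i]; inactive lanes become zero,
    // matching TBL's behavior for an out-of-range (-1) index.
    static char[] expand(char[] src, boolean[] mask) {
        int n = src.length;
        char[] dst = new char[n];
        int next = 0;                            // running count of active lanes
        for (int i = 0; i < n; i++) {
            int idx = mask[i] ? next++ : -1;     // the same values Step 1 computes
            dst[i] = (idx >= 0) ? src[idx] : 0;  // tbl: out-of-range index -> 0
        }
        return dst;
    }
}
```

With `src = a b c ... p` (lane 0 = a) and the example mask, the result read high-to-low is `0 0 h g 0 0 f e 0 0 d c 0 0 b a`, matching the expected output above.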
The same algorithm is used for NEON and SVE1, but with different
instructions where appropriate.
The following benchmarks are from panama-vector/vectorIntrinsics.
On Nvidia Grace machine with option `-XX:UseSVE=2`:
```
Benchmark               Unit    Score(before)  Error(before)  Score(after)  Error(after)  Uplift
Byte128Vector.expand ops/ms 1791.022366 5.619883 9633.388683 1.968788 5.37
Double128Vector.expand ops/ms 4489.255846 0.48485 4488.772949 0.491596 0.99
Float128Vector.expand ops/ms 8863.02424 6.888087 8908.352235 51.487453 1
Int128Vector.expand ops/ms 8873.485683 3.275682 8879.635643 1.243863 1
Long128Vector.expand ops/ms 4485.1149 4.458073 4489.365269 0.851093 1
Short128Vector.expand ops/ms 792.068834 2.640398 5880.811288 6.40683 7.42
Byte64Vector.expand ops/ms 854.455002 8.548982 5999.046295 37.209987 7.02
Double64Vector.expand ops/ms 46.49763 0.104773 46.526043 0.102451 1
Float64Vector.expand ops/ms 4510.596811 0.504477 4509.984244 1.519178 0.99
Int64Vector.expand ops/ms 4508.778322 1.664461 4535.216611 26.742484 1
Long64Vector.expand ops/ms 45.665462 0.705485 46.496232 0.075648 1.01
Short64Vector.expand ops/ms 394.527324 1.284691 3860.199621 0.720015 9.78
```
On Nvidia Grace machine with option `-XX:UseSVE=1`:
```
Benchmark               Unit    Score(before)  Error(before)  Score(after)  Error(after)  Uplift
Byte128Vector.expand ops/ms 1767.314171 12.431526 9630.892248 1.478813 5.44
Double128Vector.expand ops/ms 197.614381 0.945541 2416.075281 2.664325 12.22
Float128Vector.expand ops/ms 390.878183 2.089234 3844.011978 3.792751 9.83
Int128Vector.expand ops/ms 394.550044 2.025371 3843.280133 3.528017 9.74
Long128Vector.expand ops/ms 198.366863 0.651726 2423.234639 4.911434 12.21
Short128Vector.expand ops/ms 790.044704 3.339363 5885.595035 1.440598 7.44
Byte64Vector.expand ops/ms 853.479119 7.158898 5942.750116 1.054905 6.96
Double64Vector.expand ops/ms 46.550458 0.079191 46.423053 0.057554 0.99
Float64Vector.expand ops/ms 197.977215 1.156535 2445.010767 1.992358 12.34
Int64Vector.expand ops/ms 198.326857 1.02785 2444.211583 2.5432 12.32
Long64Vector.expand ops/ms 46.526513 0.25779 45.984253 0.566691 0.98
Short64Vector.expand ops/ms 398.649412 1.87764 3837.495773 3.528926 9.62
```
On Nvidia Grace machine with option `-XX:UseSVE=0`:
```
Benchmark               Unit    Score(before)  Error(before)  Score(after)  Error(after)  Uplift
Byte128Vector.expand ops/ms 1802.98702 6.906394 9427.491602 2.067934 5.22
Double128Vector.expand ops/ms 198.498191 0.429071 1190.476326 0.247358 5.99
Float128Vector.expand ops/ms 392.849005 2.034676 2373.195574 2.006566 6.04
Int128Vector.expand ops/ms 395.69179 2.194773 2372.084745 2.058303 5.99
Long128Vector.expand ops/ms 198.191673 1.476362 1189.712301 1.006821 6
Short128Vector.expand ops/ms 795.785831 5.62611 4731.514053 2.365213 5.94
Byte64Vector.expand ops/ms 843.549268 7.174254 5865.556155 37.639415 6.95
Double64Vector.expand ops/ms 45.943599 0.484743 46.529755 0.111551 1.01
Float64Vector.expand ops/ms 193.945993 0.943338 1463.836772 0.618393 7.54
Int64Vector.expand ops/ms 194.168021 0.492286 1473.004575 8.802656 7.58
Long64Vector.expand ops/ms 46.570488 0.076372 46.696353 0.078649 1
Short64Vector.expand ops/ms 387.973334 2.367312 2920.428114 0.863635 7.52
```
Some JTReg test cases are added for the above changes. The patch was tested
on both aarch64 and x64; all tier1, tier2, and tier3 tests passed.
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/26740/files
- new: https://git.openjdk.org/jdk/pull/26740/files/86d011ac..a1777974
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=00-01
Stats: 30300 lines in 941 files changed: 17180 ins; 9555 del; 3565 mod
Patch: https://git.openjdk.org/jdk/pull/26740.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26740/head:pull/26740
PR: https://git.openjdk.org/jdk/pull/26740
More information about the hotspot-compiler-dev mailing list