RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation

Wed Aug 20 11:30:37 UTC 2025

On Tue, 12 Aug 2025 09:02:01 GMT, erifan <duke at openjdk.org> wrote:

> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
> 1. **Subword types** on SVE2-capable hardware.
> 2. **All types** on NEON and SVE1 environments.
> 
> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
> 
> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
> 
> To compute: dst = src.expand(mask)
> Data direction: high <== low
> Input:
>   src                         = p o n m l k j i h g f e d c b a
>   mask                        = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
> Expected result:
>   dst                         = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
> 
> Step 1: calculate the index input of the TBL instruction.
> 
> // Set tmp1 as all 0 vector.
> tmp1                          = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 
> // Move the mask bits from the predicate register to a vector register:
> // each **1-bit** mask lane of the P register becomes an **8-bit** mask lane of the V register.
> tmp2 = mask                   = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
> 
> // Prefix sum: shift the entire register toward the high lanes by 8, 16, 32 and 64 bits, accumulating after each shift.
> dst = tmp2 << 8               = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
> tmp2 += dst                   = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
> 
> dst = tmp2 << 16              = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
> tmp2 += dst                   = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
> 
> dst = tmp2 << 32              = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
> tmp2 += dst                   = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
> 
> dst = tmp2 << 64              = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
> tmp2 += dst                   = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
> 
> // Clear inactive elements.
> dst = sel(mask, tmp2, tmp1)   = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
> 
> // Subtract 1 so that each active lane holds its zero-based source index and each inactive lane holds -1 (an out-of-range TBL index, which yields 0).
> dst -= 1                      = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
> 
> Step 2: shuffle the source vector elements to the target vector
> 
> tbl(dst, src, dst)            = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
> 
> 
> The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.
> 
> The following benchmarks are from panama-...

The algorithm description here is great. Please paste all of it, from "Since there are" to "but with different instructions where appropriate.", into this PR as a comment before the vector expand implementation.
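
For readers following the quoted walkthrough, the two steps can be modeled in scalar Java roughly as below. This is only a sketch for checking the arithmetic, not the code in the PR: the class and method names (`ExpandSketch`, `tblIndices`, `tbl`, `expandViaTbl`, `expandReference`) are hypothetical, array index 0 stands for the lowest lane, and `tbl` simply assumes that an out-of-range index produces 0, which is the behaviour the description relies on for the -1 lanes.

```java
/** Scalar model of the quoted expand algorithm; every name here is hypothetical. */
final class ExpandSketch {

    /** Reference semantics: active lanes take consecutive source elements, inactive lanes get 0. */
    static byte[] expandReference(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;                              // next unread source lane
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[next++];
            }
        }
        return dst;
    }

    /** Step 1: build the TBL index vector with a shift-and-add prefix sum. */
    static int[] tblIndices(boolean[] mask) {
        int n = mask.length;
        int[] sum = new int[n];
        for (int i = 0; i < n; i++) {
            sum[i] = mask[i] ? 1 : 0;              // predicate bit widened to a lane value
        }
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] shifted = new int[n];            // whole-register shift toward the high lanes, zero fill
            System.arraycopy(sum, 0, shifted, shift, n - shift);
            for (int i = 0; i < n; i++) {
                sum[i] += shifted[i];
            }
        }
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (mask[i] ? sum[i] : 0) - 1;   // sel(mask, sum, 0), then subtract 1
        }
        return idx;
    }

    /** Step 2: table lookup, modeled so that an out-of-range index yields 0. */
    static byte[] tbl(byte[] table, int[] idx) {
        byte[] dst = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) {
            int k = idx[i];
            dst[i] = (k >= 0 && k < table.length) ? table[k] : 0;
        }
        return dst;
    }

    /** Both steps combined: dst = src.expand(mask). */
    static byte[] expandViaTbl(byte[] src, boolean[] mask) {
        return tbl(src, tblIndices(mask));
    }
}
```

With the mask from the example (lanes 0, 1, 4, 5, 8, 9, 12 and 13 active), `tblIndices` gives 0 1 -1 -1 2 3 -1 -1 4 5 -1 -1 6 7 -1 -1 from low to high, matching the `dst -= 1` row when read right to left, and `expandViaTbl` agrees with `expandReference`.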

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3205780702