RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation
Andrew Haley
aph at openjdk.org
Wed Aug 20 11:30:37 UTC 2025
On Tue, 12 Aug 2025 09:02:01 GMT, erifan <duke at openjdk.org> wrote:
> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
> 1. **Subword types** on SVE2-capable hardware.
> 2. **All types** on NEON and SVE1 environments.
>
> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
>
> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
>
> To compute: dst = src.expand(mask)
> Data direction: high <== low
> Input:
> src = p o n m l k j i h g f e d c b a
> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
> Expected result:
> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>
> Step 1: calculate the index input of the TBL instruction.
>
> // Set tmp1 as all 0 vector.
> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> // Move the mask bits from the predicate register to a vector register.
> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>
> // Shift the entire register. Prefix sum algorithm.
> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>
> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>
> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
>
> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
>
> // Clear inactive elements.
> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>
> // Set the inactive lane value to -1 and set the active lane to the target index.
> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
>
> Step 2: shuffle the source vector elements to the target vector
>
> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>
>
> The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.
>
> The following benchmarks are from panama-...
The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation.
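
For reference, the quoted steps can be modelled in plain scalar Java. This is only a minimal sketch of the index computation, not the HotSpot backend code: the class name `ExpandSketch` and both method names are hypothetical, lanes are held in an `int[]` rather than a vector register (so -1 stays -1 instead of wrapping within a byte lane), and the final lookup assumes `TBL` yields 0 for an out-of-range index.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExpandSketch {
    // Reference semantics of expand: the i-th active lane (counting from
    // lane 0) receives the i-th source element; inactive lanes are zero.
    public static byte[] expandReference(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;
        for (int lane = 0; lane < src.length; lane++) {
            if (mask[lane]) {
                dst[lane] = src[next++];
            }
        }
        return dst;
    }

    // Scalar model of the quoted steps (hypothetical helper, not JDK code).
    public static byte[] expandViaTbl(byte[] src, boolean[] mask) {
        int n = src.length;

        // Mask bits widened to one value per lane (tmp2 in the description).
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = mask[i] ? 1 : 0;
        }

        // Inclusive prefix sum: shift toward the higher lanes by 1, 2, 4, ...
        // lanes and add, mirroring the "<< 8", "<< 16", ... steps for bytes.
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] shifted = new int[n];
            System.arraycopy(idx, 0, shifted, shift, n - shift);
            for (int i = 0; i < n; i++) {
                idx[i] += shifted[i];
            }
        }

        // sel(mask, tmp2, 0) followed by "-= 1": active lanes hold the
        // zero-based source index, inactive lanes hold -1.
        for (int i = 0; i < n; i++) {
            idx[i] = (mask[i] ? idx[i] : 0) - 1;
        }

        // Table lookup; an out-of-range index is modelled as producing 0.
        byte[] dst = new byte[n];
        for (int i = 0; i < n; i++) {
            dst[i] = (idx[i] >= 0 && idx[i] < n) ? src[idx[i]] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        // Same input as the worked example: lane 0 is 'a', lane 15 is 'p'.
        byte[] src = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) {
            mask[i] = (i % 4) < 2; // lanes 0,1 active, lanes 2,3 inactive, ...
        }
        byte[] viaTbl = expandViaTbl(src, mask);
        System.out.println(Arrays.toString(viaTbl));
        System.out.println(Arrays.equals(viaTbl, expandReference(src, mask)));
    }
}
```

The arrays are indexed from lane 0 upward, so they read in the opposite order to the "high <== low" diagrams in the description.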
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3205780702