RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v4]
erifan
duke at openjdk.org
Mon Sep 22 01:53:20 UTC 2025
On Mon, 15 Sep 2025 05:55:43 GMT, erifan <duke at openjdk.org> wrote:
>> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
>> 1. **Subword types** on SVE2-capable hardware.
>> 2. **All types** on NEON and SVE1 environments.
>>
>> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
>>
>> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
>>
>> To compute: dst = src.expand(mask)
>> Data direction: high <== low
>> Input:
>> src = p o n m l k j i h g f e d c b a
>> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>> Expected result:
>> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>>
>> Step 1: calculate the index input of the TBL instruction.
>>
>> // Set tmp1 to an all-zero vector.
>> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> // Move the mask bits from the predicate register to a vector register.
>> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
>> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>>
>> // Shift the entire register. Prefix sum algorithm.
>> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
>> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>>
>> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
>> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>>
>> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
>> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
>>
>> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
>> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
>>
>> // Clear inactive elements.
>> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>>
>> // Subtract 1: inactive lanes become -1 (an out-of-range TBL index, which yields 0) and active lanes hold their zero-based source index.
>> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
>>
>> Step 2: shuffle the source vector elements into the target vector.
>>
>> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>>
>>
>> The same algorithm is used for NEON and...
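
For readers following the quoted walkthrough, the steps can be modelled in scalar Java roughly as below. This is only a sketch and not the HotSpot implementation: the class and method names (`ExpandModel`, `expandReference`, `expandViaPrefixSum`) are made up for illustration, array index 0 stands for the lowest lane, and the `TBL` lookup is modelled as "out-of-range index yields 0".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical scalar model of the quoted algorithm; not the actual backend code.
final class ExpandModel {

    // Reference semantics of expand: the i-th active lane (counting from the
    // lowest lane) receives the i-th source element; inactive lanes are zero.
    static byte[] expandReference(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[next++];
            }
        }
        return dst;
    }

    // Models Step 1 (log-step prefix sum, select, subtract 1) and Step 2 (TBL).
    static byte[] expandViaPrefixSum(byte[] src, boolean[] mask) {
        int n = src.length;
        int[] sum = new int[n];
        for (int i = 0; i < n; i++) {
            sum[i] = mask[i] ? 1 : 0;          // mask bit widened to a full lane
        }
        // Shift the whole vector by 1, 2, 4, ... lanes towards the high end and add.
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] shifted = new int[n];        // vacated low lanes stay zero
            for (int i = shift; i < n; i++) {
                shifted[i] = sum[i - shift];
            }
            for (int i = 0; i < n; i++) {
                sum[i] += shifted[i];
            }
        }
        byte[] dst = new byte[n];
        for (int i = 0; i < n; i++) {
            int index = (mask[i] ? sum[i] : 0) - 1;   // inactive lanes become -1
            // Modelled table lookup: an out-of-range index produces 0.
            dst[i] = (index >= 0 && index < n) ? src[index] : 0;
        }
        return dst;
    }

    // Formats lanes from high to low, matching the "high <== low" layout above.
    static String format(byte[] v) {
        StringBuilder sb = new StringBuilder();
        for (int i = v.length - 1; i >= 0; i--) {
            sb.append(v[i] == 0 ? '0' : (char) v[i]);
            if (i > 0) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        boolean[] mask = new boolean[16];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (i & 2) == 0;            // lanes 0,1,4,5,8,9,12,13 active
        }
        byte[] viaTbl = expandViaPrefixSum(src, mask);
        System.out.println(format(viaTbl));    // prints: 0 0 h g 0 0 f e 0 0 d c 0 0 b a
        System.out.println(Arrays.equals(viaTbl, expandReference(src, mask)));  // prints: true
    }
}
```

The log-step loop needs four shift-and-add rounds for 16 byte lanes (shifts of 1, 2, 4 and 8 lanes), which corresponds to the `<< 8`, `<< 16`, `<< 32` and `<< 64` steps in the quoted example.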
>
> erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
>
> - Merge branch 'master' into JDK-8363989
> - Align code example data for better reading
> - Merge branch 'master' into JDK-8363989
> - Improve the comment of the vector expand implementation
> - Merge branch 'master' into JDK-8363989
> - 8363989: AArch64: Add missing backend support of VectorAPI expand operation
Thanks all for your help!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3316494033