RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation
erifan
duke at openjdk.org
Tue Sep 9 09:29:24 UTC 2025
On Wed, 20 Aug 2025 11:27:59 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
>> 1. **Subword types** on SVE2-capable hardware.
>> 2. **All types** on NEON and SVE1 environments.
>>
>> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
>>
>> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
>>
>> To compute: dst = src.expand(mask)
>> Data direction: high <== low
>> Input:
>> src = p o n m l k j i h g f e d c b a
>> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>> Expected result:
>> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>>
>> Step 1: calculate the index input of the TBL instruction.
>>
>> // Set tmp1 as all 0 vector.
>> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> // Move the mask bits from the predicate register to a vector register.
>> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
>> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>>
>> // Shift the entire register. Prefix sum algorithm.
>> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
>> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>>
>> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
>> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>>
>> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
>> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
>>
>> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
>> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
>>
>> // Clear inactive elements.
>> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>>
>> // Set the inactive lane value to -1 and set the active lane to the target index.
>> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
>>
>> Step 2: shuffle the source vector elements to the target vector
>>
>> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>>
>>
>> The same algorithm is used for NEON and...
>
> The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation.
@theRealAph @e1iu could you help take another look at this PR? Thanks!
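
For reference while reviewing, the quoted algorithm can be modelled lane by lane in plain Java. This is only a scalar sketch for checking the index arithmetic, not the HotSpot implementation: the names `ExpandSketch`, `tblIndices`, `tbl` and `expand` are hypothetical, lane 0 is the lowest lane, and it assumes a small lane count such as the 16 byte lanes in the example, so that the running sum fits in a byte and an index of -1 (255 unsigned) is always out of range for the table lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Scalar model of the quoted algorithm (hypothetical names, illustration only).
// Arrays are indexed from the lowest lane, so index 0 is the rightmost column
// in the "high <== low" diagrams.
final class ExpandSketch {

    // Step 1: compute the TBL index vector from the mask with a log-step prefix sum.
    static byte[] tblIndices(boolean[] mask) {
        int n = mask.length;
        byte[] sum = new byte[n];
        for (int i = 0; i < n; i++) {
            sum[i] = (byte) (mask[i] ? 1 : 0);   // 1-bit predicate lane -> 8-bit lane
        }
        // Shift the whole register by 1, 2, 4, ... lanes and accumulate.
        for (int shift = 1; shift < n; shift <<= 1) {
            byte[] shifted = new byte[n];        // lanes shifted in at the low end are 0
            System.arraycopy(sum, 0, shifted, shift, n - shift);
            for (int i = 0; i < n; i++) {
                sum[i] += shifted[i];
            }
        }
        // Clear inactive lanes, then subtract 1: active lanes hold the source
        // index, inactive lanes hold -1.
        byte[] idx = new byte[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (byte) ((mask[i] ? sum[i] : 0) - 1);
        }
        return idx;
    }

    // Step 2: table lookup in the style of TBL, where an out-of-range index yields 0.
    static byte[] tbl(byte[] table, byte[] idx) {
        byte[] dst = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) {
            int k = idx[i] & 0xFF;               // treat the index as unsigned: -1 -> 255
            dst[i] = k < table.length ? table[k] : 0;
        }
        return dst;
    }

    static byte[] expand(byte[] src, boolean[] mask) {
        return tbl(src, tblIndices(mask));
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) {
            mask[i] = (i % 4) < 2;               // the mask from the quoted example
        }
        System.out.println(Arrays.toString(tblIndices(mask)));
        System.out.println(Arrays.toString(expand(src, mask)));
    }
}
```

Running `main` with the source and mask from the quoted example prints the index vector and the expanded result from the lowest lane upward, so it can be compared against the last two rows of the walkthrough read right to left.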
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3269731241