RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation

Tue Sep 9 09:29:24 UTC 2025

On Wed, 20 Aug 2025 11:27:59 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:
>> 1. **Subword types** on SVE2-capable hardware.
>> 2. **All types** on NEON and SVE1 environments.
>> 
>> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
>> 
>> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:
>> 
>> To compute: dst = src.expand(mask)
>> Data direction: high <== low
>> Input:
>>   src                         = p o n m l k j i h g f e d c b a
>>   mask                        = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>> Expected result:
>>   dst                         = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>> 
>> Step 1: calculate the index input of the TBL instruction.
>> 
>> // Set tmp1 as all 0 vector.
>> tmp1                          = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 
>> // Move the mask bits from the predicate register to a vector register.
>> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
>> tmp2 = mask                   = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>> 
>> // Shift the entire register. Prefix sum algorithm.
>> dst = tmp2 << 8               = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
>> tmp2 += dst                   = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>> 
>> dst = tmp2 << 16              = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
>> tmp2 += dst                   = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>> 
>> dst = tmp2 << 32              = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
>> tmp2 += dst                   = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1
>> 
>> dst = tmp2 << 64              = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0
>> tmp2 += dst                   = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1
>> 
>> // Clear inactive elements.
>> dst = sel(mask, tmp2, tmp1)   = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>> 
>> // Set the inactive lane value to -1 and set the active lane to the target index.
>> dst -= 1                      = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0
>> 
>> Step 2: shuffle the source vector elements to the target vector
>> 
>> tbl(dst, src, dst)            = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>> 
>> 
>> The same algorithm is used for NEON and...
>
> The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation.

@theRealAph @e1iu could you please take another look at this PR? Thanks!
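
For readers following the quoted walk-through, here is a minimal scalar sketch of the same idea in plain Java. It is not the HotSpot backend code. The class and method names (`ExpandSketch`, `expandIndices`, `tbl`) are hypothetical, arrays stand in for vector registers with index 0 as the lowest lane, and `tbl` only models the "out-of-range index yields zero" behaviour that the walk-through relies on.

```java
// Scalar model of the quoted algorithm; names are illustrative only.
class ExpandSketch {
    // Step 1: build the TBL index vector from the mask.
    static byte[] expandIndices(boolean[] mask) {
        int n = mask.length;
        int[] sum = new int[n];
        for (int i = 0; i < n; i++) {
            sum[i] = mask[i] ? 1 : 0;          // 1-bit mask lane widened to a full lane
        }
        // Prefix sum by shift-and-add: lane i receives lane (i - shift).
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] shifted = new int[n];        // whole-register shift, zero filled
            for (int i = shift; i < n; i++) {
                shifted[i] = sum[i - shift];
            }
            for (int i = 0; i < n; i++) {
                sum[i] += shifted[i];
            }
        }
        byte[] idx = new byte[n];
        for (int i = 0; i < n; i++) {
            // Clear inactive lanes, then subtract 1: inactive lanes become -1,
            // active lanes become the zero-based source index.
            idx[i] = (byte) ((mask[i] ? sum[i] : 0) - 1);
        }
        return idx;
    }

    // Step 2: table lookup; indices outside the table produce 0.
    static byte[] tbl(byte[] table, byte[] idx) {
        byte[] out = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) {
            int k = idx[i] & 0xFF;             // -1 is seen as 255, which is out of range
            out[i] = k < table.length ? table[k] : 0;
        }
        return out;
    }

    static byte[] expand(byte[] src, boolean[] mask) {
        return tbl(src, expandIndices(mask));
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) {
            mask[i] = (i & 2) == 0;            // lanes 0,1 set, 2,3 clear, and so on
        }
        byte[] dst = expand(src, mask);
        StringBuilder sb = new StringBuilder();
        for (int i = 15; i >= 0; i--) {        // print high <== low, as in the walk-through
            sb.append(dst[i] == 0 ? "0" : String.valueOf((char) dst[i]));
            if (i > 0) sb.append(' ');
        }
        System.out.println(sb);                // prints: 0 0 h g 0 0 f e 0 0 d c 0 0 b a
    }
}
```

The loop over `shift` mirrors the four shift-and-add pairs (8, 16, 32 and 64 bits) in the quoted example, so a 16-lane vector needs log2(16) = 4 rounds regardless of the mask contents.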

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3269731241