RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2]

Thu Sep 4 08:04:46 UTC 2025

On Wed, 3 Sep 2025 12:49:32 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>> 
>>  - Improve the comment of the vector expand implementation
>>  - Merge branch 'master' into JDK-8363989
>>  - 8363989: AArch64: Add missing backend support of VectorAPI expand operation
>>    
>>    Currently, on AArch64, the VectorAPI `expand` operation is intrinsified
>>    for 32-bit and 64-bit types only when SVE2 is available. In the following
>>    cases, `expand` has not yet been intrinsified:
>>    1. **Subword types** on SVE2-capable hardware.
>>    2. **All types** on NEON and SVE1 environments.
>>    
>>    As a result, `expand` API performance is very poor in these scenarios.
>>    This patch intrinsifies the `expand` operation in the above environments.
>>    
>>    Since there are no native instructions directly corresponding to `expand`
>>    in these cases, this patch mainly leverages the `TBL` instruction to
>>    implement `expand`. To compute the index input for `TBL`, the prefix sum
>>    algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used.
>>    Take a 128-bit byte vector on SVE2 as an example:
>>    ```
>>    To compute: dst = src.expand(mask)
>>    Data direction: high <== low
>>    Input:
>>      src                         = p o n m l k j i h g f e d c b a
>>      mask                        = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>>    Expected result:
>>      dst                         = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>>    ```
>>    Step 1: calculate the index input of the TBL instruction.
>>    ```
>>    // Set tmp1 as all 0 vector.
>>    tmp1                          = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>    
>>    // Move the mask bits from the predicate register to a vector register.
>>    // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
>>    tmp2 = mask                   = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>>    
>>    // Shift the entire register. Prefix sum algorithm.
>>    dst = tmp2 << 8               = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
>>    tmp2 += dst                   = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>>    
>>    dst = tmp2 << 16              = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
>>    tmp2 += dst                   = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>>    
>>    dst = tmp2 << 32              = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
>>    tmp2 += dst                   = 4 4 4 4 4 4 4 4 4 4 ...
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2819:
> 
>> 2817:   subv(dst, size, tmp2, tmp1);
>> 2818:   // dst = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>> 2819:   tbl(dst, size, src, 1, dst);
> 
> It would make it a little easier to read the example if the numbers were aligned.
> Now the minus sign disrupts that a little. Maybe leave 2 spaces if the number is positive?

Makes sense. I'll update it in the next commit.

> test/hotspot/jtreg/compiler/vectorapi/VectorExpandTest.java line 48:
> 
>> 46:     static final VectorSpecies<Float> F_SPECIES = FloatVector.SPECIES_MAX;
>> 47:     static final VectorSpecies<Long> L_SPECIES = LongVector.SPECIES_MAX;
>> 48:     static final VectorSpecies<Double> D_SPECIES = DoubleVector.SPECIES_MAX;
> 
> Would it make sense to run these tests with various vector sizes?
> Because it seems your algorithm depends on `vector_length_in_bytes` in the prefix sum algo.

We already have correctness tests for `expand` on **all vector types** under `test/jdk/jdk/incubator/vector/`, such as https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5375, so this test primarily verifies that the expected IR is generated. Given that, I think the current coverage is sufficient.

I've tested this PR locally on a 128-bit SVE2 machine, a 256-bit SVE machine, and a 512-bit QEMU environment, and all tests passed.
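
To illustrate where the vector length enters the prefix sum, here is a minimal scalar sketch in Java of the prefix sum plus table lookup idea. It is only a model under stated assumptions, not the HotSpot code: the class `ExpandModel` and its methods are hypothetical names, byte lanes are modeled as array elements with lane 0 as the lowest lane, and an out-of-range table index is assumed to yield zero, as `TBL` does. The number of shift-and-add steps grows with the logarithm of the lane count, which is the part that depends on the vector length.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical scalar model, for illustration only.
final class ExpandModel {

    // Reference semantics: each active lane takes the next unused element
    // of src, starting from lane 0; inactive lanes are zero.
    static byte[] expandReference(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[next++];
            }
        }
        return dst;
    }

    // Model of the prefix sum + table lookup approach.
    static byte[] expandViaPrefixSum(byte[] src, boolean[] mask) {
        int n = src.length;
        int[] sum = new int[n];
        for (int i = 0; i < n; i++) {
            sum[i] = mask[i] ? 1 : 0;
        }
        // Inclusive prefix sum: shift the whole vector towards the higher
        // lanes by 1, 2, 4, ... lanes and add. The step count depends on n.
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] shifted = new int[n];
            for (int i = shift; i < n; i++) {
                shifted[i] = sum[i - shift];
            }
            for (int i = 0; i < n; i++) {
                sum[i] += shifted[i];
            }
        }
        // Active lanes look up src[sum - 1]; inactive lanes get an
        // out-of-range index, which the table lookup turns into zero.
        byte[] dst = new byte[n];
        for (int i = 0; i < n; i++) {
            int index = mask[i] ? sum[i] - 1 : -1;
            dst[i] = (index >= 0 && index < n) ? src[index] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        // 16, 32 and 64 byte lanes model 128-, 256- and 512-bit vectors.
        for (int lanes : new int[] {16, 32, 64}) {
            byte[] src = new byte[lanes];
            boolean[] mask = new boolean[lanes];
            random.nextBytes(src);
            for (int i = 0; i < lanes; i++) {
                mask[i] = random.nextBoolean();
            }
            byte[] expected = expandReference(src, mask);
            byte[] actual = expandViaPrefixSum(src, mask);
            if (!Arrays.equals(expected, actual)) {
                throw new AssertionError("mismatch for " + lanes + " lanes");
            }
            System.out.println(lanes + " lanes: results match");
        }
    }
}
```

The model only checks the arithmetic of the index computation for several lane counts. It says nothing about the generated IR or instructions, which is what `VectorExpandTest.java` covers.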

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2321198368
PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2321194040