RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2]
erifan
duke at openjdk.org
Thu Sep 4 08:04:46 UTC 2025
On Wed, 3 Sep 2025 12:49:32 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>>
>> - Improve the comment of the vector expand implementation
>> - Merge branch 'master' into JDK-8363989
>> - 8363989: AArch64: Add missing backend support of VectorAPI expand operation
>>
>> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified
>> for 32-bit and 64-bit types only when SVE2 is available. In the following
>> cases, `expand` has not yet been intrinsified:
>> 1. **Subword types** on SVE2-capable hardware.
>> 2. **All types** on NEON and SVE1 environments.
>>
>> As a result, `expand` API performance is very poor in these scenarios.
>> This patch intrinsifies the `expand` operation in the above environments.
>>
>> Since there are no native instructions directly corresponding to `expand`
>> in these cases, this patch mainly leverages the `TBL` instruction to
>> implement `expand`. To compute the index input for `TBL`, the prefix sum
>> algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used.
>> Take a 128-bit byte vector on SVE2 as an example:
>> ```
>> To compute: dst = src.expand(mask)
>> Data direction: high <== low
>> Input:
>> src = p o n m l k j i h g f e d c b a
>> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>> Expected result:
>> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
>> ```
>> Step 1: calculate the index input of the TBL instruction.
>> ```
>> // Set tmp1 as all 0 vector.
>> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> // Move the mask bits from the predicate register to a vector register.
>> // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
>> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
>>
>> // Shift the entire register. Prefix sum algorithm.
>> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
>> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1
>>
>> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0
>> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
>>
>> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0
>> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 ...
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2819:
>
>> 2817: subv(dst, size, tmp2, tmp1);
>> 2818: // dst = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1
>> 2819: tbl(dst, size, src, 1, dst);
>
> It would make it a little easier to read the example if the numbers were aligned.
> Now the minus sign disrupts that a little. Maybe leave 2 spaces if the number is positive?
Makes sense. I'll update it in a follow-up commit.
> test/hotspot/jtreg/compiler/vectorapi/VectorExpandTest.java line 48:
>
>> 46: static final VectorSpecies<Float> F_SPECIES = FloatVector.SPECIES_MAX;
>> 47: static final VectorSpecies<Long> L_SPECIES = LongVector.SPECIES_MAX;
>> 48: static final VectorSpecies<Double> D_SPECIES = DoubleVector.SPECIES_MAX;
>
> Would it make sense to run these tests with various vector sizes?
> Because it seems your algorithm depends on `vector_length_in_bytes` in the prefix sum algo.
We already have correctness tests for `expand` on **all vector types** under `test/jdk/jdk/incubator/vector/`, such as https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5375, so this test primarily verifies that the expected IR is generated. I think that should be sufficient.
I've tested this PR locally on a 128-bit SVE2 machine, a 256-bit SVE machine, and a 512-bit QEMU environment, and all tests passed.
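To illustrate why the vector length only affects the number of shift-and-add steps, here is a minimal scalar sketch in Java of the prefix-sum plus table-lookup idea. It is a model for reasoning, not the code in this PR: the class and method names (`ExpandModel`, `expandReference`, `expandPrefixSum`) are hypothetical, lanes are modelled as array elements with lane 0 as the lowest lane, and the only hardware behaviour assumed is that an out-of-range `TBL` index produces zero.

```java
import java.util.Arrays;

final class ExpandModel {
    // Reference semantics: each set mask lane receives the next unused
    // source lane, counting from lane 0; unset lanes become zero.
    static byte[] expandReference(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[next++];
            }
        }
        return dst;
    }

    // Model of the prefix-sum approach: build per-lane table indices with
    // log2(lanes) shift-and-add steps, then do a table lookup.
    static byte[] expandPrefixSum(byte[] src, boolean[] mask) {
        int n = src.length;
        int[] sum = new int[n];
        for (int i = 0; i < n; i++) {
            sum[i] = mask[i] ? 1 : 0;
        }
        // Inclusive prefix sum: shift the whole "register" towards the high
        // lanes by 1, 2, 4, ... lanes and accumulate.
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] shifted = new int[n];
            for (int i = shift; i < n; i++) {
                shifted[i] = sum[i - shift];
            }
            for (int i = 0; i < n; i++) {
                sum[i] += shifted[i];
            }
        }
        byte[] dst = new byte[n];
        for (int i = 0; i < n; i++) {
            // Unset lanes get index -1, which is out of range for the table.
            int idx = mask[i] ? sum[i] - 1 : -1;
            // Assumption: an out-of-range index yields zero, as with TBL.
            dst[i] = (idx >= 0 && idx < n) ? src[idx] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        // 8, 16, 32 and 64 byte lanes model 64-, 128-, 256- and 512-bit vectors.
        for (int lanes : new int[] {8, 16, 32, 64}) {
            byte[] src = new byte[lanes];
            boolean[] mask = new boolean[lanes];
            for (int i = 0; i < lanes; i++) {
                src[i] = (byte) (i + 1);
                mask[i] = (i % 4) < 2;
            }
            boolean same = Arrays.equals(expandReference(src, mask),
                                         expandPrefixSum(src, mask));
            System.out.println(lanes + " lanes: " + (same ? "match" : "MISMATCH"));
        }
    }
}
```

The model uses `int` accumulators for simplicity, so it says nothing about lane-width overflow or the exact instruction sequence; it only shows that the same index computation applies at each vector length.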
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2321198368
PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2321194040