RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6]
Emanuel Peter
epeter at openjdk.org
Thu Sep 18 12:30:52 UTC 2025
On Wed, 17 Sep 2025 08:48:16 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform.
>>
>> ### Background
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
>>
>> ### Implementation
>>
>> #### Challenges
>> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
>>
>> For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
>> - SPECIES_64: Single operation with mask (8 elements, 256-bit)
>> - SPECIES_128: Single operation, full register (16 elements, 512-bit)
>> - SPECIES_256: Two operations + merge (32 elements, 1024-bit)
>> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
>>
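The species table above follows from the `32 * elem_num` rule. Here is a minimal sketch of that arithmetic, assuming a 512-bit register; the class and method names (`GatherSplit`, `indexVectorBits`, `gatherOps`, `needsMask`) are hypothetical and do not come from the patch.

```java
class GatherSplit {
    // Size in bits of the int index vector needed for elemNum loaded elements.
    static int indexVectorBits(int elemNum) {
        return Integer.SIZE * elemNum;
    }

    // Number of gather-loads when one vector register holds regBits bits (rounded up).
    static int gatherOps(int elemNum, int regBits) {
        return (indexVectorBits(elemNum) + regBits - 1) / regBits;
    }

    // True when the indices fill only part of one register, so a mask is needed.
    static boolean needsMask(int elemNum, int regBits) {
        return indexVectorBits(elemNum) < regBits;
    }

    public static void main(String[] args) {
        int regBits = 512; // assumed SVE vector register size
        for (int speciesBits : new int[] {64, 128, 256, 512}) {
            int elemNum = speciesBits / Byte.SIZE; // byte lanes in this species
            System.out.printf("SPECIES_%d: %d elements, %d index bits, %d op(s), mask=%b%n",
                    speciesBits, elemNum, indexVectorBits(elemNum),
                    gatherOps(elemNum, regBits), needsMask(elemNum, regBits));
        }
    }
}
```

This only models the counting; it says nothing about how the compiler actually selects or emits the operations.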
>> Use `ByteVector.SPECIES_512` as an example:
>> - It contains 64 elements, so the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
>> - It requires 4 vector gather-loads to finish the whole operation.
>>
>>
>> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
>> int[] idx = [0, 1, 2, 3, ..., 63, ...]
>>
>> 4 gather-loads:
>> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
>> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
>> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
>> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
>> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
>>
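The four gather-loads and the merge in the example above can be modelled in plain scalar Java. This is a rough sketch under stated assumptions, not the SVE code or the C2 IR from the patch: the class `SubwordGatherSketch` and its methods are hypothetical, registers are modelled as 64-element `byte` arrays, and each partial gather fills only the low 16 lanes and leaves the rest zero.

```java
class SubwordGatherSketch {
    static final int LANES = 64;        // byte lanes in ByteVector.SPECIES_512
    static final int INDEX_LANES = 16;  // int indices that fit in one 512-bit register

    // One gather-load: reads INDEX_LANES bytes into the low lanes, upper lanes stay zero.
    static byte[] gatherPart(byte[] arr, int[] idx, int part) {
        byte[] reg = new byte[LANES];
        for (int i = 0; i < INDEX_LANES; i++) {
            reg[i] = arr[idx[part * INDEX_LANES + i]];
        }
        return reg;
    }

    // Full operation: four partial gathers, each merged into its slot of the result.
    static byte[] gather(byte[] arr, int[] idx) {
        byte[] result = new byte[LANES];
        for (int part = 0; part < LANES / INDEX_LANES; part++) {
            byte[] reg = gatherPart(arr, idx, part);
            System.arraycopy(reg, 0, result, part * INDEX_LANES, INDEX_LANES);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[LANES];
        int[] idx = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            arr[i] = (byte) ('a' + i / INDEX_LANES); // 16 x a, 16 x b, 16 x c, 16 x d
            idx[i] = i;
        }
        System.out.println(new String(gather(arr, idx)));
    }
}
```

With the identity index the merged result equals the source array, which matches the `merge` line of the example when read from lane 0 upward.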
>>
>> #### Solution
>> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
>>
>> Here are the main changes:
>> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher.
>> - Added `VectorSliceNode` for result mer...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>
> - Add more comments for IRs and added method
> - Merge branch 'jdk:master' into JDK-8351623-sve
> - Merge 'jdk:master' into JDK-8351623-sve
> - Address review comments
> - Refine IR pattern and clean backend rules
> - Fix indentation issue and move the helper matcher method to header files
> - Merge branch jdk:master into JDK-8351623-sve
> - 8351623: VectorAPI: Add SVE implementation of subword gather load operation
@XiaohongGong I'm going to be away on vacation for about 3 weeks now. So I won't be able to continue with the review until I'm back.
Maybe @vnkozlov or @iwanowww can review instead. Maybe @PaulSandoz or @jatin-bhateja would like to look at it too. If they do, I would want them to consider whether the approach with the special vector nodes `VectorConcatenateAndNarrow` and `VectorMaskWiden` is really desirable. The complexity needs to go somewhere, but I'm not sure whether it is better placed in the C2 IR or in the backend.
In this PR, there are already threads about this [here](https://github.com/openjdk/jdk/pull/26236#discussion_r2324740007) and [here](https://github.com/openjdk/jdk/pull/26236#discussion_r2324744990).
-------------
PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3239353455