RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5]

Fri Sep 5 10:52:16 UTC 2025

On Mon, 4 Aug 2025 02:31:08 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> This is a follow-up patch to [1], which implements the subword gather load APIs for the AArch64 SVE platform.
>> 
>> ### Background
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
>> 
>> ### Implementation
>> 
>> #### Challenges
>> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
>> 
>> For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
>> - SPECIES_64: Single operation with mask (8 elements, 256-bit)
>> - SPECIES_128: Single operation, full register (16 elements, 512-bit)
>> - SPECIES_256: Two operations + merge (32 elements, 1024-bit)
>> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
>> 
>> Use `ByteVector.SPECIES_512` as an example:
>> - It contains 64 elements, so the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
>> - It requires 4 vector gather-loads to finish the whole operation.
>> 
>> ```
>> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
>> int[] idx = [0, 1, 2, 3, ..., 63, ...]
>> 
>> 4 gather-load:
>> idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
>> idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
>> idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
>> idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
>> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
>> ```
>> 
>> #### Solution
>> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
>> 
>> Here are the main changes:
>> - Enhanced IR generation with architecture-specific patterns based on the `gather_scatter_needs_vector_index()` matcher.
>> - Added `VectorSliceNode` for result mer...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
> 
>  - Merge 'jdk:master' into JDK-8351623-sve
>  - Address review comments
>  - Refine IR pattern and clean backend rules
>  - Fix indentation issue and move the helper matcher method to header files
>  - Merge branch jdk:master into JDK-8351623-sve
>  - 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Looks very interesting. I have a first series of questions / comments :)

There is definitely a tradeoff between complexity in the backend and complexity in the C2 IR, and I'm still trying to wrap my head around that decision. I'm just afraid that adding more very specific C2 IR nodes makes optimizations in the C2 IR more complicated.
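
To make the IR-side cost concrete, here is a minimal scalar sketch in plain Java of the splitting that the description walks through. It is only a model of my understanding, not the patch's code: the class and method names (`GatherSplitSketch`, `gatherOps`, `gatherPart`, `gather`) are hypothetical, and the "merge" is modelled as a simple array copy rather than the IR nodes the patch introduces.

```java
class GatherSplitSketch {
    // Gather indices are int elements, so each loaded element costs 32 index bits.
    static final int INDEX_BITS = 32;

    // Number of single-register gather-loads needed for elemNum subword elements,
    // given the SVE register size in bits (ceiling division).
    public static int gatherOps(int elemNum, int sveRegisterBits) {
        int indexVectorBits = elemNum * INDEX_BITS;
        return (indexVectorBits + sveRegisterBits - 1) / sveRegisterBits;
    }

    // Scalar model of one gather-load: part[i] = arr[base + idx[idxOffset + i]].
    public static byte[] gatherPart(byte[] arr, int base, int[] idx, int idxOffset, int lanes) {
        byte[] part = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            part[i] = arr[base + idx[idxOffset + i]];
        }
        return part;
    }

    // Scalar model of the whole operation: split the index vector into
    // register-sized pieces, gather each piece, then merge the partial results.
    public static byte[] gather(byte[] arr, int base, int[] idx, int elemNum, int sveRegisterBits) {
        int ops = gatherOps(elemNum, sveRegisterBits);
        int lanesPerOp = Math.min(elemNum, sveRegisterBits / INDEX_BITS);
        byte[] result = new byte[elemNum];
        for (int op = 0; op < ops; op++) {
            int offset = op * lanesPerOp;
            int lanes = Math.min(lanesPerOp, elemNum - offset);
            byte[] part = gatherPart(arr, base, idx, offset, lanes);
            System.arraycopy(part, 0, result, offset, lanes); // stand-in for the merge step
        }
        return result;
    }

    public static void main(String[] args) {
        for (int elemNum : new int[] {8, 16, 32, 64}) {
            System.out.println(elemNum + " byte elements -> "
                    + gatherOps(elemNum, 512) + " gather-load(s)");
        }
    }
}
```

For a 512-bit register this gives 1, 1, 2 and 4 operations for 8, 16, 32 and 64 `byte` elements, which matches the species list in the description. Each of those operations plus the merge has to be represented somewhere, either in the backend or in the IR.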

src/hotspot/cpu/aarch64/aarch64_vector.ad line 6008:

> 6006: // predicate and place in elements of twice their size within
> 6007: // the destination predicate.
> 6008: 

Suggestion:

unnecessary empty line

src/hotspot/share/opto/vectornode.hpp line 1123:

> 1121:   // The basic type of memory, which might be different with the vector element type
> 1122:   // when it is a subword type loading.
> 1123:   BasicType _mem_bt;

Can you give an example and add it to the comment?
Can you please also add a comment at the node about what we expect the index map to be? What basic type does it have?

src/hotspot/share/opto/vectornode.hpp line 1769:

> 1767: //      dst = [h g f e d c b a]
> 1768: //
> 1769: class VectorConcatenateNode : public VectorNode {

Those semantics are not quite what I would expect from `Concatenate`. Maybe we can call it something else?
`VectorConcatenateAndNarrowNode`?

src/hotspot/share/opto/vectornode.hpp line 1774:

> 1772:     : VectorNode(vec1, vec2, vt) {
> 1773:     assert(type2aelembytes(vec1->bottom_type()->is_vect()->element_basic_type()) ==
> 1774:            type2aelembytes(vt->element_basic_type()) * 2, "must be half size");

What about asserting that `vec1` and `vec2` have the same `vect`?
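
To spell out what I read from the comment and the assert, here is a scalar sketch in plain Java, under assumptions: I assume each input element is narrowed to half its size and that `vec1` fills the low lanes of `dst` while `vec2` fills the high lanes. The names `ConcatNarrowSketch` and `concatenateAndNarrow` are hypothetical, and `int` to `short` is just one possible element pairing. The length check stands in for the "same `vect`" assert suggested above.

```java
class ConcatNarrowSketch {
    // Assumed semantics: narrow every element of vec1 and vec2 to half size,
    // then place vec1 in the low lanes and vec2 in the high lanes of dst.
    //   vec1 = [d c b a], vec2 = [h g f e]  ->  dst = [h g f e d c b a]
    // (lane 0 is written rightmost in the picture, and is array index 0 here).
    public static short[] concatenateAndNarrow(int[] vec1, int[] vec2) {
        if (vec1.length != vec2.length) {
            // Stand-in for asserting that both inputs have the same vector type.
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        int n = vec1.length;
        short[] dst = new short[2 * n];
        for (int i = 0; i < n; i++) {
            dst[i] = (short) vec1[i];     // narrowing keeps the low 16 bits
            dst[n + i] = (short) vec2[i];
        }
        return dst;
    }
}
```

If that reading is right, the narrowing is as much a part of the node as the concatenation, which is why the plain `Concatenate` name surprised me.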

src/hotspot/share/opto/vectornode.hpp line 1841:

> 1839: 
> 1840: // Unpack the elements to twice size.
> 1841: class VectorMaskWidenNode : public VectorNode {

Can you add a visual example like above for `VectorConcatenateNode`, please?
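
Something along these lines is what I have in mind, written as a scalar sketch in plain Java. It is an assumption on my side: I model it on the SVE `PUNPKLO`/`PUNPKHI` behaviour that the `.ad` comment above seems to describe, where one half of the source predicate is unpacked into lanes of twice the size. Whether the node itself selects the low or the high half is a guess, and `MaskWidenSketch` and `widen` are hypothetical names.

```java
import java.util.Arrays;

class MaskWidenSketch {
    // Assumed semantics: take the low or high half of the source mask lanes;
    // each selected lane then occupies a lane of twice the size in the result,
    // so the result has half as many lanes as the source.
    //   src (8 narrow lanes) = [h g f e d c b a]   (lane 0 rightmost)
    //   low half  -> dst (4 wide lanes) = [d c b a]
    //   high half -> dst (4 wide lanes) = [h g f e]
    public static boolean[] widen(boolean[] src, boolean highHalf) {
        int half = src.length / 2;
        int from = highHalf ? half : 0;
        return Arrays.copyOfRange(src, from, from + half);
    }
}
```

A picture like the one in the comment, placed next to the class declaration, would make the lane mapping obvious without having to look up the instruction.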

-------------

PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3188813972
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324710079
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324736345
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324740007
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324741462
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324744990