RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3]

Wed Jul 30 15:01:02 UTC 2025

On Fri, 25 Jul 2025 03:26:36 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform.
>> 
>> ### Background
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
>> 
>> ### Implementation
>> 
>> #### Challenges
>> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
>> 
>> For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
>> - SPECIES_64: Single operation with mask (8 elements, 256-bit)
>> - SPECIES_128: Single operation, full register (16 elements, 512-bit)
>> - SPECIES_256: Two operations + merge (32 elements, 1024-bit)
>> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
>> 
>> Use `ByteVector.SPECIES_512` as an example:
>> - It contains 64 elements, so the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
>> - It requires 4 vector gather-loads to finish the whole operation.
>> 
>> 
>> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
>> int[] idx = [0, 1, 2, 3, ..., 63, ...]
>> 
>> 4 gather-loads:
>> idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
>> idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
>> idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
>> idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
>> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
>> 
>> 
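
To double-check my reading of the split described above, here is a minimal scalar sketch in plain Java. It is not the HotSpot or Vector API implementation. The class and method names (`SubwordGatherSketch`, `gatherOps`, `gather`) are hypothetical, and it assumes that one gather operation consumes exactly one register of `int` indices (`vectorBits / 32` lanes).

```java
import java.nio.charset.StandardCharsets;

// Scalar model only: all names here are hypothetical, not JDK or HotSpot APIs.
final class SubwordGatherSketch {
    // Number of int-indexed gather operations needed to load `elemNum`
    // subword elements, assuming one operation handles `vectorBits / 32` lanes.
    static int gatherOps(int elemNum, int vectorBits) {
        int lanesPerOp = vectorBits / Integer.SIZE;
        return (elemNum + lanesPerOp - 1) / lanesPerOp; // ceiling division
    }

    // Each pass models one gather-load over one register's worth of indices.
    // Writing its results at the matching offset of `result` models the merge.
    static byte[] gather(byte[] arr, int[] idx, int elemNum, int vectorBits) {
        int lanesPerOp = vectorBits / Integer.SIZE;
        int ops = gatherOps(elemNum, vectorBits);
        byte[] result = new byte[elemNum];
        for (int part = 0; part < ops; part++) {
            int start = part * lanesPerOp;
            int end = Math.min(start + lanesPerOp, elemNum);
            for (int lane = start; lane < end; lane++) {
                result[lane] = arr[idx[lane]];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same layout as the diagram: 16 x 'a', 16 x 'b', 16 x 'c', 16 x 'd'.
        byte[] arr = new byte[64];
        int[] idx = new int[64];
        for (int i = 0; i < 64; i++) {
            arr[i] = (byte) ('a' + i / 16);
            idx[i] = i;
        }
        System.out.println(gatherOps(64, 512));
        System.out.println(new String(gather(arr, idx, 64, 512), StandardCharsets.US_ASCII));
    }
}
```

With a 512-bit register, `gatherOps` gives 1, 1, 2 and 4 for 8, 16, 32 and 64 `byte` elements, which matches the species list in the description.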
>> #### Solution
>> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
>> 
>> Here are the main changes:
>> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher.
>> - Added `VectorSliceNode` for result mer...
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Refine IR pattern and clean backend rules

Thanks for updating it!

I've submitted a test on a 256-bit SVE machine. I'll get back to you once it’s finished.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1176:

> 1174: }
> 1175: 
> 1176: // Generate a vector mask by casting the input mask from "byte|short" type to "int" type for vector

It seems that we're not doing "casting" here.
Suggestion:

// Widen the input mask "in" from "byte|short" to "int" for use in vector gather loads.
// The "part" parameter selects which segment of the original mask to extend.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1186:

> 1184:     assert(part < 4, "must be");
> 1185:     const TypeVect* temp_vt = TypeVect::makemask(T_SHORT, vt->length() * 2);
> 1186:     // If part == 0, the elements of the lowest 1/4 part are extended.

Suggestion:

    // If part == 0, extend elements from the lowest 1/4 of the input.
    // If part == 1, extend elements from the second 1/4.
    // If part == 2, extend elements from the third 1/4.
    // If part == 3, extend elements from the highest 1/4.
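
For reference, the behaviour that both suggested comments describe can be modelled in plain Java as below. This is only a scalar sketch of the intended semantics, not the C2 code. `MaskWidenSketch` and `widenPart` are hypothetical names, and the model assumes a mask is one boolean per lane, so widening a `byte` mask to `int` lanes keeps one quarter of the input per part (`parts == 4`), while a `short` mask keeps one half (`parts == 2`).

```java
import java.util.Arrays;

// Scalar model only: names are hypothetical, not HotSpot APIs.
final class MaskWidenSketch {
    // Returns the segment of `in` selected by `part`, one boolean per
    // widened lane. Part 0 is the lowest segment, part `parts - 1` the highest.
    static boolean[] widenPart(boolean[] in, int part, int parts) {
        if (parts <= 0 || in.length % parts != 0) {
            throw new IllegalArgumentException("mask length must be a multiple of parts");
        }
        if (part < 0 || part >= parts) {
            throw new IllegalArgumentException("part out of range");
        }
        int len = in.length / parts;
        return Arrays.copyOfRange(in, part * len, (part + 1) * len);
    }

    public static void main(String[] args) {
        boolean[] byteMask = {true, false, false, true, true, true, false, false};
        // byte -> int: four parts, two lanes each in this small example.
        for (int part = 0; part < 4; part++) {
            System.out.println(part + ": " + Arrays.toString(widenPart(byteMask, part, 4)));
        }
    }
}
```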

-------------

PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3071808794
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2242854832
PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2242873220