RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6]

Tue Oct 14 18:17:47 UTC 2025

On Wed, 17 Sep 2025 08:48:16 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform.
>> 
>> ### Background
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
>> 
>> ### Implementation
>> 
>> #### Challenges
>> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
>> 
>> For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
>> - SPECIES_64: Single operation with mask (8 elements, 256-bit)
>> - SPECIES_128: Single operation, full register (16 elements, 512-bit)
>> - SPECIES_256: Two operations + merge (32 elements, 1024-bit)
>> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
>> 
>> Use `ByteVector.SPECIES_512` as an example:
>> - It contains 64 elements, so the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
>> - It requires 4 vector gather-loads to finish the whole operation.
>> 
>> 
>> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
>> int[] idx = [0, 1, 2, 3, ..., 63, ...]
>> 
>> 4 gather-load:
>> idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
>> idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
>> idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
>> idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
>> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
>> 
>> 
>> #### Solution
>> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
>> 
>> Here are the main changes:
>> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher.
>> - Added `VectorSliceNode` for result mer...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
> 
>  - Add more comments for IRs and added method
>  - Merge branch 'jdk:master' into JDK-8351623-sve
>  - Merge 'jdk:master' into JDK-8351623-sve
>  - Address review comments
>  - Refine IR pattern and clean backend rules
>  - Fix indentation issue and move the helper matcher method to header files
>  - Merge branch jdk:master into JDK-8351623-sve
>  - 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Hi @iwanowww , @PaulSandoz , @eme64 ,

Hope you’re doing well!

I’ve created a prototype that moves the implementation to the Java API level, as suggested (see: https://github.com/XiaohongGong/jdk/pull/8). This refactoring has resulted in significantly cleaner and more maintainable code. Thanks for your insightful feedback @iwanowww !

However, it also introduces new issues that we have to consider. The generated code might **not be optimal**, and producing the optimal instruction sequence will take more effort.

The details are as follows:

1) We need a new API that shifts the lanes of a vector mask across lanes, so that different pieces of the mask can be extracted when the whole gather operation needs to be split. Since `Vector.slice()` already implements this for vectors, I added a similar method for `VectorMask`.

    There are two new issues that I need to address for this API:
     - SVE lacks a native instruction for such a mask operation. I have to convert the mask to a vector, call `Vector.slice()`, and then convert the result back to a mask. Please note that the whole process is **not SVE friendly**. The performance of such an API will have a large gap on SVE compared with other architectures.
     - To generate an optimal SVE instruction sequence, I have to do further IR transformation and optimize the pattern with a match rule. I'm not sure whether the optimization will be common enough to be accepted in the future.

    Do you have a better idea for this newly added API? I'd like to avoid adding an API that performs poorly, especially one that might not be used frequently in the real world.
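
To make the intended semantics concrete, here is a minimal scalar sketch of what such a mask slice would compute. It is only a model: plain `boolean` arrays stand in for mask lanes, `MaskSliceSketch` and its `slice` helper are hypothetical names used for illustration, and it assumes the same zero-fill behaviour as the one-argument `Vector.slice(origin)` (lanes past the end become unset).

```java
import java.util.Arrays;

// Scalar model only: plain boolean arrays stand in for VectorMask lanes.
// The class and method names are hypothetical, not part of any JDK API.
final class MaskSliceSketch {
    // Lane i of the result takes lane (origin + i) of the input.
    // Lanes that would read past the end of the input are false.
    static boolean[] slice(boolean[] mask, int origin) {
        if (origin < 0 || origin > mask.length) {
            throw new IndexOutOfBoundsException("origin: " + origin);
        }
        boolean[] result = new boolean[mask.length]; // defaults to all false
        System.arraycopy(mask, origin, result, 0, mask.length - origin);
        return result;
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, true, false, false, true, false};
        // Move the upper half of the mask into the low lanes, as a split
        // gather would need for its second part.
        System.out.println(Arrays.toString(slice(mask, 4)));
    }
}
```

In a split gather, each part would use `slice(mask, part * lanesPerPart)` and only look at the low lanes of the result.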

2) To make the interface uniform across platforms, each API is defined with the same vector type as the target result, even though we need to split and merge internally. However, as the SVE gather-load instruction works with the `int` vector type, we need special handling at the compiler IR level.

   I'd like to extend `LoadVectorGather{,Masked}` with `mem_bt` to handle subword loads, adjust the mask with a cast/resize before the load, and append a vector cast/reinterpret after it. Splitting into simple IRs makes further IR-level optimization possible. This might make the compiler IRs differ across platforms, as they do in the current PR, so the compiler change might not be so clean. Does this make sense to you?
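
For reference, the split, widen, narrow and merge flow can be modelled in scalar Java as below. This is a sketch under stated assumptions, not the proposed implementation: arrays stand in for vector registers, the register size is fixed at 512 bits as in the PR description's example, and `SubwordGatherSketch`, `parts`, `gatherPart` and `gather` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Scalar model only: arrays stand in for vector registers; no Vector API calls.
// All names here are hypothetical and chosen for illustration.
final class SubwordGatherSketch {
    static final int REG_BITS = 512;                      // assumed SVE register size
    static final int INT_LANES = REG_BITS / Integer.SIZE; // 16 int indices per gather

    // Number of gather operations needed for `lanes` subword elements.
    static int parts(int lanes) {
        return (lanes + INT_LANES - 1) / INT_LANES;
    }

    // One gather: loads up to INT_LANES bytes, each zero-extended into an int lane.
    static int[] gatherPart(byte[] base, int[] idx, int from, int count) {
        int[] wide = new int[INT_LANES];
        for (int i = 0; i < count; i++) {
            wide[i] = base[idx[from + i]] & 0xFF;
        }
        return wide;
    }

    // Whole operation: split the indices, gather each part, narrow the int lanes
    // back to bytes, and merge each part at offset part * INT_LANES.
    static byte[] gather(byte[] base, int[] idx, int lanes) {
        byte[] result = new byte[lanes];
        for (int p = 0; p < parts(lanes); p++) {
            int from = p * INT_LANES;
            int count = Math.min(INT_LANES, lanes - from);
            int[] wide = gatherPart(base, idx, from, count);
            for (int i = 0; i < count; i++) {
                result[from + i] = (byte) wide[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[64];
        int[] idx = new int[64];
        for (int i = 0; i < 64; i++) {
            arr[i] = (byte) ('a' + i / 16); // 16 x 'a', 16 x 'b', 16 x 'c', 16 x 'd'
            idx[i] = i;
        }
        // Gather counts for 8, 16, 32 and 64 byte lanes on the assumed register size.
        System.out.println(parts(8) + " " + parts(16) + " " + parts(32) + " " + parts(64));
        System.out.println(new String(gather(arr, idx, 64), StandardCharsets.US_ASCII));
    }
}
```

The narrowing loop corresponds to the cast/reinterpret step after each gather, and the write at `from + i` corresponds to the merge.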

3) Further compiler optimization is needed to eliminate inefficient instructions. This requires a combination of IR transformations and match rules. I think this might be more complex, and the result is not guaranteed yet; it needs further implementation work.

In summary, the implementation of this API itself is clean, but it introduces more overhead, especially on SVE. It is hard for me to conclude whether the Java-level change is a win or not. Any suggestions?

Thanks,
Xiaohong

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3400397763