RFR: 8351623: VectorAPI: Refactor subword gather load and add SVE implementation

Emanuel Peter epeter at openjdk.org
Wed Apr 23 13:04:48 UTC 2025


On Thu, 17 Apr 2025 01:42:22 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> ### Summary:
>> [JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added HotSpot intrinsics for the subword gather load APIs on x86 platforms [1]. This patch implements the equivalent functionality for the AArch64 SVE platform. In addition to the AArch64 backend support, it also refactors the API implementation on the Java side and the compiler mid-end to make the operations more efficient and easier to maintain across architectures.
>> 
>> ### Background:
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices stored in an int array. SVE provides native vector gather load instructions for byte/short types that take the indices from an int vector (see [2][3]).
>> 
>> The number of loaded elements must match the index vector's element count. Since int elements are 4 or 2 times larger than byte or short elements, respectively, and given the `MaxVectorSize` constraint, the operation may need to be split into multiple parts.
>> 
>> Using a 128-bit byte vector gather load as an example, there are four scenarios with different `MaxVectorSize`:
>> 
>> 1. `MaxVectorSize = 16, byte_vector_size = 16`:
>>    - A vector register holds 4 indices
>>    - So each gather-load operation produces 4 bytes
>>    - Requires 4 gather-loads and a final merge
>>    Example:
>>    ```
>>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>> 
>>    4 gather-loads:
>>    idx_v1 = [1 4 2 3]    gather_v1 = [0000 0000 0000 becd]
>>    idx_v2 = [2 5 7 5]    gather_v2 = [0000 0000 0000 cfhf]
>>    idx_v3 = [1 7 6 0]    gather_v3 = [0000 0000 0000 bhga]
>>    idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
>>    merge: v = [jlkp bhga cfhf becd]
>>    ```
>> 
>> 2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
>>    - A vector register holds 8 indices
>>    - So each gather-load operation produces 8 bytes
>>    - Requires 2 gather-loads and a final merge
>>    Example:
>>    ```
>>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>> 
>>    2 gather-loads:
>>    idx_v1 = [2 5 7 5 1 4 2 3]
>>    idx_v2 = [9 11 10 15 1 7 6 0]
>>    gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
>>    gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
>>    merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
>>    ```
>> 
>> 3. `MaxVectorSize = 64, byte_v...
>
> Hi @jatin-bhateja, could you please take a look at this PR, especially the x86 part? Thanks a lot!
> Hi @RealFYang, could you please review the RVV part? Thanks a lot!
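
For readers following the quoted description, the splitting it describes can be modelled in plain scalar Java. This is only a sketch under stated assumptions, not the code in the PR: the class and method names (`GatherSketch`, `partsNeeded`, `gather`) are hypothetical, and lanes are listed lowest-first here, whereas the quoted examples print the highest lane on the left.

```java
import java.nio.charset.StandardCharsets;

// Scalar model of a byte gather load that is split into several partial
// gathers, each limited by how many int indices fit in one vector register.
// All names are hypothetical; this is not the HotSpot implementation.
final class GatherSketch {

    // Number of partial gathers: a register of maxVectorSize bytes holds
    // maxVectorSize / 4 int indices, so round up byteLanes / that count.
    static int partsNeeded(int byteLanes, int maxVectorSize) {
        int indicesPerRegister = maxVectorSize / Integer.BYTES;
        return (byteLanes + indicesPerRegister - 1) / indicesPerRegister;
    }

    // Gathers byteLanes elements, one register's worth of indices at a time.
    // Writing each part into its own slice of result stands in for the merge.
    static byte[] gather(byte[] arr, int base, int[] idx, int byteLanes, int maxVectorSize) {
        int indicesPerRegister = maxVectorSize / Integer.BYTES;
        byte[] result = new byte[byteLanes];
        for (int start = 0; start < byteLanes; start += indicesPerRegister) {
            int end = Math.min(start + indicesPerRegister, byteLanes);
            for (int lane = start; lane < end; lane++) {
                result[lane] = arr[base + idx[lane]];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data as the quoted examples.
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};

        System.out.println(partsNeeded(16, 16)); // 4 partial gathers
        System.out.println(partsNeeded(16, 32)); // 2 partial gathers
        byte[] v = gather(arr, 0, idx, 16, 16);
        System.out.println(new String(v, StandardCharsets.US_ASCII)); // dcebfhfcaghbpklj
    }
}
```

Read right to left, `dcebfhfcaghbpklj` is the `jlkp bhga cfhf becd` merge result from the first quoted scenario.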

@XiaohongGong I had a quick look at your changes and PR description. I wonder if you could split some of the refactoring into a separate PR? That would make it easier to review. Currently, the PR combines x64 changes, AArch64 changes, Java library changes, and C2 changes. That's a lot at once, and it would require review from many different people at the same time.

Splitting would make each part easier to review and mean less work for each reviewer. Everybody could look at a smaller change set, which I think would also improve the quality of the code after review.

What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24679#issuecomment-2824229233


More information about the hotspot-compiler-dev mailing list