RFR: 8351623: VectorAPI: Refactor subword gather load and add SVE implementation
Xiaohong Gong
xgong at openjdk.org
Tue Apr 22 01:44:40 UTC 2025
On Sun, 20 Apr 2025 03:28:48 GMT, SendaoYan <syan at openjdk.org> wrote:
>> ### Summary:
>> [JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added HotSpot intrinsics for the subword gather load APIs on x86 platforms [1]. This patch implements the equivalent functionality for AArch64 SVE platforms. In addition to the AArch64 backend support, it refactors the Java-side API implementation and the compiler mid-end to make the operations more efficient and easier to maintain across architectures.
>>
>> ### Background:
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices stored in an int array. SVE provides native vector gather load instructions for byte/short types that take the indices from an int vector (see [2][3]).
>>
>> The number of loaded elements must match the index vector's element count. Since int elements are 4/2 times larger than byte/short elements, and given `MaxVectorSize` constraints, the operation may need to be split into multiple parts.
>>
>> Using a 128-bit byte vector gather load as an example, there are four scenarios with different `MaxVectorSize`:
>>
>> 1. `MaxVectorSize = 16, byte_vector_size = 16`:
>> - Can load 4 indices per vector register
>> - So can finish 4 bytes per gather-load operation
>> - Requires 4 gather-loads and a final merge
>> Example:
>> ```
>> byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>> int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>>
>> 4 gather-loads:
>> idx_v1 = [1 4 2 3] gather_v1 = [0000 0000 0000 becd]
>> idx_v2 = [2 5 7 5] gather_v2 = [0000 0000 0000 cfhf]
>> idx_v3 = [1 7 6 0] gather_v3 = [0000 0000 0000 bhga]
>> idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
>> merge: v = [jlkp bhga cfhf becd]
>> ```
>>
>> 2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
>> - Can load 8 indices per vector register
>> - So can finish 8 bytes per gather-load operation
>> - Requires 2 gather-loads and a merge
>> Example:
>> ```
>> byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>> int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>>
>> 2 gather-loads:
>> idx_v1 = [2 5 7 5 1 4 2 3]
>> idx_v2 = [9 11 10 15 1 7 6 0]
>> gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
>> gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
>> merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
>> ```
>>
>> 3. `MaxVectorSize = 64, byte_v...
>
> test/hotspot/jtreg/compiler/vectorapi/VectorGatherSubwordTest.java line 39:
>
>> 37: * @modules jdk.incubator.vector
>> 38: *
>> 39: * @run driver compiler.vectorapi.VectorGatherSubwordTest
>
> Should we use `@run main` instead of `@run driver`?
Thanks for taking a look at this PR! I think it's fine using `@run main` instead.
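
As a side note for readers of the quoted description, the split-and-merge idea in scenario 1 can be modelled in plain scalar Java. This is only a sketch of the semantics, not the intrinsic or the patch's code: the class name `SubwordGatherSketch` and the helper `gatherInParts` are hypothetical, and the array is assumed to hold exactly the 16 bytes `a`..`p` shown in the example.

```java
import java.nio.charset.StandardCharsets;

public class SubwordGatherSketch {
    // Scalar model (hypothetical helper): gather `lanes` bytes, handling
    // `indicesPerStep` int indices per partial gather, then merge each
    // partial result into its slot of the final vector.
    static byte[] gatherInParts(byte[] arr, int base, int[] idx,
                                int lanes, int indicesPerStep) {
        byte[] result = new byte[lanes];
        for (int part = 0; part < lanes; part += indicesPerStep) {
            byte[] partial = new byte[indicesPerStep];
            for (int i = 0; i < indicesPerStep; i++) {
                partial[i] = arr[base + idx[part + i]];   // one gathered lane
            }
            // "merge": place the partial result at its lane offset
            System.arraycopy(partial, 0, result, part, indicesPerStep);
        }
        return result;
    }

    static String example() {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        int maxVectorSize = 16;                              // bytes, as in scenario 1
        int indicesPerStep = maxVectorSize / Integer.BYTES;  // 4 int indices per register
        byte[] v = gatherInParts(arr, 0, idx, 16, indicesPerStep);
        return new String(v, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Lane 0 is printed first, so this is the merged vector from the
        // description read right to left.
        System.out.println(example());
    }
}
```

Passing `indicesPerStep = 8` instead models scenario 2 (two partial gathers) and yields the same merged result.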
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24679#discussion_r2053187161