RFR: 8351623: VectorAPI: Refactor subword gather load and add SVE implementation
Xiaohong Gong
xgong at openjdk.org
Thu Apr 17 01:44:45 UTC 2025
On Wed, 16 Apr 2025 08:58:34 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
> ### Summary:
> [JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added HotSpot intrinsics for the subword gather load APIs on X86 platforms [1]. This patch implements the equivalent functionality for the AArch64 SVE platform. In addition to the AArch64 backend support, it also refactors the API implementation on the Java side and the compiler mid-end to make the operations more efficient and easier to maintain across architectures.
>
> ### Background:
> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices stored in an int array. SVE provides native vector gather load instructions for byte/short types that take their indices from an int vector (see [2][3]).
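As a rough illustration of the gather-load semantics described above, here is a scalar sketch in plain Java. It is not code from the patch: the class and method names (`GatherSketch`, `gather`) are hypothetical, and the parameter list only loosely mirrors the `fromArray(species, a, offset, indexMap, mapOffset)` shape of the Vector API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: scalar model of a gather load.
class GatherSketch {
    // Lane i of the result is arr[offset + indexMap[mapOffset + i]].
    static byte[] gather(byte[] arr, int offset, int[] indexMap, int mapOffset, int lanes) {
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = arr[offset + indexMap[mapOffset + i]];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1};
        byte[] out = gather(arr, 0, idx, 0, idx.length);
        // Lane 0 is printed first.
        System.out.println(new String(out, StandardCharsets.US_ASCII)); // prints "dceb"
    }
}
```

A hardware gather performs the loop body for all lanes of one index vector at once, which is why the number of index lanes bounds the number of elements loaded per instruction.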
>
> The number of loaded elements must match the index vector's element count. Since int elements are 4 and 2 times larger than byte and short elements respectively, and given `MaxVectorSize` constraints, the operation may need to be split into multiple parts.
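The splitting arithmetic can be sketched as follows. This is a minimal illustration under the assumption that one gather-load handles as many elements as there are int indices in a full vector register; the names `GatherSplitSketch`, `indicesPerRegister` and `gatherOps` are hypothetical and do not come from the patch.

```java
// Hypothetical helper, for illustration only: how many gather-loads a
// subword gather needs for a given MaxVectorSize (in bytes).
class GatherSplitSketch {
    // Number of int indices that fit in one vector register.
    static int indicesPerRegister(int maxVectorSize) {
        return maxVectorSize / Integer.BYTES;
    }

    // Number of gather-loads needed to cover `lanes` elements (rounded up).
    static int gatherOps(int lanes, int maxVectorSize) {
        int perOp = indicesPerRegister(maxVectorSize);
        return (lanes + perOp - 1) / perOp;
    }

    public static void main(String[] args) {
        int lanes = 16; // a 128-bit byte vector has 16 lanes
        System.out.println(gatherOps(lanes, 16)); // prints 4
        System.out.println(gatherOps(lanes, 32)); // prints 2
    }
}
```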
>
> Using a 128-bit byte vector gather load as an example, there are four scenarios with different `MaxVectorSize`:
>
> 1. `MaxVectorSize = 16, byte_vector_size = 16`:
> - Can load 4 indices per vector register
> - So can finish 4 bytes per gather-load operation
> - Requires 4 gather-loads and a final merge
> Example:
> ```
> byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
> int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>
> 4 gather-load:
> idx_v1 = [1 4 2 3] gather_v1 = [0000 0000 0000 becd]
> idx_v2 = [2 5 7 5] gather_v2 = [0000 0000 0000 cfhf]
> idx_v3 = [1 7 6 0] gather_v3 = [0000 0000 0000 bhga]
> idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
> merge: v = [jlkp bhga cfhf becd]
> ```
>
> 2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
> - Can load 8 indices per vector register
> - So can finish 8 bytes per gather-load operation
> - Requires 2 gather-loads and a merge
> Example:
> ```
> byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
> int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>
> 2 gather-load:
> idx_v1 = [2 5 7 5 1 4 2 3]
> idx_v2 = [9 11 10 15 1 7 6 0]
> gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
> gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
> merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
> ```
>
> 3. `MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4`:
> - Can load 16 indices per vector register
> - So can ...
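The split-and-merge flow of scenario 1 above can be simulated in scalar Java. This is only a sketch, not the SVE or C2 implementation: `GatherMergeSketch` and `gatherInParts` are hypothetical names, and the merge is modelled as copying each partial result into its lane range. Note that the quoted example writes lanes highest-first, while this sketch prints lane 0 first.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: scalar model of a subword
// gather that is split into several gather-loads and then merged.
class GatherMergeSketch {
    static byte[] gatherInParts(byte[] arr, int[] idx, int maxVectorSize) {
        int perOp = maxVectorSize / Integer.BYTES; // int indices per register
        byte[] merged = new byte[idx.length];
        for (int start = 0; start < idx.length; start += perOp) {
            int end = Math.min(start + perOp, idx.length);
            // One simulated gather-load: results land in the low lanes,
            // the remaining lanes stay zero.
            byte[] part = new byte[idx.length];
            for (int i = start; i < end; i++) {
                part[i - start] = arr[idx[i]];
            }
            // Merge: move the low lanes of this part to their final position.
            System.arraycopy(part, 0, merged, start, end - start);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        byte[] v = gatherInParts(arr, idx, 16); // MaxVectorSize = 16: 4 gather-loads
        System.out.println(new String(v, StandardCharsets.US_ASCII)); // prints "dcebfhfcaghbpklj"
    }
}
```

Read right to left in groups of four, the printed string matches the merged vector `[jlkp bhga cfhf becd]` from the example.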
Hi @jatin-bhateja , could you please take a look at this PR, especially the X86 part? Thanks a lot!
Hi @RealFYang , could you please help review the RVV part? Thanks a lot!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24679#issuecomment-2811506961