RFR: 8351623: VectorAPI: Refactor subword gather load and add SVE implementation
Xiaohong Gong
xgong at openjdk.org
Wed Apr 16 09:03:43 UTC 2025
### Summary:
[JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added HotSpot intrinsics for the subword gather load APIs on X86 platforms [1]. This patch implements the equivalent functionality for the AArch64 SVE platform. In addition to the AArch64 backend support, it refactors the Java-side API implementation and the compiler mid-end to make the operations more efficient and easier to maintain across architectures.
### Background:
Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices stored in an int array. SVE provides native vector gather load instructions for byte/short types that take the indices in an int vector (see [2][3]).
The number of loaded elements must match the index vector's element count. Since int elements are 4/2 times larger than byte/short elements, and given the `MaxVectorSize` constraint, the operation may need to be split into multiple parts.
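The splitting rule above can be sketched as a small calculation. This is only an illustration under stated assumptions: the class and method names are hypothetical and do not come from the patch, and all sizes are given in bytes.

```java
// Sketch only: hypothetical names, not HotSpot code.
final class GatherSplit {
    /** Number of int indices that fit in one vector register (an int is 4 bytes). */
    static int indicesPerRegister(int maxVectorSizeBytes) {
        return maxVectorSizeBytes / Integer.BYTES;
    }

    /** Number of gather-loads needed to fill one subword vector. */
    static int gatherLoadCount(int maxVectorSizeBytes, int vectorSizeBytes, int elementSizeBytes) {
        int lanes = vectorSizeBytes / elementSizeBytes;
        int perLoad = indicesPerRegister(maxVectorSizeBytes);
        return (lanes + perLoad - 1) / perLoad; // ceiling division
    }

    public static void main(String[] args) {
        // A 128-bit (16-byte) byte vector under different MaxVectorSize values.
        for (int maxVectorSize : new int[] {16, 32, 64, 128}) {
            System.out.println("MaxVectorSize=" + maxVectorSize + " -> "
                + gatherLoadCount(maxVectorSize, 16, Byte.BYTES) + " gather-load(s)");
        }
    }
}
```

With a 16-byte byte vector this gives 4, 2, 1 and 1 gather-loads for `MaxVectorSize` of 16, 32, 64 and 128.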
Using a 128-bit byte vector gather load as an example, there are four scenarios with different `MaxVectorSize`:
1. `MaxVectorSize = 16, byte_vector_size = 16`:
- Can load 4 indices per vector register
- So can finish 4 bytes per gather-load operation
- Requires 4 gather-loads and a final merge
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
4 gather-load:
idx_v1 = [1 4 2 3] gather_v1 = [0000 0000 0000 becd]
idx_v2 = [2 5 7 5] gather_v2 = [0000 0000 0000 cfhf]
idx_v3 = [1 7 6 0] gather_v3 = [0000 0000 0000 bhga]
idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
merge: v = [jlkp bhga cfhf becd]
```
2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
- Can load 8 indices per vector register
- So can finish 8 bytes per gather-load operation
- Requires 2 gather-loads and a merge
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
2 gather-load:
idx_v1 = [2 5 7 5 1 4 2 3]
idx_v2 = [9 11 10 15 1 7 6 0]
gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
```
3. `MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4`:
- Can load 16 indices per vector register
- So can finish 16 bytes per gather-load operation
- No splitting required
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
1 gather-load:
idx_v = [9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
v = [... 0000 0000 0000 0000 jlkp bhga cfhf becd]
```
4. `MaxVectorSize > 64, byte_vector_size < MaxVectorSize / 4`:
- Can load 32+ indices per vector register
- So can finish 16 bytes per gather-load operation
- Requires masking so that only the 16 active elements are loaded, which keeps the memory accesses safe.
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
1 gather-load:
idx_v = [... 0 0 0 0 0 0 0 0 9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
v = [... 0000 0000 0000 0000 0000 jlkp bhga cfhf becd]
```
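The four scenarios can be modelled with plain scalar Java. The sketch below is not the SVE code generated by the patch; the class and method names are hypothetical, and it only mimics the split into partial gathers followed by a merge. It prints lane 0 first, whereas the diagrams above list the highest lane first, so `dceb` here corresponds to `becd` there.

```java
import java.nio.charset.StandardCharsets;

// Scalar model only: hypothetical names, not the SVE implementation.
final class SubwordGatherSketch {
    /** One partial gather: loads `count` bytes selected by index[from .. from + count). */
    static byte[] gatherPart(byte[] arr, int[] index, int from, int count) {
        byte[] part = new byte[count];
        for (int i = 0; i < count; i++) {
            part[i] = arr[index[from + i]];
        }
        return part;
    }

    /** Gathers `lanes` bytes, splitting the work by the number of int indices per register. */
    static byte[] gather(byte[] arr, int[] index, int lanes, int maxVectorSizeBytes) {
        // Never use more than `lanes` indices: this models the mask of scenario 4.
        int perLoad = Math.min(lanes, maxVectorSizeBytes / Integer.BYTES);
        byte[] result = new byte[lanes];
        for (int from = 0; from < lanes; from += perLoad) {
            byte[] part = gatherPart(arr, index, from, perLoad);
            System.arraycopy(part, 0, result, from, perLoad); // merge step
        }
        return result;
    }

    /** Runs the example from the scenarios for a given MaxVectorSize. */
    static String demo(int maxVectorSizeBytes) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] index = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        return new String(gather(arr, index, 16, maxVectorSizeBytes), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        for (int maxVectorSize : new int[] {16, 32, 64, 128}) {
            System.out.println(maxVectorSize + ": " + demo(maxVectorSize));
        }
    }
}
```

Every `MaxVectorSize` yields the same 16 bytes, `dcebfhfcaghbpklj`; only the number of partial gathers changes.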
### Main changes:
1. Java-side API refactoring:
- The Java side may already generate multiple index vectors for index checking. This patch passes all of these index vectors to HotSpot, which eliminates the duplicate index vectors otherwise needed for the vector gather load operations on architectures like AArch64. The existing IGVN cannot remove the duplicates because the index vectors generated on the Java side and those generated by the compiler intrinsic are under different control flow.
2. C2 compiler IR refactoring:
- Generated different IR patterns for different architectures like AArch64 and X86, based on their different index requirements.
- Added two new IRs to the C2 compiler to help implement each part of the vector gather operation and to merge the results at the end.
- Refactored the `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types. This patch removes the memory offset input and adds it to the memory base `addr` at the IR level for architectures that need the index array, like X86. This not only simplifies the backend implementation, but also saves some add operations. Additionally, it unifies the IR for all types.
3. Backend changes:
- Added SVE match rules for the subword gather load operations and the newly added IRs.
- Refined the X86 implementation of subword gather, since the offset input has been removed at the IR level.
4. Test:
- Added IR tests for verification.
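The effect of folding the memory offset into the base address can be illustrated with a scalar model. This is a sketch under assumptions, not the C2 IR: the class and method names are hypothetical, and `base`, `offset` and `index` are plain array positions standing in for addresses.

```java
import java.util.Arrays;

// Scalar model only: hypothetical names, not C2 IR nodes.
final class OffsetFoldingSketch {
    /** Gather with a separate offset input: one extra add per element. */
    static byte[] gatherWithOffset(byte[] mem, int base, int offset, int[] index) {
        byte[] out = new byte[index.length];
        for (int i = 0; i < index.length; i++) {
            out[i] = mem[base + offset + index[i]];
        }
        return out;
    }

    /** Gather with the offset already folded into the base. */
    static byte[] gatherFromBase(byte[] mem, int base, int[] index) {
        byte[] out = new byte[index.length];
        for (int i = 0; i < index.length; i++) {
            out[i] = mem[base + index[i]];
        }
        return out;
    }

    /** Checks that both forms load the same elements for a small example. */
    static boolean sameResult() {
        byte[] mem = new byte[32];
        for (int i = 0; i < mem.length; i++) {
            mem[i] = (byte) i;
        }
        int base = 4;
        int offset = 8;
        int[] index = {3, 0, 7, 1};
        byte[] separate = gatherWithOffset(mem, base, offset, index);
        byte[] folded = gatherFromBase(mem, base + offset, index); // a single add, done once
        return Arrays.equals(separate, folded);
    }

    public static void main(String[] args) {
        System.out.println(sameResult());
    }
}
```

In this model the folded form performs the `base + offset` addition once instead of once per element, which mirrors the saved add operations mentioned above.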
### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- Passed vector api tests with all `UseAVX` flags on X86 and `UseSVE` flags on AArch64
- No regressions found
### Performance:
The performance of the corresponding JMH benchmarks improves by 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture. The following is the performance data:
Benchmark (SIZE) Mode Cnt Units Before After Gain
GatherOperationsBenchmark.microByteGather128 64 thrpt 30 ops/ms 13447.414 43184.611 3.21
GatherOperationsBenchmark.microByteGather128 256 thrpt 30 ops/ms 3361.944 11165.006 3.32
GatherOperationsBenchmark.microByteGather128 1024 thrpt 30 ops/ms 843.501 2830.108 3.35
GatherOperationsBenchmark.microByteGather128 4096 thrpt 30 ops/ms 211.096 712.958 3.37
GatherOperationsBenchmark.microByteGather128_MASK 64 thrpt 30 ops/ms 10627.297 42818.402 4.02
GatherOperationsBenchmark.microByteGather128_MASK 256 thrpt 30 ops/ms 2675.144 11055.874 4.13
GatherOperationsBenchmark.microByteGather128_MASK 1024 thrpt 30 ops/ms 677.742 2783.920 4.10
GatherOperationsBenchmark.microByteGather128_MASK 4096 thrpt 30 ops/ms 169.416 686.783 4.05
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 10592.545 42282.802 3.99
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 2680.060 11039.563 4.11
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 678.941 2790.252 4.10
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 169.985 691.157 4.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF 64 thrpt 30 ops/ms 13538.308 42954.988 3.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF 256 thrpt 30 ops/ms 3414.237 11227.333 3.28
GatherOperationsBenchmark.microByteGather128_NZ_OFF 1024 thrpt 30 ops/ms 850.098 2821.821 3.31
GatherOperationsBenchmark.microByteGather128_NZ_OFF 4096 thrpt 30 ops/ms 213.295 705.015 3.30
GatherOperationsBenchmark.microByteGather64 64 thrpt 30 ops/ms 8705.935 44213.982 5.07
GatherOperationsBenchmark.microByteGather64 256 thrpt 30 ops/ms 2186.620 11407.364 5.21
GatherOperationsBenchmark.microByteGather64 1024 thrpt 30 ops/ms 545.364 2845.370 5.21
GatherOperationsBenchmark.microByteGather64 4096 thrpt 30 ops/ms 136.376 718.532 5.26
GatherOperationsBenchmark.microByteGather64_MASK 64 thrpt 30 ops/ms 6530.636 42053.044 6.43
GatherOperationsBenchmark.microByteGather64_MASK 256 thrpt 30 ops/ms 1644.069 11323.223 6.88
GatherOperationsBenchmark.microByteGather64_MASK 1024 thrpt 30 ops/ms 416.093 2844.712 6.83
GatherOperationsBenchmark.microByteGather64_MASK 4096 thrpt 30 ops/ms 105.777 716.685 6.77
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 6619.260 42204.919 6.37
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 1668.304 11318.298 6.78
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 422.085 2844.398 6.73
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 105.722 716.543 6.77
GatherOperationsBenchmark.microByteGather64_NZ_OFF 64 thrpt 30 ops/ms 8754.073 44232.985 5.05
GatherOperationsBenchmark.microByteGather64_NZ_OFF 256 thrpt 30 ops/ms 2195.009 11408.702 5.19
GatherOperationsBenchmark.microByteGather64_NZ_OFF 1024 thrpt 30 ops/ms 546.530 2845.369 5.20
GatherOperationsBenchmark.microByteGather64_NZ_OFF 4096 thrpt 30 ops/ms 137.713 718.391 5.21
GatherOperationsBenchmark.microShortGather128 64 thrpt 30 ops/ms 8695.558 33438.398 3.84
GatherOperationsBenchmark.microShortGather128 256 thrpt 30 ops/ms 2189.766 8533.643 3.89
GatherOperationsBenchmark.microShortGather128 1024 thrpt 30 ops/ms 546.322 2145.239 3.92
GatherOperationsBenchmark.microShortGather128 4096 thrpt 30 ops/ms 136.503 537.493 3.93
GatherOperationsBenchmark.microShortGather128_MASK 64 thrpt 30 ops/ms 6656.883 33571.619 5.04
GatherOperationsBenchmark.microShortGather128_MASK 256 thrpt 30 ops/ms 1649.233 8533.728 5.17
GatherOperationsBenchmark.microShortGather128_MASK 1024 thrpt 30 ops/ms 421.687 2135.280 5.06
GatherOperationsBenchmark.microShortGather128_MASK 4096 thrpt 30 ops/ms 105.355 537.418 5.10
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 6675.782 33441.402 5.00
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 1681.000 8532.770 5.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 424.024 2135.485 5.03
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 106.507 537.674 5.04
GatherOperationsBenchmark.microShortGather128_NZ_OFF 64 thrpt 30 ops/ms 8796.279 33441.738 3.80
GatherOperationsBenchmark.microShortGather128_NZ_OFF 256 thrpt 30 ops/ms 2198.774 8562.333 3.89
GatherOperationsBenchmark.microShortGather128_NZ_OFF 1024 thrpt 30 ops/ms 546.991 2133.496 3.90
GatherOperationsBenchmark.microShortGather128_NZ_OFF 4096 thrpt 30 ops/ms 137.191 537.390 3.91
GatherOperationsBenchmark.microShortGather64 64 thrpt 30 ops/ms 5286.569 38042.434 7.19
GatherOperationsBenchmark.microShortGather64 256 thrpt 30 ops/ms 1312.778 9755.474 7.43
GatherOperationsBenchmark.microShortGather64 1024 thrpt 30 ops/ms 327.475 2450.755 7.48
GatherOperationsBenchmark.microShortGather64 4096 thrpt 30 ops/ms 82.490 613.481 7.43
GatherOperationsBenchmark.microShortGather64_MASK 64 thrpt 30 ops/ms 3525.102 37622.086 10.67
GatherOperationsBenchmark.microShortGather64_MASK 256 thrpt 30 ops/ms 877.877 9740.673 11.09
GatherOperationsBenchmark.microShortGather64_MASK 1024 thrpt 30 ops/ms 219.688 2446.063 11.13
GatherOperationsBenchmark.microShortGather64_MASK 4096 thrpt 30 ops/ms 54.935 613.137 11.16
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 3509.264 35147.895 10.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 880.523 9733.536 11.05
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 220.578 2465.951 11.17
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 55.790 620.465 11.12
GatherOperationsBenchmark.microShortGather64_NZ_OFF 64 thrpt 30 ops/ms 5271.218 35543.510 6.74
GatherOperationsBenchmark.microShortGather64_NZ_OFF 256 thrpt 30 ops/ms 1318.470 9735.321 7.38
GatherOperationsBenchmark.microShortGather64_NZ_OFF 1024 thrpt 30 ops/ms 328.695 2466.311 7.50
GatherOperationsBenchmark.microShortGather64_NZ_OFF 4096 thrpt 30 ops/ms 81.959 621.065 7.57
And here is the performance data on an X86 AVX-512 system, which shows that the performance improves by up to 39%.
Benchmark (SIZE) Mode Cnt Units Before After Gain
GatherOperationsBenchmark.microByteGather128 64 thrpt 30 ops/ms 44205.252 46829.437 1.05
GatherOperationsBenchmark.microByteGather128 256 thrpt 30 ops/ms 11243.202 12256.211 1.09
GatherOperationsBenchmark.microByteGather128 1024 thrpt 30 ops/ms 2824.094 3096.282 1.09
GatherOperationsBenchmark.microByteGather128 4096 thrpt 30 ops/ms 706.040 776.444 1.09
GatherOperationsBenchmark.microByteGather128_MASK 64 thrpt 30 ops/ms 46911.410 46321.310 0.98
GatherOperationsBenchmark.microByteGather128_MASK 256 thrpt 30 ops/ms 12850.712 12898.541 1.00
GatherOperationsBenchmark.microByteGather128_MASK 1024 thrpt 30 ops/ms 3099.038 3240.863 1.04
GatherOperationsBenchmark.microByteGather128_MASK 4096 thrpt 30 ops/ms 795.265 832.990 1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 43065.930 47164.936 1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 11537.805 13190.759 1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2763.036 3304.582 1.19
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 722.374 843.458 1.16
GatherOperationsBenchmark.microByteGather128_NZ_OFF 64 thrpt 30 ops/ms 44145.297 46845.845 1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF 256 thrpt 30 ops/ms 12172.421 12241.941 1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF 1024 thrpt 30 ops/ms 3097.042 3100.228 1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF 4096 thrpt 30 ops/ms 776.453 775.881 0.99
GatherOperationsBenchmark.microByteGather64 64 thrpt 30 ops/ms 58541.178 59464.156 1.01
GatherOperationsBenchmark.microByteGather64 256 thrpt 30 ops/ms 16063.284 17360.858 1.08
GatherOperationsBenchmark.microByteGather64 1024 thrpt 30 ops/ms 4126.798 4471.636 1.08
GatherOperationsBenchmark.microByteGather64 4096 thrpt 30 ops/ms 1045.116 1125.219 1.07
GatherOperationsBenchmark.microByteGather64_MASK 64 thrpt 30 ops/ms 35344.320 49062.831 1.38
GatherOperationsBenchmark.microByteGather64_MASK 256 thrpt 30 ops/ms 11946.622 13550.297 1.13
GatherOperationsBenchmark.microByteGather64_MASK 1024 thrpt 30 ops/ms 3275.053 3359.737 1.02
GatherOperationsBenchmark.microByteGather64_MASK 4096 thrpt 30 ops/ms 844.575 858.487 1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 43550.522 48875.831 1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 12216.995 13522.420 1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 3053.068 3391.067 1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 753.042 869.774 1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF 64 thrpt 30 ops/ms 52082.307 58847.230 1.12
GatherOperationsBenchmark.microByteGather64_NZ_OFF 256 thrpt 30 ops/ms 14210.930 17389.898 1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF 1024 thrpt 30 ops/ms 3697.996 4476.988 1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF 4096 thrpt 30 ops/ms 921.524 1125.308 1.22
GatherOperationsBenchmark.microShortGather128 64 thrpt 30 ops/ms 44325.212 44843.853 1.01
GatherOperationsBenchmark.microShortGather128 256 thrpt 30 ops/ms 11675.510 12630.103 1.08
GatherOperationsBenchmark.microShortGather128 1024 thrpt 30 ops/ms 1260.004 1373.395 1.09
GatherOperationsBenchmark.microShortGather128 4096 thrpt 30 ops/ms 761.857 814.790 1.06
GatherOperationsBenchmark.microShortGather128_MASK 64 thrpt 30 ops/ms 36339.450 36951.803 1.01
GatherOperationsBenchmark.microShortGather128_MASK 256 thrpt 30 ops/ms 9843.842 10018.754 1.01
GatherOperationsBenchmark.microShortGather128_MASK 1024 thrpt 30 ops/ms 2515.702 2595.312 1.03
GatherOperationsBenchmark.microShortGather128_MASK 4096 thrpt 30 ops/ms 616.450 661.402 1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 34078.747 33712.577 0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 9018.316 8515.947 0.94
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2250.813 2595.847 1.15
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 563.182 659.087 1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF 64 thrpt 30 ops/ms 39909.543 44063.331 1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF 256 thrpt 30 ops/ms 10690.582 12437.166 1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF 1024 thrpt 30 ops/ms 2677.219 3151.078 1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF 4096 thrpt 30 ops/ms 681.705 802.929 1.17
GatherOperationsBenchmark.microShortGather64 64 thrpt 30 ops/ms 45836.789 50883.505 1.11
GatherOperationsBenchmark.microShortGather64 256 thrpt 30 ops/ms 12269.355 13614.567 1.10
GatherOperationsBenchmark.microShortGather64 1024 thrpt 30 ops/ms 3010.548 3437.973 1.14
GatherOperationsBenchmark.microShortGather64 4096 thrpt 30 ops/ms 734.634 899.070 1.22
GatherOperationsBenchmark.microShortGather64_MASK 64 thrpt 30 ops/ms 39753.487 39319.742 0.98
GatherOperationsBenchmark.microShortGather64_MASK 256 thrpt 30 ops/ms 10615.540 10648.996 1.00
GatherOperationsBenchmark.microShortGather64_MASK 1024 thrpt 30 ops/ms 2653.485 2782.477 1.04
GatherOperationsBenchmark.microShortGather64_MASK 4096 thrpt 30 ops/ms 678.165 686.024 1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 37742.593 40491.965 1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 10096.251 11036.785 1.09
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2526.374 2812.550 1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 642.484 656.152 1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF 64 thrpt 30 ops/ms 40602.930 50921.048 1.25
GatherOperationsBenchmark.microShortGather64_NZ_OFF 256 thrpt 30 ops/ms 10972.083 14151.666 1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF 1024 thrpt 30 ops/ms 2726.248 3662.293 1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF 4096 thrpt 30 ops/ms 670.735 933.299 1.39
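The Gain column in both tables appears to be the After throughput divided by the Before throughput. A minimal sketch of that arithmetic, with a hypothetical helper name and two rows taken from the tables above:

```java
import java.util.Locale;

// Sketch only: hypothetical helper for the Gain arithmetic.
final class GainColumn {
    /** Ratio of After to Before throughput, formatted with two decimals. */
    static String gain(double before, double after) {
        return String.format(Locale.ROOT, "%.2f", after / before);
    }

    public static void main(String[] args) {
        // AArch64, microByteGather128, SIZE=64
        System.out.println(gain(13447.414, 43184.611)); // prints 3.21
        // X86, microShortGather64_NZ_OFF, SIZE=4096
        System.out.println(gain(670.735, 933.299));     // prints 1.39
    }
}
```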
[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en
-------------
Commit messages:
- 8351623: VectorAPI: Refactor subword gather load and add SVE implementation
Changes: https://git.openjdk.org/jdk/pull/24679/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24679&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8351623
Stats: 1367 lines in 34 files changed: 915 ins; 180 del; 272 mod
Patch: https://git.openjdk.org/jdk/pull/24679.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/24679/head:pull/24679
PR: https://git.openjdk.org/jdk/pull/24679