RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation
Xiaohong Gong
xgong at openjdk.org
Thu Jul 10 07:10:23 UTC 2025
This is a follow-up patch to [1] that implements the subword gather load APIs for the AArch64 SVE platform.
### Background
Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices [2][3]. The vector size of a gather-load instruction is determined by the index vector (i.e. by its `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
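As a plain-Java illustration of the semantics described above, the sketch below models a `byte` gather load with a scalar loop and computes the index vector width from the `32 * elem_num` formula. The class and method names (`GatherModel`, `gatherBytes`, `indexVectorBits`) are hypothetical and exist only for this example; this is not the HotSpot or Vector API implementation.

```java
// Scalar model of a subword gather load; an illustration only.
final class GatherModel {
    /** For each lane i, loads src[base + idx[i]]; this mirrors the gather semantics. */
    static byte[] gatherBytes(byte[] src, int base, int[] idx, int lanes) {
        byte[] out = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            out[i] = src[base + idx[i]];
        }
        return out;
    }

    /** Width in bits of the index vector needed for `lanes` elements (32-bit int indices). */
    static int indexVectorBits(int lanes) {
        return Integer.SIZE * lanes;
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] idx = {7, 0, 3, 3};
        byte[] v = gatherBytes(src, 0, idx, 4);
        System.out.println(java.util.Arrays.toString(v)); // [80, 10, 40, 40]
        System.out.println(indexVectorBits(16));          // 512
    }
}
```

Note that 16 `byte` elements occupy only 128 bits of data, yet their indices need 512 bits; this mismatch is what drives the splitting discussed below.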
### Implementation
#### Challenges
Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
On a 512-bit SVE machine, loading a `byte` vector requires a different approach for each vector species:
- SPECIES_64: Single operation with mask (8 elements, 256-bit)
- SPECIES_128: Single operation, full register (16 elements, 512-bit)
- SPECIES_256: Two operations + merge (32 elements, 1024-bit)
- SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
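The operation counts in this list follow from dividing the index vector size by the register size and rounding up. A minimal sketch of that arithmetic, assuming 32-bit indices and using hypothetical names (`GatherSizing`, `gatherOps`, `needsPartialMask`) that are not part of the patch:

```java
// Sizing arithmetic for subword gather loads; illustrative only.
final class GatherSizing {
    static final int INDEX_BITS = Integer.SIZE; // indices are int elements

    /** Number of gather-load operations needed for `lanes` subword elements. */
    static int gatherOps(int lanes, int sveRegisterBits) {
        int indexVectorBits = lanes * INDEX_BITS;
        return (indexVectorBits + sveRegisterBits - 1) / sveRegisterBits; // ceiling division
    }

    /** True when the indices fill only part of one register, so a mask is needed. */
    static boolean needsPartialMask(int lanes, int sveRegisterBits) {
        return lanes * INDEX_BITS < sveRegisterBits;
    }

    public static void main(String[] args) {
        int registerBits = 512;
        for (int lanes : new int[]{8, 16, 32, 64}) { // byte lanes of SPECIES_64..SPECIES_512
            System.out.println(lanes + " lanes: " + gatherOps(lanes, registerBits)
                    + " op(s), partial mask = " + needsPartialMask(lanes, registerBits));
        }
    }
}
```

With `registerBits = 512` this reproduces the four cases listed above: one masked operation, one full-register operation, two operations, and four operations.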
Use `ByteVector.SPECIES_512` as an example:
- It contains 64 elements, so the index vector size is `64 * 32` bits, which is 4 times the SVE vector register size.
- It therefore requires 4 vector gather-loads to complete the whole operation.
byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
int[] idx = [0, 1, 2, 3, ..., 63, ...]
4 gather-loads:
idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
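The split-and-merge flow in the example above can be modeled in scalar Java as follows. This is a sketch under stated assumptions: `SplitGather` and its parameters are hypothetical names, and the merge is done with a plain array copy rather than with the vector operations the patch uses.

```java
// Scalar model of splitting one wide gather into several register-sized gathers.
final class SplitGather {
    /** Gathers `lanes` bytes in chunks of `lanesPerOp` indices, then merges the partial results. */
    static byte[] gather(byte[] src, int[] idx, int lanes, int lanesPerOp) {
        byte[] merged = new byte[lanes];
        for (int start = 0; start < lanes; start += lanesPerOp) {
            int n = Math.min(lanesPerOp, lanes - start);
            // One gather-load: only the low n lanes carry data, higher lanes stay zero.
            byte[] part = new byte[lanes];
            for (int i = 0; i < n; i++) {
                part[i] = src[idx[start + i]];
            }
            // Merge: place the partial result at its lane offset in the final vector.
            System.arraycopy(part, 0, merged, start, n);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[64];
        int[] idx = new int[64];
        for (int i = 0; i < 64; i++) {
            arr[i] = (byte) ('a' + i / 16); // 16 x 'a', 16 x 'b', 16 x 'c', 16 x 'd'
            idx[i] = i;
        }
        byte[] v = gather(arr, idx, 64, 16); // 4 gather-loads of 16 lanes each
        System.out.println(new String(v, java.nio.charset.StandardCharsets.US_ASCII));
    }
}
```

With the identity index array from the example, the merged result equals the first 64 bytes of `arr`, matching the `a`/`b`/`c`/`d` layout shown in the diagram.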
#### Solution
The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
Here are the main changes:
- Enhanced IR generation with architecture-specific patterns based on the `gather_scatter_needs_vector_index()` matcher.
- Added `VectorSliceNode` for result merging.
- Added `VectorMaskWidenNode` for mask splitting and type conversion for masked gather-load.
- Implemented SVE match rules for subword gather operations.
- Added comprehensive IR tests for verification.
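To illustrate why a masked gather-load needs its mask split as well, here is a scalar Java model in which each register-sized gather receives only the part of the subword mask that covers its lanes. The names (`MaskedSplitGather`, `maskChunk`) are hypothetical, and the code is not a description of how `VectorMaskWidenNode` or `VectorSliceNode` work internally; it only assumes that inactive lanes produce zero and are never loaded.

```java
// Scalar model of a masked subword gather that is split into several operations.
final class MaskedSplitGather {
    /** Extracts the part of the mask that governs one gather-load. */
    static boolean[] maskChunk(boolean[] mask, int start, int lanesPerOp) {
        boolean[] chunk = new boolean[lanesPerOp];
        for (int i = 0; i < lanesPerOp && start + i < mask.length; i++) {
            chunk[i] = mask[start + i];
        }
        return chunk;
    }

    /** Masked gather over mask.length lanes, performed in chunks of lanesPerOp lanes. */
    static byte[] gather(byte[] src, int[] idx, boolean[] mask, int lanesPerOp) {
        int lanes = mask.length;
        byte[] merged = new byte[lanes]; // inactive lanes keep the default value 0
        for (int start = 0; start < lanes; start += lanesPerOp) {
            boolean[] m = maskChunk(mask, start, lanesPerOp);
            for (int i = 0; i < lanesPerOp && start + i < lanes; i++) {
                // Inactive lanes are skipped, so their indices are never dereferenced.
                if (m[i]) {
                    merged[start + i] = src[idx[start + i]];
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] idx = {0, 99, 2, 3, -1, 5, 6, 7}; // 99 and -1 sit under inactive lanes
        boolean[] mask = {true, false, true, true, false, true, true, true};
        byte[] v = gather(src, idx, mask, 4);
        System.out.println(java.util.Arrays.toString(v)); // [1, 0, 3, 4, 0, 6, 7, 8]
    }
}
```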
### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- No regressions found
### Performance:
The corresponding JMH benchmarks improve by 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture. The performance data follows:
| Benchmark | SIZE | Mode | Cnt | Unit | Before | After | Gain |
|---|---|---|---|---|---|---|---|
| GatherOperationsBenchmark.microByteGather128 | 64 | thrpt | 30 | ops/ms | 13500.891 | 46721.307 | 3.46 |
| GatherOperationsBenchmark.microByteGather128 | 256 | thrpt | 30 | ops/ms | 3378.186 | 12321.847 | 3.64 |
| GatherOperationsBenchmark.microByteGather128 | 1024 | thrpt | 30 | ops/ms | 844.871 | 3144.217 | 3.72 |
| GatherOperationsBenchmark.microByteGather128 | 4096 | thrpt | 30 | ops/ms | 211.386 | 783.337 | 3.70 |
| GatherOperationsBenchmark.microByteGather128_MASK | 64 | thrpt | 30 | ops/ms | 10605.664 | 46124.957 | 4.34 |
| GatherOperationsBenchmark.microByteGather128_MASK | 256 | thrpt | 30 | ops/ms | 2668.531 | 12292.350 | 4.60 |
| GatherOperationsBenchmark.microByteGather128_MASK | 1024 | thrpt | 30 | ops/ms | 676.218 | 3074.224 | 4.54 |
| GatherOperationsBenchmark.microByteGather128_MASK | 4096 | thrpt | 30 | ops/ms | 169.402 | 817.227 | 4.82 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 10615.723 | 46122.380 | 4.34 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 2671.931 | 12222.473 | 4.57 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 678.437 | 3091.970 | 4.55 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 170.310 | 813.967 | 4.77 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 64 | thrpt | 30 | ops/ms | 13524.671 | 47223.082 | 3.49 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 256 | thrpt | 30 | ops/ms | 3411.813 | 12343.308 | 3.61 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 847.919 | 3129.065 | 3.69 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 212.790 | 787.953 | 3.70 |
| GatherOperationsBenchmark.microByteGather64 | 64 | thrpt | 30 | ops/ms | 8717.294 | 48176.937 | 5.52 |
| GatherOperationsBenchmark.microByteGather64 | 256 | thrpt | 30 | ops/ms | 2184.345 | 12347.113 | 5.65 |
| GatherOperationsBenchmark.microByteGather64 | 1024 | thrpt | 30 | ops/ms | 546.093 | 3070.851 | 5.62 |
| GatherOperationsBenchmark.microByteGather64 | 4096 | thrpt | 30 | ops/ms | 136.724 | 767.656 | 5.61 |
| GatherOperationsBenchmark.microByteGather64_MASK | 64 | thrpt | 30 | ops/ms | 6576.504 | 48588.806 | 7.38 |
| GatherOperationsBenchmark.microByteGather64_MASK | 256 | thrpt | 30 | ops/ms | 1653.073 | 12341.291 | 7.46 |
| GatherOperationsBenchmark.microByteGather64_MASK | 1024 | thrpt | 30 | ops/ms | 416.590 | 3070.680 | 7.37 |
| GatherOperationsBenchmark.microByteGather64_MASK | 4096 | thrpt | 30 | ops/ms | 105.743 | 767.790 | 7.26 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 6628.974 | 48628.463 | 7.33 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 1676.767 | 12338.116 | 7.35 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 422.612 | 3070.987 | 7.26 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 105.033 | 767.563 | 7.30 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 64 | thrpt | 30 | ops/ms | 8754.635 | 48525.395 | 5.54 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 256 | thrpt | 30 | ops/ms | 2182.044 | 12338.096 | 5.65 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 547.353 | 3071.666 | 5.61 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 137.853 | 767.745 | 5.56 |
| GatherOperationsBenchmark.microShortGather128 | 64 | thrpt | 30 | ops/ms | 8713.480 | 37696.121 | 4.32 |
| GatherOperationsBenchmark.microShortGather128 | 256 | thrpt | 30 | ops/ms | 2189.636 | 9479.710 | 4.32 |
| GatherOperationsBenchmark.microShortGather128 | 1024 | thrpt | 30 | ops/ms | 545.435 | 2378.492 | 4.36 |
| GatherOperationsBenchmark.microShortGather128 | 4096 | thrpt | 30 | ops/ms | 136.213 | 595.504 | 4.37 |
| GatherOperationsBenchmark.microShortGather128_MASK | 64 | thrpt | 30 | ops/ms | 6665.844 | 37765.315 | 5.66 |
| GatherOperationsBenchmark.microShortGather128_MASK | 256 | thrpt | 30 | ops/ms | 1673.950 | 9482.207 | 5.66 |
| GatherOperationsBenchmark.microShortGather128_MASK | 1024 | thrpt | 30 | ops/ms | 420.628 | 2378.813 | 5.65 |
| GatherOperationsBenchmark.microShortGather128_MASK | 4096 | thrpt | 30 | ops/ms | 105.128 | 595.412 | 5.66 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 6699.594 | 37698.398 | 5.62 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 1682.128 | 9480.355 | 5.63 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 421.942 | 2380.449 | 5.64 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 106.587 | 595.560 | 5.58 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 64 | thrpt | 30 | ops/ms | 8788.830 | 37709.493 | 4.29 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 256 | thrpt | 30 | ops/ms | 2199.706 | 9485.769 | 4.31 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 548.309 | 2380.494 | 4.34 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 137.434 | 595.448 | 4.33 |
| GatherOperationsBenchmark.microShortGather64 | 64 | thrpt | 30 | ops/ms | 5296.860 | 37797.813 | 7.13 |
| GatherOperationsBenchmark.microShortGather64 | 256 | thrpt | 30 | ops/ms | 1321.738 | 9602.510 | 7.26 |
| GatherOperationsBenchmark.microShortGather64 | 1024 | thrpt | 30 | ops/ms | 330.520 | 2404.013 | 7.27 |
| GatherOperationsBenchmark.microShortGather64 | 4096 | thrpt | 30 | ops/ms | 82.149 | 602.956 | 7.33 |
| GatherOperationsBenchmark.microShortGather64_MASK | 64 | thrpt | 30 | ops/ms | 3458.968 | 37851.452 | 10.94 |
| GatherOperationsBenchmark.microShortGather64_MASK | 256 | thrpt | 30 | ops/ms | 879.143 | 9616.554 | 10.93 |
| GatherOperationsBenchmark.microShortGather64_MASK | 1024 | thrpt | 30 | ops/ms | 220.256 | 2408.851 | 10.93 |
| GatherOperationsBenchmark.microShortGather64_MASK | 4096 | thrpt | 30 | ops/ms | 54.947 | 603.251 | 10.97 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 3521.856 | 37736.119 | 10.71 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 881.456 | 9602.649 | 10.89 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 220.122 | 2409.030 | 10.94 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 55.845 | 603.126 | 10.79 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 64 | thrpt | 30 | ops/ms | 5279.815 | 37698.023 | 7.14 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 256 | thrpt | 30 | ops/ms | 1307.935 | 9601.551 | 7.34 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 329.707 | 2409.962 | 7.30 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 82.092 | 603.380 | 7.35 |
[1] https://bugs.openjdk.org/browse/JDK-8355563
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en
-------------
Commit messages:
- 8351623: VectorAPI: Add SVE implementation of subword gather load operation
Changes: https://git.openjdk.org/jdk/pull/26236/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8351623
Stats: 972 lines in 22 files changed: 841 ins; 12 del; 119 mod
Patch: https://git.openjdk.org/jdk/pull/26236.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236
PR: https://git.openjdk.org/jdk/pull/26236