RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Xiaohong Gong xgong at openjdk.org
Thu Jul 10 07:10:23 UTC 2025


This is a follow-up patch to [1]. It implements the subword gather load APIs for the AArch64 SVE platform.

### Background
Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices (see [2][3]). The vector size of a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
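
As a rough illustration of the semantics described above, here is a scalar sketch in plain Java. The class and method names (`GatherLoadSketch`, `gather`, `indexVectorBits`) are made up for this example and are not part of the patch or of the Vector API; the sketch only mirrors the per-lane rule "lane `i` reads `arr[offset + indexMap[mapOffset + i]]`" and the `32 * elem_num` size formula.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of the patch.
final class GatherLoadSketch {

    // Scalar reference of a gather load: lane i receives
    // arr[offset + indexMap[mapOffset + i]].
    static byte[] gather(byte[] arr, int offset, int[] indexMap, int mapOffset, int lanes) {
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = arr[offset + indexMap[mapOffset + i]];
        }
        return result;
    }

    // Size in bits of the int index vector needed to load elemNum elements.
    static int indexVectorBits(int elemNum) {
        return Integer.SIZE * elemNum;
    }

    public static void main(String[] args) {
        byte[] arr = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] idx = {7, 0, 3, 3};
        System.out.println(Arrays.toString(gather(arr, 0, idx, 0, idx.length)));
        // 16 byte elements need a 512-bit index vector.
        System.out.println(indexVectorBits(16));
    }
}
```

The index vector is four times as wide as a `byte` result and twice as wide as a `short` result, which is where the splitting problem discussed in the next section comes from.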

### Implementation

#### Challenges
Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

For a 512-bit SVE machine, loading a `byte` vector requires a different approach for each vector species:
- SPECIES_64: Single operation with mask (8 elements, 256-bit)
- SPECIES_128: Single operation, full register (16 elements, 512-bit)
- SPECIES_256: Two operations + merge (32 elements, 1024-bit)
- SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
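
The numbers in the list above follow from simple arithmetic, sketched below. The class `GatherSplitPlan` and its methods are hypothetical names chosen for this example; the sketch assumes that the number of gather-loads is the index vector size divided by the SVE register size, rounded up, and that a mask is needed when the index vector does not fill the register.

```java
// Hypothetical helper class for illustration only; not part of the patch.
final class GatherSplitPlan {
    static final int INDEX_BITS = Integer.SIZE; // each index is a 32-bit int

    // Number of elements in a vector of speciesBits holding elemBits-wide elements.
    static int elementCount(int speciesBits, int elemBits) {
        return speciesBits / elemBits;
    }

    // Total size of the int index vector, in bits.
    static int indexVectorBits(int speciesBits, int elemBits) {
        return elementCount(speciesBits, elemBits) * INDEX_BITS;
    }

    // Number of gather-load operations needed on a register of sveRegBits.
    static int gatherOps(int speciesBits, int elemBits, int sveRegBits) {
        int bits = indexVectorBits(speciesBits, elemBits);
        return (bits + sveRegBits - 1) / sveRegBits; // ceiling division
    }

    // True when a single gather only partially fills the register.
    static boolean needsPartialMask(int speciesBits, int elemBits, int sveRegBits) {
        return indexVectorBits(speciesBits, elemBits) < sveRegBits;
    }

    public static void main(String[] args) {
        int sveRegBits = 512;
        for (int speciesBits : new int[] {64, 128, 256, 512}) {
            System.out.printf("byte SPECIES_%d: %d elements, %d index bits, %d op(s), partial=%b%n",
                    speciesBits,
                    elementCount(speciesBits, Byte.SIZE),
                    indexVectorBits(speciesBits, Byte.SIZE),
                    gatherOps(speciesBits, Byte.SIZE, sveRegBits),
                    needsPartialMask(speciesBits, Byte.SIZE, sveRegBits));
        }
    }
}
```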

Use `ByteVector.SPECIES_512` as an example:
- It contains 64 elements, so the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
- It therefore requires 4 vector gather-loads to complete the whole operation.


byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
int[] idx = [0, 1, 2, 3, ..., 63, ...]

4 gather-loads:
idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
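
The split-and-merge scheme in the diagram above can be simulated in scalar Java as follows. This is only a sketch: the class `SubwordGatherMergeSketch` and its method are hypothetical, 16 lanes per gather is assumed (a 512-bit register divided by 32-bit indices), and a plain array copy stands in for the merge step, which the patch performs with vector IR nodes rather than array copies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation for illustration only; not part of the patch.
final class SubwordGatherMergeSketch {
    // Assumption: 512-bit SVE register / 32-bit int index = 16 lanes per gather.
    static final int LANES_PER_GATHER = 16;

    static byte[] gatherInParts(byte[] arr, int[] idx, int lanes) {
        byte[] merged = new byte[lanes];
        for (int base = 0; base < lanes; base += LANES_PER_GATHER) {
            int count = Math.min(LANES_PER_GATHER, lanes - base);
            // One gather-load: results land in the low lanes, the rest stay zero,
            // like gather_v1 .. gather_v4 in the diagram.
            byte[] partial = new byte[lanes];
            for (int i = 0; i < count; i++) {
                partial[i] = arr[idx[base + i]];
            }
            // Merge: move the low lanes of this partial result to their final position.
            System.arraycopy(partial, 0, merged, base, count);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[64];
        int[] idx = new int[64];
        for (int i = 0; i < 64; i++) {
            arr[i] = (byte) ('a' + i / 16); // 16 x 'a', 16 x 'b', 16 x 'c', 16 x 'd'
            idx[i] = i;
        }
        byte[] v = gatherInParts(arr, idx, 64);
        System.out.println(new String(v, StandardCharsets.US_ASCII));
    }
}
```

Note that the diagram lists lanes from high to low, while the array printed by this sketch runs from lane 0 upward.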


#### Solution
The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.

Here are the main changes:
- Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher.
- Added `VectorSliceNode` for result merging.
- Added `VectorMaskWidenNode` for mask splitting and type conversion for masked gather-load.
- Implemented SVE match rules for subword gather operations.
- Added comprehensive IR tests for verification.


### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- No regressions found

### Performance:
The performance of the corresponding JMH benchmarks improves by 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture. The performance data is as follows:


Benchmark                                                 SIZE Mode   Cnt Unit   Before      After   Gain
GatherOperationsBenchmark.microByteGather128              64   thrpt  30  ops/ms 13500.891 46721.307 3.46
GatherOperationsBenchmark.microByteGather128              256  thrpt  30  ops/ms  3378.186 12321.847 3.64
GatherOperationsBenchmark.microByteGather128              1024 thrpt  30  ops/ms   844.871  3144.217 3.72
GatherOperationsBenchmark.microByteGather128              4096 thrpt  30  ops/ms   211.386   783.337 3.70
GatherOperationsBenchmark.microByteGather128_MASK         64   thrpt  30  ops/ms 10605.664 46124.957 4.34
GatherOperationsBenchmark.microByteGather128_MASK         256  thrpt  30  ops/ms  2668.531 12292.350 4.60
GatherOperationsBenchmark.microByteGather128_MASK         1024 thrpt  30  ops/ms   676.218  3074.224 4.54
GatherOperationsBenchmark.microByteGather128_MASK         4096 thrpt  30  ops/ms   169.402   817.227 4.82
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  64   thrpt  30  ops/ms 10615.723 46122.380 4.34
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  256  thrpt  30  ops/ms  2671.931 12222.473 4.57
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  1024 thrpt  30  ops/ms   678.437  3091.970 4.55
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  4096 thrpt  30  ops/ms   170.310   813.967 4.77
GatherOperationsBenchmark.microByteGather128_NZ_OFF       64   thrpt  30  ops/ms 13524.671 47223.082 3.49
GatherOperationsBenchmark.microByteGather128_NZ_OFF       256  thrpt  30  ops/ms  3411.813 12343.308 3.61
GatherOperationsBenchmark.microByteGather128_NZ_OFF       1024 thrpt  30  ops/ms   847.919  3129.065 3.69
GatherOperationsBenchmark.microByteGather128_NZ_OFF       4096 thrpt  30  ops/ms   212.790   787.953 3.70
GatherOperationsBenchmark.microByteGather64               64   thrpt  30  ops/ms  8717.294 48176.937 5.52
GatherOperationsBenchmark.microByteGather64               256  thrpt  30  ops/ms  2184.345 12347.113 5.65
GatherOperationsBenchmark.microByteGather64               1024 thrpt  30  ops/ms   546.093  3070.851 5.62
GatherOperationsBenchmark.microByteGather64               4096 thrpt  30  ops/ms   136.724   767.656 5.61
GatherOperationsBenchmark.microByteGather64_MASK          64   thrpt  30  ops/ms  6576.504 48588.806 7.38
GatherOperationsBenchmark.microByteGather64_MASK          256  thrpt  30  ops/ms  1653.073 12341.291 7.46
GatherOperationsBenchmark.microByteGather64_MASK          1024 thrpt  30  ops/ms   416.590  3070.680 7.37
GatherOperationsBenchmark.microByteGather64_MASK          4096 thrpt  30  ops/ms   105.743   767.790 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   64   thrpt  30  ops/ms  6628.974 48628.463 7.33
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   256  thrpt  30  ops/ms  1676.767 12338.116 7.35
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   1024 thrpt  30  ops/ms   422.612  3070.987 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   4096 thrpt  30  ops/ms   105.033   767.563 7.30
GatherOperationsBenchmark.microByteGather64_NZ_OFF        64   thrpt  30  ops/ms  8754.635 48525.395 5.54
GatherOperationsBenchmark.microByteGather64_NZ_OFF        256  thrpt  30  ops/ms  2182.044 12338.096 5.65
GatherOperationsBenchmark.microByteGather64_NZ_OFF        1024 thrpt  30  ops/ms   547.353  3071.666 5.61
GatherOperationsBenchmark.microByteGather64_NZ_OFF        4096 thrpt  30  ops/ms   137.853   767.745 5.56
GatherOperationsBenchmark.microShortGather128             64   thrpt  30  ops/ms  8713.480 37696.121 4.32
GatherOperationsBenchmark.microShortGather128             256  thrpt  30  ops/ms  2189.636  9479.710 4.32
GatherOperationsBenchmark.microShortGather128             1024 thrpt  30  ops/ms   545.435  2378.492 4.36
GatherOperationsBenchmark.microShortGather128             4096 thrpt  30  ops/ms   136.213   595.504 4.37
GatherOperationsBenchmark.microShortGather128_MASK        64   thrpt  30  ops/ms  6665.844 37765.315 5.66
GatherOperationsBenchmark.microShortGather128_MASK        256  thrpt  30  ops/ms  1673.950  9482.207 5.66
GatherOperationsBenchmark.microShortGather128_MASK        1024 thrpt  30  ops/ms   420.628  2378.813 5.65
GatherOperationsBenchmark.microShortGather128_MASK        4096 thrpt  30  ops/ms   105.128   595.412 5.66
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64   thrpt  30  ops/ms  6699.594 37698.398 5.62
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256  thrpt  30  ops/ms  1682.128  9480.355 5.63
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt  30  ops/ms   421.942  2380.449 5.64
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt  30  ops/ms   106.587   595.560 5.58
GatherOperationsBenchmark.microShortGather128_NZ_OFF      64   thrpt  30  ops/ms  8788.830 37709.493 4.29
GatherOperationsBenchmark.microShortGather128_NZ_OFF      256  thrpt  30  ops/ms  2199.706  9485.769 4.31
GatherOperationsBenchmark.microShortGather128_NZ_OFF      1024 thrpt  30  ops/ms   548.309  2380.494 4.34
GatherOperationsBenchmark.microShortGather128_NZ_OFF      4096 thrpt  30  ops/ms   137.434   595.448 4.33
GatherOperationsBenchmark.microShortGather64              64   thrpt  30  ops/ms  5296.860 37797.813 7.13
GatherOperationsBenchmark.microShortGather64              256  thrpt  30  ops/ms  1321.738  9602.510 7.26
GatherOperationsBenchmark.microShortGather64              1024 thrpt  30  ops/ms   330.520  2404.013 7.27
GatherOperationsBenchmark.microShortGather64              4096 thrpt  30  ops/ms    82.149   602.956 7.33
GatherOperationsBenchmark.microShortGather64_MASK         64   thrpt  30  ops/ms  3458.968 37851.452 10.94
GatherOperationsBenchmark.microShortGather64_MASK         256  thrpt  30  ops/ms   879.143  9616.554 10.93
GatherOperationsBenchmark.microShortGather64_MASK         1024 thrpt  30  ops/ms   220.256  2408.851 10.93
GatherOperationsBenchmark.microShortGather64_MASK         4096 thrpt  30  ops/ms    54.947   603.251 10.97
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  64   thrpt  30  ops/ms  3521.856 37736.119 10.71
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  256  thrpt  30  ops/ms   881.456  9602.649 10.89
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  1024 thrpt  30  ops/ms   220.122  2409.030 10.94
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  4096 thrpt  30  ops/ms    55.845   603.126 10.79
GatherOperationsBenchmark.microShortGather64_NZ_OFF       64   thrpt  30  ops/ms  5279.815 37698.023 7.14
GatherOperationsBenchmark.microShortGather64_NZ_OFF       256  thrpt  30  ops/ms  1307.935  9601.551 7.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF       1024 thrpt  30  ops/ms   329.707  2409.962 7.30
GatherOperationsBenchmark.microShortGather64_NZ_OFF       4096 thrpt  30  ops/ms    82.092   603.380 7.35


[1] https://bugs.openjdk.org/browse/JDK-8355563
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en

-------------

Commit messages:
 - 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Changes: https://git.openjdk.org/jdk/pull/26236/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8351623
  Stats: 972 lines in 22 files changed: 841 ins; 12 del; 119 mod
  Patch: https://git.openjdk.org/jdk/pull/26236.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236

PR: https://git.openjdk.org/jdk/pull/26236

