RFR: 8351623: VectorAPI: Refactor subword gather load and add SVE implementation

Xiaohong Gong xgong at openjdk.org
Wed Apr 16 09:03:43 UTC 2025


### Summary:
[JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added HotSpot intrinsics for the subword gather load APIs on X86 platforms [1]. This patch implements the equivalent functionality for the AArch64 SVE platform. In addition to the AArch64 backend support, it refactors the Java-side API implementation and the compiler mid-end to make the operations more efficient and more maintainable across architectures.

### Background:
Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices stored in an int array. SVE provides native vector gather load instructions for byte/short types that take the indices in an int vector (see [2][3]).
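
As a point of reference, the gather semantics described above can be written as a scalar loop. This is only an illustrative sketch, not code from the patch: the class name `ScalarGatherSketch` and the method `gather` are hypothetical, and the parameter names are chosen here for illustration. Each lane `i` receives `arr[offset + indexMap[mapOffset + i]]`.

```java
// Hypothetical scalar model of a byte gather load (not part of the patch).
final class ScalarGatherSketch {
    // Lane i receives arr[offset + indexMap[mapOffset + i]].
    static byte[] gather(byte[] arr, int offset, int[] indexMap, int mapOffset, int lanes) {
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = arr[offset + indexMap[mapOffset + i]];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        int[] index = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        byte[] v = gather(arr, 0, index, 0, 16);
        // Lane 0 is printed first, so this reads left-to-right from the lowest lane.
        System.out.println(new String(v, java.nio.charset.StandardCharsets.US_ASCII));
    }
}
```

The array and index values are the ones used in the scenarios in this description; the scenarios print lanes right-to-left, while this sketch prints lane 0 first.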

The number of loaded elements must match the index vector's element count. Since an int element is 4 times the size of a byte element and 2 times the size of a short element, and given the `MaxVectorSize` constraint, the operation may need to be split into multiple parts.
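
The arithmetic behind the splitting can be sketched as follows. This is a simplified model under the assumption that the only constraint is fitting the int indices into registers of `MaxVectorSize` bytes; the class `GatherPartsSketch` and the method `gatherParts` are hypothetical names, not identifiers from the patch.

```java
// Hypothetical helper: how many gather-load parts a subword vector needs (not part of the patch).
final class GatherPartsSketch {
    static int gatherParts(int vectorSizeBytes, int elemSizeBytes, int maxVectorSizeBytes) {
        int lanes = vectorSizeBytes / elemSizeBytes;   // elements to load
        int indexBytes = lanes * Integer.BYTES;        // one int index per element
        // Round up to whole index registers; at least one gather-load is always needed.
        return Math.max(1, (indexBytes + maxVectorSizeBytes - 1) / maxVectorSizeBytes);
    }

    public static void main(String[] args) {
        // 128-bit byte vector with MaxVectorSize = 16, 32, 64, 128 bytes
        for (int max : new int[] {16, 32, 64, 128}) {
            System.out.println("MaxVectorSize=" + max + " parts=" + gatherParts(16, 1, max));
        }
        // 128-bit short vector with MaxVectorSize = 16 bytes
        System.out.println("short parts=" + gatherParts(16, 2, 16));
    }
}
```

For a 128-bit byte vector this gives 4, 2, 1 and 1 parts for `MaxVectorSize` of 16, 32, 64 and 128 bytes, and 2 parts for a 128-bit short vector with `MaxVectorSize = 16`.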

Using a 128-bit byte vector gather load as an example, there are four scenarios with different `MaxVectorSize`:

1. `MaxVectorSize = 16, byte_vector_size = 16`:
   - Can load 4 indices per vector register
   - So can finish 4 bytes per gather-load operation
   - Requires 4 gather-loads and a final merge
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   4 gather-loads:
   idx_v1 = [1 4 2 3]    gather_v1 = [0000 0000 0000 becd]
   idx_v2 = [2 5 7 5]    gather_v2 = [0000 0000 0000 cfhf]
   idx_v3 = [1 7 6 0]    gather_v3 = [0000 0000 0000 bhga]
   idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
   merge: v = [jlkp bhga cfhf becd]
   ```

2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
   - Can load 8 indices per vector register
   - So can finish 8 bytes per gather-load operation
   - Requires 2 gather-loads and a merge
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   2 gather-loads:
   idx_v1 = [2 5 7 5 1 4 2 3]
   idx_v2 = [9 11 10 15 1 7 6 0]
   gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
   gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
   merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
   ```

3. `MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4`:
   - Can load 16 indices per vector register
   - So can finish 16 bytes per gather-load operation
   - No splitting required
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   1 gather-load:
   idx_v = [9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
   v = [... 0000 0000 0000 0000 jlkp bhga cfhf becd]
   ```

4. `MaxVectorSize > 64, byte_vector_size < MaxVectorSize / 4`:
   - Can load 32+ indices per vector register
   - So can finish 16 bytes per gather-load operation
   - Requires a mask that activates only the 16 elements to be loaded, so
     that the remaining lanes do not access memory.
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   1 gather-load:
   idx_v = [... 0 0 0 0 0 0 0 0 9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
   v = [... 0000 0000 0000 0000 0000 jlkp bhga cfhf becd]
   ```
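
The four scenarios above can be modelled with a short scalar simulation. This is a sketch under simplifying assumptions, not the compiler's actual code: each "gather-load" is modelled as a loop over at most `MaxVectorSize / 4` indices, the merge is modelled as a copy into the result at the right lane position, and scenario 4's mask is modelled by capping the active lanes at the vector's lane count. The names `SplitGatherSketch` and `splitGather` are hypothetical.

```java
// Hypothetical scalar simulation of split gather-loads plus merge (not part of the patch).
final class SplitGatherSketch {
    static byte[] splitGather(byte[] arr, int[] index, int lanes, int maxVectorSizeBytes) {
        // Indices that fit in one register, capped at the lane count (the "mask" of scenario 4).
        int indicesPerPart = Math.min(lanes, maxVectorSizeBytes / Integer.BYTES);
        byte[] merged = new byte[lanes];
        for (int start = 0; start < lanes; start += indicesPerPart) {
            int active = Math.min(indicesPerPart, lanes - start);
            // One gather-load: fills the low 'active' lanes of a temporary vector.
            byte[] part = new byte[active];
            for (int i = 0; i < active; i++) {
                part[i] = arr[index[start + i]];
            }
            // Merge: place this part at its lane position in the final vector.
            System.arraycopy(part, 0, merged, start, active);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        int[] index = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        for (int max : new int[] {16, 32, 64, 128}) {
            byte[] v = splitGather(arr, index, 16, max);
            System.out.println(max + ": " + new String(v, java.nio.charset.StandardCharsets.US_ASCII));
        }
    }
}
```

All four `MaxVectorSize` values produce the same 16 bytes; only the number of gather-load parts differs. The output lists lane 0 first, so it is the mirror image of the right-to-left `merge` rows in the examples.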

### Main changes:
1. Java-side API refactoring:
   - The Java side may already generate multiple index vectors for index checking. This patch passes all of these index vectors to HotSpot, which eliminates the duplicate index vectors otherwise used for the vector gather load operations on architectures like AArch64. Existing IGVN cannot remove the duplicates because the index vectors generated on the Java side and those generated during compiler intrinsification are under different control flow.
2. C2 compiler IR refactoring:
   - Generate different IR patterns for different architectures like AArch64 and X86, based on the different index requirements.
   - Added two new IR nodes to the C2 compiler to help implement each part of the vector gather operation and to merge the results at the end.
   - Refactored the `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types. This patch removes the memory offset input and adds it to the memory base `addr` at the IR level for architectures that need the index array, like X86. This not only simplifies the backend implementation but also saves some add operations. Additionally, it unifies the IR for all types.
3. Backend changes:
   - Added SVE match rules for subword gather load operations and the newly added IR nodes.
   - Refined the X86 implementation of subword gather, since the offset input has been removed at the IR level.
4. Test:
   - Added IR tests for verification.

### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- Passed Vector API tests with all `UseAVX` flags on X86 and `UseSVE` flags on AArch64
- No regressions found

### Performance:
The corresponding JMH benchmarks improve by 3-11x on an NVIDIA Grace CPU, which is a 128-bit SVE2 architecture. The performance data follows:


Benchmark                                                (SIZE)   Mode Cnt  Units    Before     After    Gain
GatherOperationsBenchmark.microByteGather128                 64  thrpt  30  ops/ms  13447.414 43184.611  3.21
GatherOperationsBenchmark.microByteGather128                256  thrpt  30  ops/ms   3361.944 11165.006  3.32
GatherOperationsBenchmark.microByteGather128               1024  thrpt  30  ops/ms    843.501  2830.108  3.35
GatherOperationsBenchmark.microByteGather128               4096  thrpt  30  ops/ms    211.096   712.958  3.37
GatherOperationsBenchmark.microByteGather128_MASK            64  thrpt  30  ops/ms  10627.297 42818.402  4.02
GatherOperationsBenchmark.microByteGather128_MASK           256  thrpt  30  ops/ms   2675.144 11055.874  4.13
GatherOperationsBenchmark.microByteGather128_MASK          1024  thrpt  30  ops/ms    677.742  2783.920  4.10
GatherOperationsBenchmark.microByteGather128_MASK          4096  thrpt  30  ops/ms    169.416   686.783  4.05
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF     64  thrpt  30  ops/ms  10592.545 42282.802  3.99
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    256  thrpt  30  ops/ms   2680.060 11039.563  4.11
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   1024  thrpt  30  ops/ms    678.941  2790.252  4.10
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   4096  thrpt  30  ops/ms    169.985   691.157  4.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF          64  thrpt  30  ops/ms  13538.308 42954.988  3.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF         256  thrpt  30  ops/ms   3414.237 11227.333  3.28
GatherOperationsBenchmark.microByteGather128_NZ_OFF        1024  thrpt  30  ops/ms    850.098  2821.821  3.31
GatherOperationsBenchmark.microByteGather128_NZ_OFF        4096  thrpt  30  ops/ms    213.295   705.015  3.30
GatherOperationsBenchmark.microByteGather64                  64  thrpt  30  ops/ms   8705.935 44213.982  5.07
GatherOperationsBenchmark.microByteGather64                 256  thrpt  30  ops/ms   2186.620 11407.364  5.21
GatherOperationsBenchmark.microByteGather64                1024  thrpt  30  ops/ms    545.364  2845.370  5.21
GatherOperationsBenchmark.microByteGather64                4096  thrpt  30  ops/ms    136.376   718.532  5.26
GatherOperationsBenchmark.microByteGather64_MASK             64  thrpt  30  ops/ms   6530.636 42053.044  6.43
GatherOperationsBenchmark.microByteGather64_MASK            256  thrpt  30  ops/ms   1644.069 11323.223  6.88
GatherOperationsBenchmark.microByteGather64_MASK           1024  thrpt  30  ops/ms    416.093  2844.712  6.83
GatherOperationsBenchmark.microByteGather64_MASK           4096  thrpt  30  ops/ms    105.777   716.685  6.77
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF      64  thrpt  30  ops/ms   6619.260 42204.919  6.37
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     256  thrpt  30  ops/ms   1668.304 11318.298  6.78
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    1024  thrpt  30  ops/ms    422.085  2844.398  6.73
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    4096  thrpt  30  ops/ms    105.722   716.543  6.77
GatherOperationsBenchmark.microByteGather64_NZ_OFF           64  thrpt  30  ops/ms   8754.073 44232.985  5.05
GatherOperationsBenchmark.microByteGather64_NZ_OFF          256  thrpt  30  ops/ms   2195.009 11408.702  5.19
GatherOperationsBenchmark.microByteGather64_NZ_OFF         1024  thrpt  30  ops/ms    546.530  2845.369  5.20
GatherOperationsBenchmark.microByteGather64_NZ_OFF         4096  thrpt  30  ops/ms    137.713   718.391  5.21
GatherOperationsBenchmark.microShortGather128                64  thrpt  30  ops/ms   8695.558 33438.398  3.84
GatherOperationsBenchmark.microShortGather128               256  thrpt  30  ops/ms   2189.766  8533.643  3.89
GatherOperationsBenchmark.microShortGather128              1024  thrpt  30  ops/ms    546.322  2145.239  3.92
GatherOperationsBenchmark.microShortGather128              4096  thrpt  30  ops/ms    136.503   537.493  3.93
GatherOperationsBenchmark.microShortGather128_MASK           64  thrpt  30  ops/ms   6656.883 33571.619  5.04
GatherOperationsBenchmark.microShortGather128_MASK          256  thrpt  30  ops/ms   1649.233  8533.728  5.17
GatherOperationsBenchmark.microShortGather128_MASK         1024  thrpt  30  ops/ms    421.687  2135.280  5.06
GatherOperationsBenchmark.microShortGather128_MASK         4096  thrpt  30  ops/ms    105.355   537.418  5.10
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF    64  thrpt  30  ops/ms   6675.782 33441.402  5.00
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   256  thrpt  30  ops/ms   1681.000  8532.770  5.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  1024  thrpt  30  ops/ms    424.024  2135.485  5.03
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  4096  thrpt  30  ops/ms    106.507   537.674  5.04
GatherOperationsBenchmark.microShortGather128_NZ_OFF         64  thrpt  30  ops/ms   8796.279 33441.738  3.80
GatherOperationsBenchmark.microShortGather128_NZ_OFF        256  thrpt  30  ops/ms   2198.774  8562.333  3.89
GatherOperationsBenchmark.microShortGather128_NZ_OFF       1024  thrpt  30  ops/ms    546.991  2133.496  3.90
GatherOperationsBenchmark.microShortGather128_NZ_OFF       4096  thrpt  30  ops/ms    137.191   537.390  3.91
GatherOperationsBenchmark.microShortGather64                 64  thrpt  30  ops/ms   5286.569 38042.434  7.19
GatherOperationsBenchmark.microShortGather64                256  thrpt  30  ops/ms   1312.778  9755.474  7.43
GatherOperationsBenchmark.microShortGather64               1024  thrpt  30  ops/ms    327.475  2450.755  7.48
GatherOperationsBenchmark.microShortGather64               4096  thrpt  30  ops/ms     82.490   613.481  7.43
GatherOperationsBenchmark.microShortGather64_MASK            64  thrpt  30  ops/ms   3525.102 37622.086  10.67
GatherOperationsBenchmark.microShortGather64_MASK           256  thrpt  30  ops/ms    877.877  9740.673  11.09
GatherOperationsBenchmark.microShortGather64_MASK          1024  thrpt  30  ops/ms    219.688  2446.063  11.13
GatherOperationsBenchmark.microShortGather64_MASK          4096  thrpt  30  ops/ms     54.935   613.137  11.16
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF     64  thrpt  30  ops/ms   3509.264 35147.895  10.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    256  thrpt  30  ops/ms    880.523  9733.536  11.05
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   1024  thrpt  30  ops/ms    220.578  2465.951  11.17
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   4096  thrpt  30  ops/ms     55.790   620.465  11.12
GatherOperationsBenchmark.microShortGather64_NZ_OFF          64  thrpt  30  ops/ms   5271.218 35543.510  6.74
GatherOperationsBenchmark.microShortGather64_NZ_OFF         256  thrpt  30  ops/ms   1318.470  9735.321  7.38
GatherOperationsBenchmark.microShortGather64_NZ_OFF        1024  thrpt  30  ops/ms    328.695  2466.311  7.50
GatherOperationsBenchmark.microShortGather64_NZ_OFF        4096  thrpt  30  ops/ms     81.959   621.065  7.57



And here is the performance data on an X86 AVX-512 system, which shows improvements of up to 39%:


Benchmark                                                (SIZE)   Mode Cnt  Units    Before      After    Gain
GatherOperationsBenchmark.microByteGather128                 64  thrpt  30  ops/ms  44205.252  46829.437  1.05
GatherOperationsBenchmark.microByteGather128                256  thrpt  30  ops/ms  11243.202  12256.211  1.09
GatherOperationsBenchmark.microByteGather128               1024  thrpt  30  ops/ms   2824.094   3096.282  1.09
GatherOperationsBenchmark.microByteGather128               4096  thrpt  30  ops/ms    706.040    776.444  1.09
GatherOperationsBenchmark.microByteGather128_MASK            64  thrpt  30  ops/ms  46911.410  46321.310  0.98
GatherOperationsBenchmark.microByteGather128_MASK           256  thrpt  30  ops/ms  12850.712  12898.541  1.00
GatherOperationsBenchmark.microByteGather128_MASK          1024  thrpt  30  ops/ms   3099.038   3240.863  1.04
GatherOperationsBenchmark.microByteGather128_MASK          4096  thrpt  30  ops/ms    795.265    832.990  1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF     64  thrpt  30  ops/ms  43065.930  47164.936  1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    256  thrpt  30  ops/ms  11537.805  13190.759  1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   1024  thrpt  30  ops/ms   2763.036   3304.582  1.19
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   4096  thrpt  30  ops/ms    722.374    843.458  1.16
GatherOperationsBenchmark.microByteGather128_NZ_OFF          64  thrpt  30  ops/ms  44145.297  46845.845  1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF         256  thrpt  30  ops/ms  12172.421  12241.941  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        1024  thrpt  30  ops/ms   3097.042   3100.228  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        4096  thrpt  30  ops/ms    776.453    775.881  0.99
GatherOperationsBenchmark.microByteGather64                  64  thrpt  30  ops/ms  58541.178  59464.156  1.01
GatherOperationsBenchmark.microByteGather64                 256  thrpt  30  ops/ms  16063.284  17360.858  1.08
GatherOperationsBenchmark.microByteGather64                1024  thrpt  30  ops/ms   4126.798   4471.636  1.08
GatherOperationsBenchmark.microByteGather64                4096  thrpt  30  ops/ms   1045.116   1125.219  1.07
GatherOperationsBenchmark.microByteGather64_MASK             64  thrpt  30  ops/ms  35344.320  49062.831  1.38
GatherOperationsBenchmark.microByteGather64_MASK            256  thrpt  30  ops/ms  11946.622  13550.297  1.13
GatherOperationsBenchmark.microByteGather64_MASK           1024  thrpt  30  ops/ms   3275.053   3359.737  1.02
GatherOperationsBenchmark.microByteGather64_MASK           4096  thrpt  30  ops/ms    844.575    858.487  1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF      64  thrpt  30  ops/ms  43550.522  48875.831  1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     256  thrpt  30  ops/ms  12216.995  13522.420  1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    1024  thrpt  30  ops/ms   3053.068   3391.067  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    4096  thrpt  30  ops/ms    753.042    869.774  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF           64  thrpt  30  ops/ms  52082.307  58847.230  1.12
GatherOperationsBenchmark.microByteGather64_NZ_OFF          256  thrpt  30  ops/ms  14210.930  17389.898  1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF         1024  thrpt  30  ops/ms   3697.996   4476.988  1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF         4096  thrpt  30  ops/ms    921.524   1125.308  1.22
GatherOperationsBenchmark.microShortGather128                64  thrpt  30  ops/ms  44325.212  44843.853  1.01
GatherOperationsBenchmark.microShortGather128               256  thrpt  30  ops/ms  11675.510  12630.103  1.08
GatherOperationsBenchmark.microShortGather128              1024  thrpt  30  ops/ms   1260.004   1373.395  1.09
GatherOperationsBenchmark.microShortGather128              4096  thrpt  30  ops/ms    761.857    814.790  1.06
GatherOperationsBenchmark.microShortGather128_MASK           64  thrpt  30  ops/ms  36339.450  36951.803  1.01
GatherOperationsBenchmark.microShortGather128_MASK          256  thrpt  30  ops/ms   9843.842  10018.754  1.01
GatherOperationsBenchmark.microShortGather128_MASK         1024  thrpt  30  ops/ms   2515.702   2595.312  1.03
GatherOperationsBenchmark.microShortGather128_MASK         4096  thrpt  30  ops/ms    616.450    661.402  1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF    64  thrpt  30  ops/ms  34078.747  33712.577  0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   256  thrpt  30  ops/ms   9018.316   8515.947  0.94
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  1024  thrpt  30  ops/ms   2250.813   2595.847  1.15
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  4096  thrpt  30  ops/ms    563.182    659.087  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF         64  thrpt  30  ops/ms  39909.543  44063.331  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF        256  thrpt  30  ops/ms  10690.582  12437.166  1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF       1024  thrpt  30  ops/ms   2677.219   3151.078  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF       4096  thrpt  30  ops/ms    681.705    802.929  1.17
GatherOperationsBenchmark.microShortGather64                 64  thrpt  30  ops/ms  45836.789  50883.505  1.11
GatherOperationsBenchmark.microShortGather64                256  thrpt  30  ops/ms  12269.355  13614.567  1.10
GatherOperationsBenchmark.microShortGather64               1024  thrpt  30  ops/ms   3010.548   3437.973  1.14
GatherOperationsBenchmark.microShortGather64               4096  thrpt  30  ops/ms    734.634    899.070  1.22
GatherOperationsBenchmark.microShortGather64_MASK            64  thrpt  30  ops/ms  39753.487  39319.742  0.98
GatherOperationsBenchmark.microShortGather64_MASK           256  thrpt  30  ops/ms  10615.540  10648.996  1.00
GatherOperationsBenchmark.microShortGather64_MASK          1024  thrpt  30  ops/ms   2653.485   2782.477  1.04
GatherOperationsBenchmark.microShortGather64_MASK          4096  thrpt  30  ops/ms    678.165    686.024  1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF     64  thrpt  30  ops/ms  37742.593  40491.965  1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    256  thrpt  30  ops/ms  10096.251  11036.785  1.09
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   1024  thrpt  30  ops/ms   2526.374   2812.550  1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   4096  thrpt  30  ops/ms    642.484    656.152  1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF          64  thrpt  30  ops/ms  40602.930  50921.048  1.25
GatherOperationsBenchmark.microShortGather64_NZ_OFF         256  thrpt  30  ops/ms  10972.083  14151.666  1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF        1024  thrpt  30  ops/ms   2726.248   3662.293  1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF        4096  thrpt  30  ops/ms    670.735    933.299  1.39
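
For readers checking the tables, the `Gain` column appears to be the `After` throughput divided by the `Before` throughput, rounded to two decimals. The following sketch recomputes a few rows on that assumption; `GainSketch` and `gain` are hypothetical names used only here.

```java
// Hypothetical check of the Gain column, assuming Gain = After / Before (higher throughput is better).
final class GainSketch {
    static double gain(double before, double after) {
        // Round to two decimals, as in the tables.
        return Math.round(after / before * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // SVE: microByteGather128, SIZE = 64
        System.out.println(gain(13447.414, 43184.611));
        // SVE: microShortGather64_MASK, SIZE = 4096
        System.out.println(gain(54.935, 613.137));
        // X86: microShortGather64_NZ_OFF, SIZE = 4096
        System.out.println(gain(670.735, 933.299));
    }
}
```

Under this assumption the three rows give 3.21, 11.16 and 1.39, matching the tables; the last value corresponds to the 39% upper bound quoted for X86.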


[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en

-------------

Commit messages:
 - 8351623: VectorAPI: Refactor subword gather load and add SVE implementation

Changes: https://git.openjdk.org/jdk/pull/24679/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24679&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8351623
  Stats: 1367 lines in 34 files changed: 915 ins; 180 del; 272 mod
  Patch: https://git.openjdk.org/jdk/pull/24679.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24679/head:pull/24679

PR: https://git.openjdk.org/jdk/pull/24679

