RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API

Xiaohong Gong xgong at openjdk.org
Wed Jun 25 09:16:48 UTC 2025


On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]).
> 
> Two key areas require improvement:
> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance.
> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance.
> 
> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures.
> 
> Main changes:
> 1. Java-side API refactoring:
>    - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on
>      architectures like AArch64.
> 2. C2 compiler IR refactoring:
>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types.
> 3. Backend changes:
>    - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level.
> 
> Performance:
> The performance of the relevant JMH benchmarks improves by up to 27% on an X86 AVX512 system. Please see the data below:
> 
> Benchmark                                                 Mode   Cnt Unit    SIZE    Before      After    Gain
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  64    53682.012   52650.325  0.98
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  256   14484.252   14255.156  0.98
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  1024   3664.900    3595.615  0.98
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  4096    908.312     935.269  1.02
> GatherOperationsBenchmark.micr...

Hi, the counted loop recognizer patch mentioned above has been merged, so I've rebased this PR onto the latest jdk master. Here is the new performance data for the subword gather JMH benchmarks on X86:

Benchmark                                                 SIZE Mode   Cnt Unit    Before      After    Gain
GatherOperationsBenchmark.microByteGather128                64 thrpt  30  ops/ms 44221.691  46837.124  1.05
GatherOperationsBenchmark.microByteGather128               256 thrpt  30  ops/ms 11245.455  12243.045  1.08
GatherOperationsBenchmark.microByteGather128              1024 thrpt  30  ops/ms  2825.246   3096.460  1.09
GatherOperationsBenchmark.microByteGather128              4096 thrpt  30  ops/ms   705.927    775.039  1.09
GatherOperationsBenchmark.microByteGather128_MASK           64 thrpt  30  ops/ms 46783.479  46357.684  0.99
GatherOperationsBenchmark.microByteGather128_MASK          256 thrpt  30  ops/ms 12810.405  12880.347  1.00
GatherOperationsBenchmark.microByteGather128_MASK         1024 thrpt  30  ops/ms  3150.320   3239.281  1.02
GatherOperationsBenchmark.microByteGather128_MASK         4096 thrpt  30  ops/ms   794.151    830.464  1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    64 thrpt  30  ops/ms 43189.395  47127.449  1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   256 thrpt  30  ops/ms 11543.128  13196.158  1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  1024 thrpt  30  ops/ms  2835.053   3300.357  1.16
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  4096 thrpt  30  ops/ms   719.470    843.290  1.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF         64 thrpt  30  ops/ms 44143.887  46836.788  1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF        256 thrpt  30  ops/ms 12206.908  12255.677  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF       1024 thrpt  30  ops/ms  3094.232   3095.931  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF       4096 thrpt  30  ops/ms   776.293    774.336  0.99
GatherOperationsBenchmark.microByteGather256                64 thrpt  30  ops/ms 46247.977  46803.899  1.01
GatherOperationsBenchmark.microByteGather256               256 thrpt  30  ops/ms 12198.878  12250.315  1.00
GatherOperationsBenchmark.microByteGather256              1024 thrpt  30  ops/ms  3093.356   3100.107  1.00
GatherOperationsBenchmark.microByteGather256              4096 thrpt  30  ops/ms   774.611    774.890  1.00
GatherOperationsBenchmark.microByteGather256_MASK           64 thrpt  30  ops/ms 46873.725  47967.422  1.02
GatherOperationsBenchmark.microByteGather256_MASK          256 thrpt  30  ops/ms 13025.578  13481.477  1.03
GatherOperationsBenchmark.microByteGather256_MASK         1024 thrpt  30  ops/ms  3317.651   3396.208  1.02
GatherOperationsBenchmark.microByteGather256_MASK         4096 thrpt  30  ops/ms  846.0888   864.8407  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF    64 thrpt  30  ops/ms 44488.365  48769.036  1.09
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   256 thrpt  30  ops/ms 11988.552  13326.306  1.11
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  1024 thrpt  30  ops/ms  2851.132   3377.599  1.18
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  4096 thrpt  30  ops/ms   734.368    872.331  1.18
GatherOperationsBenchmark.microByteGather256_NZ_OFF         64 thrpt  30  ops/ms 44716.846  46816.743  1.04
GatherOperationsBenchmark.microByteGather256_NZ_OFF        256 thrpt  30  ops/ms 11885.251  12255.916  1.03
GatherOperationsBenchmark.microByteGather256_NZ_OFF       1024 thrpt  30  ops/ms  3016.645   3096.172  1.02
GatherOperationsBenchmark.microByteGather256_NZ_OFF       4096 thrpt  30  ops/ms   756.903    776.363  1.02
GatherOperationsBenchmark.microByteGather512                64 thrpt  30  ops/ms 44742.221  46848.590  1.04
GatherOperationsBenchmark.microByteGather512               256 thrpt  30  ops/ms 12081.443  12236.973  1.01
GatherOperationsBenchmark.microByteGather512              1024 thrpt  30  ops/ms  3086.873   3088.040  1.00
GatherOperationsBenchmark.microByteGather512              4096 thrpt  30  ops/ms   774.243    770.209  0.99
GatherOperationsBenchmark.microByteGather512_MASK           64 thrpt  30  ops/ms 50588.210  48220.741  0.95
GatherOperationsBenchmark.microByteGather512_MASK          256 thrpt  30  ops/ms 13535.785  13675.499  1.01
GatherOperationsBenchmark.microByteGather512_MASK         1024 thrpt  30  ops/ms  3355.724   3421.323  1.01
GatherOperationsBenchmark.microByteGather512_MASK         4096 thrpt  30  ops/ms   859.103    872.009  1.01
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF    64 thrpt  30  ops/ms 44139.269  48320.364  1.09
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   256 thrpt  30  ops/ms 12500.697  13801.124  1.10
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  1024 thrpt  30  ops/ms  3135.082   3492.312  1.11
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  4096 thrpt  30  ops/ms   794.338    897.249  1.12
GatherOperationsBenchmark.microByteGather512_NZ_OFF         64 thrpt  30  ops/ms 45754.147  46421.300  1.01
GatherOperationsBenchmark.microByteGather512_NZ_OFF        256 thrpt  30  ops/ms 12133.467  12253.848  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF       1024 thrpt  30  ops/ms  3074.637   3091.207  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF       4096 thrpt  30  ops/ms   755.250    774.367  1.02
GatherOperationsBenchmark.microByteGather64                 64 thrpt  30  ops/ms 58625.196  59263.141  1.01
GatherOperationsBenchmark.microByteGather64                256 thrpt  30  ops/ms 15745.329  17377.889  1.10
GatherOperationsBenchmark.microByteGather64               1024 thrpt  30  ops/ms  4121.997   4471.261  1.08
GatherOperationsBenchmark.microByteGather64               4096 thrpt  30  ops/ms  1044.419   1125.721  1.07
GatherOperationsBenchmark.microByteGather64_MASK            64 thrpt  30  ops/ms 48754.131  49028.183  1.00
GatherOperationsBenchmark.microByteGather64_MASK           256 thrpt  30  ops/ms 13248.349  13537.811  1.02
GatherOperationsBenchmark.microByteGather64_MASK          1024 thrpt  30  ops/ms  3308.839   3356.109  1.01
GatherOperationsBenchmark.microByteGather64_MASK          4096 thrpt  30  ops/ms   843.688    859.161  1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     64 thrpt  30  ops/ms 43523.662  48868.373  1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    256 thrpt  30  ops/ms 12242.984  13519.719  1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   1024 thrpt  30  ops/ms  3055.772   3394.342  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   4096 thrpt  30  ops/ms   754.532    870.302  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF          64 thrpt  30  ops/ms 51858.935  58869.325  1.13
GatherOperationsBenchmark.microByteGather64_NZ_OFF         256 thrpt  30  ops/ms 14235.928  17381.117  1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF        1024 thrpt  30  ops/ms  3684.506   4483.270  1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF        4096 thrpt  30  ops/ms   922.368    1127.66  1.22
GatherOperationsBenchmark.microShortGather128               64 thrpt  30  ops/ms 44399.870  45016.972  1.01
GatherOperationsBenchmark.microShortGather128              256 thrpt  30  ops/ms 11679.775  12629.207  1.08
GatherOperationsBenchmark.microShortGather128             1024 thrpt  30  ops/ms  1277.328   3206.762  2.51
GatherOperationsBenchmark.microShortGather128             4096 thrpt  30  ops/ms   761.846    817.159  1.07
GatherOperationsBenchmark.microShortGather128_MASK          64 thrpt  30  ops/ms 37165.399  36484.534  0.98
GatherOperationsBenchmark.microShortGather128_MASK         256 thrpt  30  ops/ms  9875.757   9958.754  1.00
GatherOperationsBenchmark.microShortGather128_MASK        1024 thrpt  30  ops/ms  2519.580   2554.210  1.01
GatherOperationsBenchmark.microShortGather128_MASK        4096 thrpt  30  ops/ms   615.867    652.092  1.05
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   64 thrpt  30  ops/ms 34049.203  33669.772  0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  256 thrpt  30  ops/ms  9010.587   8779.455  0.97
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt  30  ops/ms  2253.432   2415.560  1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt  30  ops/ms   559.163    577.659  1.03
GatherOperationsBenchmark.microShortGather128_NZ_OFF        64 thrpt  30  ops/ms 39892.023  43978.899  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF       256 thrpt  30  ops/ms 10697.817  12424.189  1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF      1024 thrpt  30  ops/ms  2681.286   3145.941  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF      4096 thrpt  30  ops/ms   682.330    803.364  1.17
GatherOperationsBenchmark.microShortGather256               64 thrpt  30  ops/ms 42335.033  43194.212  1.02
GatherOperationsBenchmark.microShortGather256              256 thrpt  30  ops/ms 10760.015  11149.020  1.03
GatherOperationsBenchmark.microShortGather256             1024 thrpt  30  ops/ms  2688.410   2806.389  1.04
GatherOperationsBenchmark.microShortGather256             4096 thrpt  30  ops/ms   675.401    703.849  1.04
GatherOperationsBenchmark.microShortGather256_MASK          64 thrpt  30  ops/ms 38760.990  41844.197  1.07
GatherOperationsBenchmark.microShortGather256_MASK         256 thrpt  30  ops/ms 11339.217  10951.141  0.96
GatherOperationsBenchmark.microShortGather256_MASK        1024 thrpt  30  ops/ms  2840.081   2718.823  0.95
GatherOperationsBenchmark.microShortGather256_MASK        4096 thrpt  30  ops/ms   725.334    696.343  0.96
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF   64 thrpt  30  ops/ms 39059.271  42199.055  1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  256 thrpt  30  ops/ms 10440.036  11467.941  1.09
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 1024 thrpt  30  ops/ms  2563.378   2790.541  1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 4096 thrpt  30  ops/ms   642.642    751.287  1.16
GatherOperationsBenchmark.microShortGather256_NZ_OFF        64 thrpt  30  ops/ms 38963.881  42675.099  1.09
GatherOperationsBenchmark.microShortGather256_NZ_OFF       256 thrpt  30  ops/ms 10628.469  11168.949  1.05
GatherOperationsBenchmark.microShortGather256_NZ_OFF      1024 thrpt  30  ops/ms  2702.591   2806.074  1.03
GatherOperationsBenchmark.microShortGather256_NZ_OFF      4096 thrpt  30  ops/ms   683.690    704.498  1.03
GatherOperationsBenchmark.microShortGather512               64 thrpt  30  ops/ms 41117.094  41269.397  1.00
GatherOperationsBenchmark.microShortGather512              256 thrpt  30  ops/ms 10565.519  10652.618  1.00
GatherOperationsBenchmark.microShortGather512             1024 thrpt  30  ops/ms  2681.894   2705.963  1.00
GatherOperationsBenchmark.microShortGather512             4096 thrpt  30  ops/ms   673.821    679.631  1.00
GatherOperationsBenchmark.microShortGather512_MASK          64 thrpt  30  ops/ms 41318.510  42372.271  1.02
GatherOperationsBenchmark.microShortGather512_MASK         256 thrpt  30  ops/ms 11587.465  10674.598  0.92
GatherOperationsBenchmark.microShortGather512_MASK        1024 thrpt  30  ops/ms  2902.731   2629.739  0.90
GatherOperationsBenchmark.microShortGather512_MASK        4096 thrpt  30  ops/ms   741.546    671.124  0.90
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF   64 thrpt  30  ops/ms 39524.127  40623.622  1.02
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  256 thrpt  30  ops/ms 10642.152  11392.025  1.07
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 1024 thrpt  30  ops/ms  2650.143   2819.185  1.06
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 4096 thrpt  30  ops/ms   672.674    739.882  1.09
GatherOperationsBenchmark.microShortGather512_NZ_OFF        64 thrpt  30  ops/ms 39861.745  41600.729  1.04
GatherOperationsBenchmark.microShortGather512_NZ_OFF       256 thrpt  30  ops/ms 10531.312  10586.255  1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF      1024 thrpt  30  ops/ms  2667.839   2678.026  1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF      4096 thrpt  30  ops/ms   667.607    677.434  1.01
GatherOperationsBenchmark.microShortGather64                64 thrpt  30  ops/ms 45716.109  50726.590  1.10
GatherOperationsBenchmark.microShortGather64               256 thrpt  30  ops/ms 12383.842  13608.216  1.09
GatherOperationsBenchmark.microShortGather64              1024 thrpt  30  ops/ms  3025.989   3443.097  1.13
GatherOperationsBenchmark.microShortGather64              4096 thrpt  30  ops/ms   771.995    897.890  1.16
GatherOperationsBenchmark.microShortGather64_MASK           64 thrpt  30  ops/ms 39758.975  39155.984  0.98
GatherOperationsBenchmark.microShortGather64_MASK          256 thrpt  30  ops/ms 10594.260  10622.428  1.00
GatherOperationsBenchmark.microShortGather64_MASK         1024 thrpt  30  ops/ms  2654.849   2771.674  1.04
GatherOperationsBenchmark.microShortGather64_MASK         4096 thrpt  30  ops/ms   677.508    684.557  1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    64 thrpt  30  ops/ms 37729.191  40552.172  1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   256 thrpt  30  ops/ms 10087.184  11121.611  1.10
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  1024 thrpt  30  ops/ms  2510.133   2788.778  1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  4096 thrpt  30  ops/ms   642.370    658.808  1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF         64 thrpt  30  ops/ms 40632.099  50718.706  1.24
GatherOperationsBenchmark.microShortGather64_NZ_OFF        256 thrpt  30  ops/ms 10984.671  14155.624  1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF       1024 thrpt  30  ops/ms  2733.285   3668.118  1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF       4096 thrpt  30  ops/ms   679.524    932.748  1.37
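
For readers less familiar with the operation being measured, here is a minimal scalar sketch in Java of what a subword (byte) gather load computes. It follows the index mapping documented for the Vector API gather loads, where lane `i` is loaded from `a[offset + indexMap[mapOffset + i]]`. The class and method names (`GatherSketch`, `gather`, `gatherMasked`) are hypothetical and are not part of the patch or the benchmark. Reading `_MASK` as the masked variant and `_NZ_OFF` as a run with a non-zero `offset` is my assumption from the benchmark names.

```java
import java.util.Arrays;

// Hypothetical scalar reference for a byte gather load; for illustration only.
public class GatherSketch {

    // Unmasked gather: lane i is loaded from a[offset + indexMap[mapOffset + i]].
    public static byte[] gather(byte[] a, int offset, int[] indexMap, int mapOffset, int lanes) {
        byte[] r = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            r[i] = a[offset + indexMap[mapOffset + i]];
        }
        return r;
    }

    // Masked gather: lanes whose mask bit is unset are not loaded and stay zero.
    public static byte[] gatherMasked(byte[] a, int offset, int[] indexMap, int mapOffset,
                                      boolean[] mask) {
        byte[] r = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                r[i] = a[offset + indexMap[mapOffset + i]];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] a = {10, 11, 12, 13, 14, 15, 16, 17};
        int[] indexMap = {3, 0, 2, 1};
        boolean[] mask = {true, false, true, false};

        // Non-zero offset (assumed to correspond to the _NZ_OFF variants).
        System.out.println(Arrays.toString(gather(a, 4, indexMap, 0, 4)));
        // Masked load with a zero offset (assumed to correspond to the _MASK variants).
        System.out.println(Arrays.toString(gatherMasked(a, 0, indexMap, 0, mask)));
    }
}
```

In this scalar form the `offset` is added on every element access. The refactoring described above instead folds it into the memory base address once at the IR level.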

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3004026787

