RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API
Xiaohong Gong
xgong at openjdk.org
Wed Jun 25 09:16:48 UTC 2025
On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]).
>
> Two key areas require improvement:
> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance.
> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance.
>
> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures.
>
> Main changes:
> 1. Java-side API refactoring:
> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on architectures like AArch64.
> 2. C2 compiler IR refactoring:
> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types.
> 3. Backend changes:
> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level.
>
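To make the IR change in item 2 concrete, here is a minimal scalar sketch in plain Java. It is not the C2 or backend code; the class and method names are hypothetical and exist only for illustration. It models a byte gather in two ways: once with `offset` added to every lane index, and once with `offset` folded into the base a single time, so each lane uses its raw index. The parameter names follow the gather form of `ByteVector.fromArray` (array, offset, index map, map offset). Using a `ByteBuffer` slice to stand in for the adjusted base `addr` is an assumption of this sketch.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class GatherSketch {
    // Per-lane form: every lane adds `offset` to its index before loading.
    static byte[] gatherPerLaneAdd(byte[] a, int offset, int[] indexMap,
                                   int mapOffset, int lanes) {
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = a[offset + indexMap[mapOffset + i]];
        }
        return result;
    }

    // Folded form: `offset` is applied once to the base; lanes use raw indices.
    // The slice models a base address that already includes the offset.
    static byte[] gatherFoldedBase(byte[] a, int offset, int[] indexMap,
                                   int mapOffset, int lanes) {
        ByteBuffer base = ByteBuffer.wrap(a, offset, a.length - offset).slice();
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = base.get(indexMap[mapOffset + i]);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = new byte[64];
        for (int i = 0; i < a.length; i++) {
            a[i] = (byte) i;
        }
        int[] indexMap = {5, 0, 3, 7, 1, 2, 6, 4};
        byte[] x = gatherPerLaneAdd(a, 10, indexMap, 0, 8);
        byte[] y = gatherFoldedBase(a, 10, indexMap, 0, 8);
        // Both forms read the same elements.
        System.out.println(Arrays.equals(x, y));
    }
}
```

Both forms return the same lanes. The second needs one address adjustment per gather rather than one add per lane, which is the saving the description attributes to removing the `offset` input.
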
> Performance:
> The performance of the relevant JMH benchmarks improves by up to 27% on an X86 AVX512 system. Please see the data below:
>
> Benchmark Mode Cnt Unit SIZE Before After Gain
> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98
> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98
> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98
> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.312 935.269 1.02
> GatherOperationsBenchmark.micr...
Hi, the counted loop recognizer patch mentioned above has been merged, so I've rebased this PR onto the latest jdk master. The following is the new performance data for the subword gather JMH benchmarks on X86:
Benchmark SIZE Mode Cnt Unit Before After Gain
GatherOperationsBenchmark.microByteGather128 64 thrpt 30 ops/ms 44221.691 46837.124 1.05
GatherOperationsBenchmark.microByteGather128 256 thrpt 30 ops/ms 11245.455 12243.045 1.08
GatherOperationsBenchmark.microByteGather128 1024 thrpt 30 ops/ms 2825.246 3096.460 1.09
GatherOperationsBenchmark.microByteGather128 4096 thrpt 30 ops/ms 705.927 775.039 1.09
GatherOperationsBenchmark.microByteGather128_MASK 64 thrpt 30 ops/ms 46783.479 46357.684 0.99
GatherOperationsBenchmark.microByteGather128_MASK 256 thrpt 30 ops/ms 12810.405 12880.347 1.00
GatherOperationsBenchmark.microByteGather128_MASK 1024 thrpt 30 ops/ms 3150.320 3239.281 1.02
GatherOperationsBenchmark.microByteGather128_MASK 4096 thrpt 30 ops/ms 794.151 830.464 1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 43189.395 47127.449 1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 11543.128 13196.158 1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2835.053 3300.357 1.16
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 719.470 843.290 1.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF 64 thrpt 30 ops/ms 44143.887 46836.788 1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF 256 thrpt 30 ops/ms 12206.908 12255.677 1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF 1024 thrpt 30 ops/ms 3094.232 3095.931 1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF 4096 thrpt 30 ops/ms 776.293 774.336 0.99
GatherOperationsBenchmark.microByteGather256 64 thrpt 30 ops/ms 46247.977 46803.899 1.01
GatherOperationsBenchmark.microByteGather256 256 thrpt 30 ops/ms 12198.878 12250.315 1.00
GatherOperationsBenchmark.microByteGather256 1024 thrpt 30 ops/ms 3093.356 3100.107 1.00
GatherOperationsBenchmark.microByteGather256 4096 thrpt 30 ops/ms 774.611 774.890 1.00
GatherOperationsBenchmark.microByteGather256_MASK 64 thrpt 30 ops/ms 46873.725 47967.422 1.02
GatherOperationsBenchmark.microByteGather256_MASK 256 thrpt 30 ops/ms 13025.578 13481.477 1.03
GatherOperationsBenchmark.microByteGather256_MASK 1024 thrpt 30 ops/ms 3317.651 3396.208 1.02
GatherOperationsBenchmark.microByteGather256_MASK 4096 thrpt 30 ops/ms 846.0888 864.8407 1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF 64 thrpt 30 ops/ms 44488.365 48769.036 1.09
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF 256 thrpt 30 ops/ms 11988.552 13326.306 1.11
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2851.132 3377.599 1.18
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF 4096 thrpt 30 ops/ms 734.368 872.331 1.18
GatherOperationsBenchmark.microByteGather256_NZ_OFF 64 thrpt 30 ops/ms 44716.846 46816.743 1.04
GatherOperationsBenchmark.microByteGather256_NZ_OFF 256 thrpt 30 ops/ms 11885.251 12255.916 1.03
GatherOperationsBenchmark.microByteGather256_NZ_OFF 1024 thrpt 30 ops/ms 3016.645 3096.172 1.02
GatherOperationsBenchmark.microByteGather256_NZ_OFF 4096 thrpt 30 ops/ms 756.903 776.363 1.02
GatherOperationsBenchmark.microByteGather512 64 thrpt 30 ops/ms 44742.221 46848.590 1.04
GatherOperationsBenchmark.microByteGather512 256 thrpt 30 ops/ms 12081.443 12236.973 1.01
GatherOperationsBenchmark.microByteGather512 1024 thrpt 30 ops/ms 3086.873 3088.040 1.00
GatherOperationsBenchmark.microByteGather512 4096 thrpt 30 ops/ms 774.243 770.209 0.99
GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95
GatherOperationsBenchmark.microByteGather512_MASK 256 thrpt 30 ops/ms 13535.785 13675.499 1.01
GatherOperationsBenchmark.microByteGather512_MASK 1024 thrpt 30 ops/ms 3355.724 3421.323 1.01
GatherOperationsBenchmark.microByteGather512_MASK 4096 thrpt 30 ops/ms 859.103 872.009 1.01
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF 64 thrpt 30 ops/ms 44139.269 48320.364 1.09
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF 256 thrpt 30 ops/ms 12500.697 13801.124 1.10
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF 1024 thrpt 30 ops/ms 3135.082 3492.312 1.11
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF 4096 thrpt 30 ops/ms 794.338 897.249 1.12
GatherOperationsBenchmark.microByteGather512_NZ_OFF 64 thrpt 30 ops/ms 45754.147 46421.300 1.01
GatherOperationsBenchmark.microByteGather512_NZ_OFF 256 thrpt 30 ops/ms 12133.467 12253.848 1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF 1024 thrpt 30 ops/ms 3074.637 3091.207 1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF 4096 thrpt 30 ops/ms 755.250 774.367 1.02
GatherOperationsBenchmark.microByteGather64 64 thrpt 30 ops/ms 58625.196 59263.141 1.01
GatherOperationsBenchmark.microByteGather64 256 thrpt 30 ops/ms 15745.329 17377.889 1.10
GatherOperationsBenchmark.microByteGather64 1024 thrpt 30 ops/ms 4121.997 4471.261 1.08
GatherOperationsBenchmark.microByteGather64 4096 thrpt 30 ops/ms 1044.419 1125.721 1.07
GatherOperationsBenchmark.microByteGather64_MASK 64 thrpt 30 ops/ms 48754.131 49028.183 1.00
GatherOperationsBenchmark.microByteGather64_MASK 256 thrpt 30 ops/ms 13248.349 13537.811 1.02
GatherOperationsBenchmark.microByteGather64_MASK 1024 thrpt 30 ops/ms 3308.839 3356.109 1.01
GatherOperationsBenchmark.microByteGather64_MASK 4096 thrpt 30 ops/ms 843.688 859.161 1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 43523.662 48868.373 1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 12242.984 13519.719 1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 3055.772 3394.342 1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 754.532 870.302 1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF 64 thrpt 30 ops/ms 51858.935 58869.325 1.13
GatherOperationsBenchmark.microByteGather64_NZ_OFF 256 thrpt 30 ops/ms 14235.928 17381.117 1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF 1024 thrpt 30 ops/ms 3684.506 4483.270 1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF 4096 thrpt 30 ops/ms 922.368 1127.66 1.22
GatherOperationsBenchmark.microShortGather128 64 thrpt 30 ops/ms 44399.870 45016.972 1.01
GatherOperationsBenchmark.microShortGather128 256 thrpt 30 ops/ms 11679.775 12629.207 1.08
GatherOperationsBenchmark.microShortGather128 1024 thrpt 30 ops/ms 1277.328 3206.762 2.51
GatherOperationsBenchmark.microShortGather128 4096 thrpt 30 ops/ms 761.846 817.159 1.07
GatherOperationsBenchmark.microShortGather128_MASK 64 thrpt 30 ops/ms 37165.399 36484.534 0.98
GatherOperationsBenchmark.microShortGather128_MASK 256 thrpt 30 ops/ms 9875.757 9958.754 1.00
GatherOperationsBenchmark.microShortGather128_MASK 1024 thrpt 30 ops/ms 2519.580 2554.210 1.01
GatherOperationsBenchmark.microShortGather128_MASK 4096 thrpt 30 ops/ms 615.867 652.092 1.05
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 34049.203 33669.772 0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 9010.587 8779.455 0.97
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2253.432 2415.560 1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 559.163 577.659 1.03
GatherOperationsBenchmark.microShortGather128_NZ_OFF 64 thrpt 30 ops/ms 39892.023 43978.899 1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF 256 thrpt 30 ops/ms 10697.817 12424.189 1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF 1024 thrpt 30 ops/ms 2681.286 3145.941 1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF 4096 thrpt 30 ops/ms 682.330 803.364 1.17
GatherOperationsBenchmark.microShortGather256 64 thrpt 30 ops/ms 42335.033 43194.212 1.02
GatherOperationsBenchmark.microShortGather256 256 thrpt 30 ops/ms 10760.015 11149.020 1.03
GatherOperationsBenchmark.microShortGather256 1024 thrpt 30 ops/ms 2688.410 2806.389 1.04
GatherOperationsBenchmark.microShortGather256 4096 thrpt 30 ops/ms 675.401 703.849 1.04
GatherOperationsBenchmark.microShortGather256_MASK 64 thrpt 30 ops/ms 38760.990 41844.197 1.07
GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96
GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95
GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 64 thrpt 30 ops/ms 39059.271 42199.055 1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 256 thrpt 30 ops/ms 10440.036 11467.941 1.09
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2563.378 2790.541 1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 4096 thrpt 30 ops/ms 642.642 751.287 1.16
GatherOperationsBenchmark.microShortGather256_NZ_OFF 64 thrpt 30 ops/ms 38963.881 42675.099 1.09
GatherOperationsBenchmark.microShortGather256_NZ_OFF 256 thrpt 30 ops/ms 10628.469 11168.949 1.05
GatherOperationsBenchmark.microShortGather256_NZ_OFF 1024 thrpt 30 ops/ms 2702.591 2806.074 1.03
GatherOperationsBenchmark.microShortGather256_NZ_OFF 4096 thrpt 30 ops/ms 683.690 704.498 1.03
GatherOperationsBenchmark.microShortGather512 64 thrpt 30 ops/ms 41117.094 41269.397 1.00
GatherOperationsBenchmark.microShortGather512 256 thrpt 30 ops/ms 10565.519 10652.618 1.00
GatherOperationsBenchmark.microShortGather512 1024 thrpt 30 ops/ms 2681.894 2705.963 1.00
GatherOperationsBenchmark.microShortGather512 4096 thrpt 30 ops/ms 673.821 679.631 1.00
GatherOperationsBenchmark.microShortGather512_MASK 64 thrpt 30 ops/ms 41318.510 42372.271 1.02
GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92
GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90
GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 64 thrpt 30 ops/ms 39524.127 40623.622 1.02
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 256 thrpt 30 ops/ms 10642.152 11392.025 1.07
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2650.143 2819.185 1.06
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 4096 thrpt 30 ops/ms 672.674 739.882 1.09
GatherOperationsBenchmark.microShortGather512_NZ_OFF 64 thrpt 30 ops/ms 39861.745 41600.729 1.04
GatherOperationsBenchmark.microShortGather512_NZ_OFF 256 thrpt 30 ops/ms 10531.312 10586.255 1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF 1024 thrpt 30 ops/ms 2667.839 2678.026 1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF 4096 thrpt 30 ops/ms 667.607 677.434 1.01
GatherOperationsBenchmark.microShortGather64 64 thrpt 30 ops/ms 45716.109 50726.590 1.10
GatherOperationsBenchmark.microShortGather64 256 thrpt 30 ops/ms 12383.842 13608.216 1.09
GatherOperationsBenchmark.microShortGather64 1024 thrpt 30 ops/ms 3025.989 3443.097 1.13
GatherOperationsBenchmark.microShortGather64 4096 thrpt 30 ops/ms 771.995 897.890 1.16
GatherOperationsBenchmark.microShortGather64_MASK 64 thrpt 30 ops/ms 39758.975 39155.984 0.98
GatherOperationsBenchmark.microShortGather64_MASK 256 thrpt 30 ops/ms 10594.260 10622.428 1.00
GatherOperationsBenchmark.microShortGather64_MASK 1024 thrpt 30 ops/ms 2654.849 2771.674 1.04
GatherOperationsBenchmark.microShortGather64_MASK 4096 thrpt 30 ops/ms 677.508 684.557 1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 37729.191 40552.172 1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 10087.184 11121.611 1.10
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 2510.133 2788.778 1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 642.370 658.808 1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF 64 thrpt 30 ops/ms 40632.099 50718.706 1.24
GatherOperationsBenchmark.microShortGather64_NZ_OFF 256 thrpt 30 ops/ms 10984.671 14155.624 1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF 1024 thrpt 30 ops/ms 2733.285 3668.118 1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF 4096 thrpt 30 ops/ms 679.524 932.748 1.37
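
For reading the table: the scores are throughput in ops/ms, so higher is better, and a Gain above 1.00 is an improvement. The Gain values appear consistent with After divided by Before, truncated (not rounded) to two decimals. That is an inference from the numbers, not something stated by the benchmark tooling. A minimal sketch of the calculation follows; the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class GainSketch {
    // Gain = After / Before, truncated to two decimals.
    // Truncation (RoundingMode.DOWN) is assumed from the table values.
    static BigDecimal gain(String before, String after) {
        return new BigDecimal(after)
                .divide(new BigDecimal(before), 2, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        // microByteGather128, SIZE 64
        System.out.println(gain("44221.691", "46837.124"));
        // microShortGather512_MASK, SIZE 1024
        System.out.println(gain("2902.731", "2629.739"));
        // microShortGather64_NZ_OFF, SIZE 4096
        System.out.println(gain("679.524", "932.748"));
    }
}
```

With rounding instead of truncation, the first row would come out as 1.06 rather than the 1.05 shown above.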
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3004026787