RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Fri Jul 25 03:43:54 UTC 2025

On Thu, 17 Jul 2025 11:28:18 GMT, Fei Gao <fgao at openjdk.org> wrote:

>>> I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios.
>> 
>> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!
>
> 
> Thanks! I’d suggest also highlighting `aarch64` in the JBS title, so others who are interested won’t miss it.

Hi @fg1417, the latest commit refactors all of the IR patterns and the `LoadVectorGather[Masked]` IR based on the discussion above. Could you please take another look? Thanks!

### Main changes
- The type of `LoadVectorGather[Masked]` is changed from the original subword vector type to the `int` vector type. Additionally, a `_mem_bt` member is added to denote the load type.
   - The backend rules are cleaner.
   - Mask generation for the partial cases is cleaner.
- Define `VectorConcatenateNode` and remove `VectorSliceNode`.
   - `VectorConcatenateNode` has the same function as SVE/NEON's `uzp1`. It narrows the elements of each input to half their size and concatenates the narrowed results from src1 and src2 into dst (src1 goes in the lower part of dst and src2 in the higher part).
-  The matcher helper function `vector_idea_reg_size()` is no longer needed and is removed. It was originally used by `VectorSlice`.
-  More IR tests are added for a variety of vector species.
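
To make the `VectorConcatenate` semantics above concrete, here is a minimal scalar sketch in Java. It is only a model of the described behaviour, not the C2 node or the generated code: the class and method names are hypothetical, and it assumes that "narrowing" keeps the low half of each lane, as `uzp1` does on little-endian lanes.

```java
import java.util.Arrays;

// Scalar model of the narrowing concatenation (hypothetical names, not HotSpot code).
final class ConcatenateSketch {
    // Truncates every int lane to a short lane. The narrowed src1 lanes fill the
    // lower half of dst and the narrowed src2 lanes fill the upper half.
    static short[] concatenateIntToShort(int[] src1, int[] src2) {
        if (src1.length != src2.length) {
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        int n = src1.length;
        short[] dst = new short[2 * n];
        for (int i = 0; i < n; i++) {
            dst[i] = (short) src1[i];      // lower part comes from src1
            dst[n + i] = (short) src2[i];  // higher part comes from src2
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] src1 = {1, 2, 0x12345, -1};
        int[] src2 = {5, 6, 7, 8};
        // prints [1, 2, 9029, -1, 5, 6, 7, 8] (0x12345 truncates to 0x2345 = 9029)
        System.out.println(Arrays.toString(concatenateIntToShort(src1, src2)));
    }
}
```

Because the gathered subword values already fit in the narrow type, the truncation in this model loses no information in the gather-load use case.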

### IR implementation
- Cases that need one gather-load
  - `LoadVectorGather (bt: int)` + `VectorCastI2X (bt: byte|short)`
- Cases that need two gather-loads and a merge
  - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)`
  - step-2: `merge = VectorConcatenate(v1, v2) (bt: short)`
  - step-3: (byte only) `v = VectorCastS2X(merge)  (bt: byte)`
- Cases that need four gather-loads and a merge (byte vectors only)
  - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)`
  - step-2: `merge1 = VectorConcatenate(v1, v2) (bt: short)`
  - step-3: `v3 = LoadVectorGather (bt: int)`, `v4 = LoadVectorGather (bt: int)`
  - step-4: `merge2 = VectorConcatenate(v3, v4) (bt: short)`
  - step-5: `v = VectorConcatenate(merge1, merge2) (bt: byte)`
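
The four-gather byte case can be modelled in scalar Java as follows. This is a sketch under stated assumptions rather than the actual compiler transformation: all names are hypothetical, each "vector" is a plain array, and the index array is assumed to be split into four equal quarters, one per gather-load.

```java
import java.util.Arrays;

// Scalar model of the four-gather byte pipeline (hypothetical names, not HotSpot code).
final class SubwordGatherSketch {
    // Models LoadVectorGather (bt: int) with a byte memory type: one byte is loaded
    // per index and widened into an int lane. Sign vs. zero extension does not
    // matter here because the lanes are truncated back to bytes later.
    static int[] gatherBytesAsInts(byte[] mem, int[] indices, int from, int count) {
        int[] v = new int[count];
        for (int i = 0; i < count; i++) {
            v[i] = mem[indices[from + i]];
        }
        return v;
    }

    // Models VectorConcatenate (bt: short): narrow int lanes, a low and b high.
    static short[] concatToShort(int[] a, int[] b) {
        short[] dst = new short[a.length + b.length];
        for (int i = 0; i < a.length; i++) dst[i] = (short) a[i];
        for (int i = 0; i < b.length; i++) dst[a.length + i] = (short) b[i];
        return dst;
    }

    // Models VectorConcatenate (bt: byte): narrow short lanes, a low and b high.
    static byte[] concatToByte(short[] a, short[] b) {
        byte[] dst = new byte[a.length + b.length];
        for (int i = 0; i < a.length; i++) dst[i] = (byte) a[i];
        for (int i = 0; i < b.length; i++) dst[a.length + i] = (byte) b[i];
        return dst;
    }

    // Steps 1 to 5 from the list above; indices.length must be a multiple of 4.
    static byte[] gatherBytesFourLoads(byte[] mem, int[] indices) {
        int q = indices.length / 4;
        int[] v1 = gatherBytesAsInts(mem, indices, 0, q);
        int[] v2 = gatherBytesAsInts(mem, indices, q, q);
        short[] merge1 = concatToShort(v1, v2);
        int[] v3 = gatherBytesAsInts(mem, indices, 2 * q, q);
        int[] v4 = gatherBytesAsInts(mem, indices, 3 * q, q);
        short[] merge2 = concatToShort(v3, v4);
        return concatToByte(merge1, merge2);
    }

    public static void main(String[] args) {
        byte[] mem = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] indices = {7, 0, 3, 3, 1, 6, 2, 5};
        // prints [80, 10, 40, 40, 20, 70, 30, 60]
        System.out.println(Arrays.toString(gatherBytesFourLoads(mem, indices)));
    }
}
```

The result equals a direct element-by-element gather `mem[indices[i]]`, which is the property the merged IR pattern has to preserve.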

### Performance change
Some micro benchmarks show uplifts of about 4% to 9%. No significant regressions are observed.
The following is the performance change on NVIDIA Grace with the latest commit:

Benchmark                        (SIZE)   Mode   Units      Before     After   Gain
microByteGather128                   64  thrpt   ops/ms  48405.283  48668.502  1.005
microByteGather128                  256  thrpt   ops/ms  12821.924  12662.342  0.987
microByteGather128                 1024  thrpt   ops/ms   3253.778   3198.608  0.983
microByteGather128                 4096  thrpt   ops/ms    817.604    801.250  0.979
microByteGather128_MASK              64  thrpt   ops/ms  46124.722  48334.916  1.047
microByteGather128_MASK             256  thrpt   ops/ms  12152.575  12652.821  1.041
microByteGather128_MASK            1024  thrpt   ops/ms   3075.066   3193.787  1.038
microByteGather128_MASK            4096  thrpt   ops/ms    812.738    803.017  0.988
microByteGather128_MASK_NZ_OFF       64  thrpt   ops/ms  46130.244  48384.633  1.048
microByteGather128_MASK_NZ_OFF      256  thrpt   ops/ms  12139.800  12624.298  1.039
microByteGather128_MASK_NZ_OFF     1024  thrpt   ops/ms   3078.040   3203.049  1.040
microByteGather128_MASK_NZ_OFF     4096  thrpt   ops/ms    812.716    802.712  0.987
microByteGather128_NZ_OFF            64  thrpt   ops/ms  48369.524  48643.937  1.005
microByteGather128_NZ_OFF           256  thrpt   ops/ms  12814.552  12672.757  0.988
microByteGather128_NZ_OFF          1024  thrpt   ops/ms   3253.294   3202.016  0.984
microByteGather128_NZ_OFF          4096  thrpt   ops/ms    818.389    805.488  0.984
microByteGather64                    64  thrpt   ops/ms  48491.633  50615.848  1.043
microByteGather64                   256  thrpt   ops/ms  12340.778  13156.762  1.066
microByteGather64                  1024  thrpt   ops/ms   3067.592   3322.777  1.083
microByteGather64                  4096  thrpt   ops/ms    767.111    832.409  1.085
microByteGather64_MASK               64  thrpt   ops/ms  48526.894  50730.468  1.045
microByteGather64_MASK              256  thrpt   ops/ms  12340.398  13159.723  1.066
microByteGather64_MASK             1024  thrpt   ops/ms   3066.227   3327.964  1.085
microByteGather64_MASK             4096  thrpt   ops/ms    767.390    833.327  1.085
microByteGather64_MASK_NZ_OFF        64  thrpt   ops/ms  48472.912  51287.634  1.058
microByteGather64_MASK_NZ_OFF       256  thrpt   ops/ms  12331.578  13258.954  1.075
microByteGather64_MASK_NZ_OFF      1024  thrpt   ops/ms   3070.319   3345.911  1.089
microByteGather64_MASK_NZ_OFF      4096  thrpt   ops/ms    767.097    838.008  1.092
microByteGather64_NZ_OFF             64  thrpt   ops/ms  48492.984  51224.743  1.056
microByteGather64_NZ_OFF            256  thrpt   ops/ms  12334.944  13240.494  1.073
microByteGather64_NZ_OFF           1024  thrpt   ops/ms   3067.754   3343.387  1.089
microByteGather64_NZ_OFF           4096  thrpt   ops/ms    767.123    837.642  1.091
microShortGather128                  64  thrpt   ops/ms  37717.835  37041.162  0.982
microShortGather128                 256  thrpt   ops/ms   9467.160   9890.109  1.044
microShortGather128                1024  thrpt   ops/ms   2376.520   2481.753  1.044
microShortGather128                4096  thrpt   ops/ms    595.030    621.274  1.044
microShortGather128_MASK             64  thrpt   ops/ms  37655.017  37036.887  0.983
microShortGather128_MASK            256  thrpt   ops/ms   9471.324   9859.461  1.040
microShortGather128_MASK           1024  thrpt   ops/ms   2376.811   2477.106  1.042
microShortGather128_MASK           4096  thrpt   ops/ms    595.049    620.082  1.042
microShortGather128_MASK_NZ_OFF      64  thrpt   ops/ms  37636.229  37029.468  0.983
microShortGather128_MASK_NZ_OFF     256  thrpt   ops/ms   9483.674   9867.427  1.040
microShortGather128_MASK_NZ_OFF    1024  thrpt   ops/ms   2379.877   2478.608  1.041
microShortGather128_MASK_NZ_OFF    4096  thrpt   ops/ms    594.710    620.455  1.043
microShortGather128_NZ_OFF           64  thrpt   ops/ms  37706.896  37044.505  0.982
microShortGather128_NZ_OFF          256  thrpt   ops/ms   9487.006   9882.079  1.041
microShortGather128_NZ_OFF         1024  thrpt   ops/ms   2379.571   2482.341  1.043
microShortGather128_NZ_OFF         4096  thrpt   ops/ms    595.099    621.392  1.044
microShortGather64                   64  thrpt   ops/ms  37773.485  37502.698  0.992
microShortGather64                  256  thrpt   ops/ms   9591.046   9640.225  1.005
microShortGather64                 1024  thrpt   ops/ms   2406.013   2420.376  1.005
microShortGather64                 4096  thrpt   ops/ms    603.270    606.541  1.005
microShortGather64_MASK              64  thrpt   ops/ms  37781.860  37479.295  0.991
microShortGather64_MASK             256  thrpt   ops/ms   9608.015   9657.010  1.005
microShortGather64_MASK            1024  thrpt   ops/ms   2406.828   2422.170  1.006
microShortGather64_MASK            4096  thrpt   ops/ms    602.965    606.283  1.005
microShortGather64_MASK_NZ_OFF       64  thrpt   ops/ms  37740.577  37487.740  0.993
microShortGather64_MASK_NZ_OFF      256  thrpt   ops/ms   9593.611   9663.041  1.007
microShortGather64_MASK_NZ_OFF     1024  thrpt   ops/ms   2404.846   2423.493  1.007
microShortGather64_MASK_NZ_OFF     4096  thrpt   ops/ms    602.691    605.911  1.005
microShortGather64_NZ_OFF            64  thrpt   ops/ms  37723.586  37507.899  0.994
microShortGather64_NZ_OFF           256  thrpt   ops/ms   9589.985   9630.033  1.004
microShortGather64_NZ_OFF          1024  thrpt   ops/ms   2405.774   2423.655  1.007
microShortGather64_NZ_OFF          4096  thrpt   ops/ms    602.778    606.151  1.005
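
For readers checking the table, the Gain column appears to be After divided by Before, so values above 1 are improvements for these throughput scores. The sketch below recomputes two rows. The names are hypothetical, and the truncation to three decimals is an assumption inferred from the table values, not something stated in the post.

```java
// Recomputes the Gain column in thousandths (hypothetical helper, not part of the patch).
final class GainSketch {
    // Returns after / before scaled by 1000 and truncated, e.g. 1092 for a gain of 1.092.
    static long gainThousandths(double before, double after) {
        return (long) Math.floor(after / before * 1000.0);
    }

    public static void main(String[] args) {
        // microByteGather64_MASK_NZ_OFF, SIZE 4096: prints 1092
        System.out.println(gainThousandths(767.097, 838.008));
        // microByteGather128, SIZE 4096: prints 979
        System.out.println(gainThousandths(817.604, 801.250));
    }
}
```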

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3116280179