RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation
Xiaohong Gong
xgong at openjdk.org
Fri Jul 25 03:43:54 UTC 2025
On Thu, 17 Jul 2025 11:28:18 GMT, Fei Gao <fgao at openjdk.org> wrote:
>>> I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios.
>>
>> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!
>
> Thanks! I’d suggest also highlighting `aarch64` in the JBS title, so others who are interested won’t miss it.
Hi @fg1417, the latest commit refactors all the IR patterns and the `LoadVectorGather[Masked]` IR based on the discussion above. Could you please take another look? Thanks!
### Main changes
- The type of `LoadVectorGather[Masked]` is changed from the original subword vector type to an `int` vector type. Additionally, a `_mem_bt` member is added to denote the load type. As a result:
  - The backend rules are clean.
  - Mask generation for partial cases is clean.
- Define `VectorConcatenateNode` and remove `VectorSliceNode`.
  - `VectorConcatenateNode` has the same function as SVE/NEON's `uzp1`. It narrows each input element to half its size and concatenates the narrowed results from src1 and src2 into dst (src1 goes into the lower part of dst and src2 into the higher part).
- The matcher helper function `vector_idea_reg_size()` is no longer needed and has been removed. It was originally used by `VectorSlice`.
- More IR tests are added for a variety of vector species.
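The narrowing-and-concatenating behavior described for `VectorConcatenateNode` can be illustrated with a scalar model. This is only a sketch over plain Java arrays, not the HotSpot node or the `uzp1` encoding; the class and method names are hypothetical, and narrowing is assumed to keep the low half of each element.

```java
import java.util.Arrays;

// Hypothetical scalar model of the concatenate semantics; not HotSpot code.
class VectorConcatenateSketch {

    // Narrow each int lane to a short (keeping the low 16 bits) and place
    // the results from src1 in the lower half of dst, src2 in the upper half.
    static short[] concatenateIntToShort(int[] src1, int[] src2) {
        if (src1.length != src2.length) {
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        int n = src1.length;
        short[] dst = new short[2 * n];
        for (int i = 0; i < n; i++) {
            dst[i] = (short) src1[i];
            dst[n + i] = (short) src2[i];
        }
        return dst;
    }

    // Same idea one step further down: short lanes narrowed to byte lanes.
    static byte[] concatenateShortToByte(short[] src1, short[] src2) {
        if (src1.length != src2.length) {
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        int n = src1.length;
        byte[] dst = new byte[2 * n];
        for (int i = 0; i < n; i++) {
            dst[i] = (byte) src1[i];
            dst[n + i] = (byte) src2[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 2, 3, 4};
        int[] v2 = {5, 6, 7, 0x12345}; // high bits are dropped by the narrowing
        System.out.println(Arrays.toString(concatenateIntToShort(v1, v2)));
    }
}
```

The output vector has twice the lane count of each input at half the element size, so the total size matches that of a single input vector.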
### IR implementation
- Cases that need one gather-load:
  - `LoadVectorGather (bt: int)` + `VectorCastI2X (bt: byte|short)`
- Cases that need two gather-loads and a merge:
  - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)`
  - step-2: `merge = VectorConcatenate(v1, v2) (bt: short)`
  - step-3: (byte only) `v = VectorCastS2X(merge) (bt: byte)`
- Cases that need four gather-loads and merges (byte vectors only):
  - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)`
  - step-2: `merge1 = VectorConcatenate(v1, v2) (bt: short)`
  - step-3: `v3 = LoadVectorGather (bt: int)`, `v4 = LoadVectorGather (bt: int)`
  - step-4: `merge2 = VectorConcatenate(v3, v4) (bt: short)`
  - step-5: `v = VectorConcatenate(merge1, merge2) (bt: byte)`
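The four-gather byte case can be modeled in scalar Java to show how the pieces fit together. This is a minimal sketch, not the C2 implementation: it assumes each `int` gather fills 4 lanes (as with a 128-bit vector), uses plain arrays in place of vector registers, and all class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical scalar model of the four-gather byte pattern; not HotSpot code.
class SubwordGatherSketch {
    // Assumption: one int gather fills 4 lanes (a 128-bit vector of int).
    static final int INT_LANES = 4;

    // Stand-in for LoadVectorGather (bt: int) with byte memory type:
    // each lane loads one byte from mem and holds it in an int lane.
    static int[] gatherBytesAsInts(byte[] mem, int[] indices, int from) {
        int[] v = new int[INT_LANES];
        for (int i = 0; i < INT_LANES; i++) {
            v[i] = mem[indices[from + i]];
        }
        return v;
    }

    // Stand-in for VectorConcatenate (bt: short): narrow and concatenate.
    static short[] concatToShort(int[] a, int[] b) {
        short[] dst = new short[a.length + b.length];
        for (int i = 0; i < a.length; i++) dst[i] = (short) a[i];
        for (int i = 0; i < b.length; i++) dst[a.length + i] = (short) b[i];
        return dst;
    }

    // Stand-in for VectorConcatenate (bt: byte): narrow and concatenate.
    static byte[] concatToByte(short[] a, short[] b) {
        byte[] dst = new byte[a.length + b.length];
        for (int i = 0; i < a.length; i++) dst[i] = (byte) a[i];
        for (int i = 0; i < b.length; i++) dst[a.length + i] = (byte) b[i];
        return dst;
    }

    // Steps 1 to 5 of the four-gather case, producing 16 byte lanes.
    static byte[] gather16Bytes(byte[] mem, int[] indices) {
        int[] v1 = gatherBytesAsInts(mem, indices, 0);
        int[] v2 = gatherBytesAsInts(mem, indices, INT_LANES);
        short[] merge1 = concatToShort(v1, v2);
        int[] v3 = gatherBytesAsInts(mem, indices, 2 * INT_LANES);
        int[] v4 = gatherBytesAsInts(mem, indices, 3 * INT_LANES);
        short[] merge2 = concatToShort(v3, v4);
        return concatToByte(merge1, merge2);
    }

    public static void main(String[] args) {
        byte[] mem = new byte[64];
        for (int i = 0; i < mem.length; i++) mem[i] = (byte) i;
        int[] indices = {63, 0, 17, 5, 42, 8, 9, 31, 2, 60, 33, 12, 7, 7, 50, 1};

        byte[] gathered = gather16Bytes(mem, indices);

        // Reference result: a direct element-by-element gather.
        byte[] expected = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) expected[i] = mem[indices[i]];
        System.out.println(Arrays.equals(gathered, expected));
    }
}
```

Because each narrowing step keeps only the low half of every lane, the byte that was loaded into an `int` lane survives both concatenations unchanged, including negative values.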
### Performance change
Uplifts of about 4% to 9% are observed on some micro benchmarks. No significant regressions are observed.
The following is the performance change on NVIDIA Grace with the latest commit:

| Benchmark | (SIZE) | Mode | Units | Before | After | Gain |
|---|---|---|---|---|---|---|
| microByteGather128 | 64 | thrpt | ops/ms | 48405.283 | 48668.502 | 1.005 |
| microByteGather128 | 256 | thrpt | ops/ms | 12821.924 | 12662.342 | 0.987 |
| microByteGather128 | 1024 | thrpt | ops/ms | 3253.778 | 3198.608 | 0.983 |
| microByteGather128 | 4096 | thrpt | ops/ms | 817.604 | 801.250 | 0.979 |
| microByteGather128_MASK | 64 | thrpt | ops/ms | 46124.722 | 48334.916 | 1.047 |
| microByteGather128_MASK | 256 | thrpt | ops/ms | 12152.575 | 12652.821 | 1.041 |
| microByteGather128_MASK | 1024 | thrpt | ops/ms | 3075.066 | 3193.787 | 1.038 |
| microByteGather128_MASK | 4096 | thrpt | ops/ms | 812.738 | 803.017 | 0.988 |
| microByteGather128_MASK_NZ_OFF | 64 | thrpt | ops/ms | 46130.244 | 48384.633 | 1.048 |
| microByteGather128_MASK_NZ_OFF | 256 | thrpt | ops/ms | 12139.800 | 12624.298 | 1.039 |
| microByteGather128_MASK_NZ_OFF | 1024 | thrpt | ops/ms | 3078.040 | 3203.049 | 1.040 |
| microByteGather128_MASK_NZ_OFF | 4096 | thrpt | ops/ms | 812.716 | 802.712 | 0.987 |
| microByteGather128_NZ_OFF | 64 | thrpt | ops/ms | 48369.524 | 48643.937 | 1.005 |
| microByteGather128_NZ_OFF | 256 | thrpt | ops/ms | 12814.552 | 12672.757 | 0.988 |
| microByteGather128_NZ_OFF | 1024 | thrpt | ops/ms | 3253.294 | 3202.016 | 0.984 |
| microByteGather128_NZ_OFF | 4096 | thrpt | ops/ms | 818.389 | 805.488 | 0.984 |
| microByteGather64 | 64 | thrpt | ops/ms | 48491.633 | 50615.848 | 1.043 |
| microByteGather64 | 256 | thrpt | ops/ms | 12340.778 | 13156.762 | 1.066 |
| microByteGather64 | 1024 | thrpt | ops/ms | 3067.592 | 3322.777 | 1.083 |
| microByteGather64 | 4096 | thrpt | ops/ms | 767.111 | 832.409 | 1.085 |
| microByteGather64_MASK | 64 | thrpt | ops/ms | 48526.894 | 50730.468 | 1.045 |
| microByteGather64_MASK | 256 | thrpt | ops/ms | 12340.398 | 13159.723 | 1.066 |
| microByteGather64_MASK | 1024 | thrpt | ops/ms | 3066.227 | 3327.964 | 1.085 |
| microByteGather64_MASK | 4096 | thrpt | ops/ms | 767.390 | 833.327 | 1.085 |
| microByteGather64_MASK_NZ_OFF | 64 | thrpt | ops/ms | 48472.912 | 51287.634 | 1.058 |
| microByteGather64_MASK_NZ_OFF | 256 | thrpt | ops/ms | 12331.578 | 13258.954 | 1.075 |
| microByteGather64_MASK_NZ_OFF | 1024 | thrpt | ops/ms | 3070.319 | 3345.911 | 1.089 |
| microByteGather64_MASK_NZ_OFF | 4096 | thrpt | ops/ms | 767.097 | 838.008 | 1.092 |
| microByteGather64_NZ_OFF | 64 | thrpt | ops/ms | 48492.984 | 51224.743 | 1.056 |
| microByteGather64_NZ_OFF | 256 | thrpt | ops/ms | 12334.944 | 13240.494 | 1.073 |
| microByteGather64_NZ_OFF | 1024 | thrpt | ops/ms | 3067.754 | 3343.387 | 1.089 |
| microByteGather64_NZ_OFF | 4096 | thrpt | ops/ms | 767.123 | 837.642 | 1.091 |
| microShortGather128 | 64 | thrpt | ops/ms | 37717.835 | 37041.162 | 0.982 |
| microShortGather128 | 256 | thrpt | ops/ms | 9467.160 | 9890.109 | 1.044 |
| microShortGather128 | 1024 | thrpt | ops/ms | 2376.520 | 2481.753 | 1.044 |
| microShortGather128 | 4096 | thrpt | ops/ms | 595.030 | 621.274 | 1.044 |
| microShortGather128_MASK | 64 | thrpt | ops/ms | 37655.017 | 37036.887 | 0.983 |
| microShortGather128_MASK | 256 | thrpt | ops/ms | 9471.324 | 9859.461 | 1.040 |
| microShortGather128_MASK | 1024 | thrpt | ops/ms | 2376.811 | 2477.106 | 1.042 |
| microShortGather128_MASK | 4096 | thrpt | ops/ms | 595.049 | 620.082 | 1.042 |
| microShortGather128_MASK_NZ_OFF | 64 | thrpt | ops/ms | 37636.229 | 37029.468 | 0.983 |
| microShortGather128_MASK_NZ_OFF | 256 | thrpt | ops/ms | 9483.674 | 9867.427 | 1.040 |
| microShortGather128_MASK_NZ_OFF | 1024 | thrpt | ops/ms | 2379.877 | 2478.608 | 1.041 |
| microShortGather128_MASK_NZ_OFF | 4096 | thrpt | ops/ms | 594.710 | 620.455 | 1.043 |
| microShortGather128_NZ_OFF | 64 | thrpt | ops/ms | 37706.896 | 37044.505 | 0.982 |
| microShortGather128_NZ_OFF | 256 | thrpt | ops/ms | 9487.006 | 9882.079 | 1.041 |
| microShortGather128_NZ_OFF | 1024 | thrpt | ops/ms | 2379.571 | 2482.341 | 1.043 |
| microShortGather128_NZ_OFF | 4096 | thrpt | ops/ms | 595.099 | 621.392 | 1.044 |
| microShortGather64 | 64 | thrpt | ops/ms | 37773.485 | 37502.698 | 0.992 |
| microShortGather64 | 256 | thrpt | ops/ms | 9591.046 | 9640.225 | 1.005 |
| microShortGather64 | 1024 | thrpt | ops/ms | 2406.013 | 2420.376 | 1.005 |
| microShortGather64 | 4096 | thrpt | ops/ms | 603.270 | 606.541 | 1.005 |
| microShortGather64_MASK | 64 | thrpt | ops/ms | 37781.860 | 37479.295 | 0.991 |
| microShortGather64_MASK | 256 | thrpt | ops/ms | 9608.015 | 9657.010 | 1.005 |
| microShortGather64_MASK | 1024 | thrpt | ops/ms | 2406.828 | 2422.170 | 1.006 |
| microShortGather64_MASK | 4096 | thrpt | ops/ms | 602.965 | 606.283 | 1.005 |
| microShortGather64_MASK_NZ_OFF | 64 | thrpt | ops/ms | 37740.577 | 37487.740 | 0.993 |
| microShortGather64_MASK_NZ_OFF | 256 | thrpt | ops/ms | 9593.611 | 9663.041 | 1.007 |
| microShortGather64_MASK_NZ_OFF | 1024 | thrpt | ops/ms | 2404.846 | 2423.493 | 1.007 |
| microShortGather64_MASK_NZ_OFF | 4096 | thrpt | ops/ms | 602.691 | 605.911 | 1.005 |
| microShortGather64_NZ_OFF | 64 | thrpt | ops/ms | 37723.586 | 37507.899 | 0.994 |
| microShortGather64_NZ_OFF | 256 | thrpt | ops/ms | 9589.985 | 9630.033 | 1.004 |
| microShortGather64_NZ_OFF | 1024 | thrpt | ops/ms | 2405.774 | 2423.655 | 1.007 |
| microShortGather64_NZ_OFF | 4096 | thrpt | ops/ms | 602.778 | 606.151 | 1.005 |

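
The Gain column is consistent with the ratio After / Before, so values above 1.0 mean higher throughput with the patch. A small sketch of that arithmetic, using two rows from the results above (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper that recomputes the Gain column from Before/After.
class GainSketch {

    // Throughput ratio: values above 1.0 mean the patched build is faster.
    static double gain(double before, double after) {
        return after / before;
    }

    // Format to three decimals, matching the precision of the Gain column.
    static String formatGain(double before, double after) {
        return String.format(Locale.ROOT, "%.3f", gain(before, after));
    }

    public static void main(String[] args) {
        // microByteGather128, SIZE 64
        System.out.println(formatGain(48405.283, 48668.502));
        // microByteGather64_MASK_NZ_OFF, SIZE 4096
        System.out.println(formatGain(767.097, 838.008));
    }
}
```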
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3116280179