RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
Eric Fang
erfang at openjdk.org
Thu Jan 22 12:39:48 UTC 2026
While optimizing some VectorMask-related APIs, we found an optimization opportunity involving the `cpy (immediate, zeroing)` instruction [1]. Implementing the same functionality with the `cpy (immediate, merging)` instruction [2] leads to better performance.
Currently, the `cpy (imm, zeroing)` instruction is used in the code generated for `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization can therefore benefit any vector API that generates either of these two IR nodes, such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
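For readers less familiar with these two APIs, here is a minimal scalar sketch of what they compute. It is only a model: `MaskStoreModel` and its methods are hypothetical names, the mask is represented as a plain `boolean[]`, and the sketch considers at most 64 lanes. It does not use `jdk.incubator.vector` and says nothing about the code C2 actually emits.

```java
public class MaskStoreModel {
    // Scalar model of VectorMask.toLong(): bit i of the result is set
    // if and only if lane i of the mask is set (at most 64 lanes here).
    public static long toLong(boolean[] lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length && i < 64; i++) {
            if (lanes[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Scalar model of VectorMask.intoArray(boolean[] dst, int offset):
    // each lane is stored as one boolean element starting at offset.
    public static void intoArray(boolean[] lanes, boolean[] dst, int offset) {
        System.arraycopy(lanes, 0, dst, offset, lanes.length);
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, true};
        // Lanes 0, 2 and 3 are set, so the packed value is 0b1101.
        System.out.println(Long.toBinaryString(toLong(mask)));

        boolean[] dst = new boolean[6];
        intoArray(mask, dst, 1);
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```

In both cases every lane ends up as a 0 or a 1, which is consistent with the `cpy ..., #1` form shown in the example in this PR.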
Microbenchmarks show this change brings performance uplift ranging from **11%** to **33%**, depending on the specific operation and data types.
The specific changes in this PR:
1. In the assembler, implement the functionality of the `cpy (imm, zeroing)` instruction with a `movi` followed by a `cpy (imm, merging)` instruction:
    cpy z17.d, p1/z, #1
   =>
    movi v17.2d, #0 // this instruction is zero cost
    cpy z17.d, p1/m, #1
2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to apply this optimization or not.
- This option belongs to the Arch product category.
- The default value is true on Neoverse-V1/V2 where the improvement has been confirmed, false on others.
- When its value is true, the change is applied.
3. Add a jtreg test to verify the behavior of this option.
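The equivalence used in item 1 can be checked with a small scalar model. This is a sketch under stated assumptions, not the HotSpot implementation: `CpyModel` and its method names are hypothetical, lanes are modeled as `long` elements, and `movi ... #0` is modeled simply as clearing every lane of the destination.

```java
import java.util.Arrays;

public class CpyModel {
    // Model of `cpy zd, pg/z, #imm`: active lanes receive imm,
    // inactive lanes are set to zero.
    public static long[] cpyZeroing(boolean[] pg, long imm) {
        long[] zd = new long[pg.length];
        for (int i = 0; i < pg.length; i++) {
            zd[i] = pg[i] ? imm : 0L;
        }
        return zd;
    }

    // Model of `cpy zd, pg/m, #imm`: active lanes receive imm,
    // inactive lanes keep the previous contents of zd.
    public static long[] cpyMerging(long[] zd, boolean[] pg, long imm) {
        long[] out = zd.clone();
        for (int i = 0; i < pg.length; i++) {
            if (pg[i]) {
                out[i] = imm;
            }
        }
        return out;
    }

    // Model of `movi vd.2d, #0` (assumption: all lanes become zero).
    public static long[] moviZero(int lanes) {
        return new long[lanes];
    }

    public static void main(String[] args) {
        boolean[] p1 = {true, false}; // two 64-bit lanes, as with 128-bit SVE
        long[] zeroing = cpyZeroing(p1, 1);
        long[] merging = cpyMerging(moviZero(p1.length), p1, 1);
        System.out.println(Arrays.equals(zeroing, merging)); // prints "true"
    }
}
```

Because the destination is cleared first, the inactive lanes that the merging form leaves untouched already hold zero, so both sequences produce the same register contents for any predicate.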
This PR was tested on aarch64 and x86 machines with different configurations, and all tests passed.
JMH benchmarks:
On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:
Benchmark           Unit      Size     Before     Error      After     Error  Uplift
byteIndexInRange    ops/ms    7.00  471816.15   1125.96  473237.77   1593.92    1.00
byteIndexInRange    ops/ms  256.00  149654.21    416.57  149259.95    116.59    1.00
byteIndexInRange    ops/ms  259.00  177850.31    991.13  179785.19   1110.07    1.01
byteIndexInRange    ops/ms  512.00  133393.26    167.26  133484.61    281.83    1.00
doubleIndexInRange  ops/ms    7.00  302176.39   12848.8  299813.02     37.76    0.99
doubleIndexInRange  ops/ms  256.00   47831.93     56.70   46708.70     56.11    0.98
doubleIndexInRange  ops/ms  259.00   11550.02     27.95   15333.50     10.40    1.33
doubleIndexInRange  ops/ms  512.00   23687.76     61.65   23996.08     69.52    1.01
floatIndexInRange   ops/ms    7.00  412195.79    124.71  411770.23     78.73    1.00
floatIndexInRange   ops/ms  256.00   84479.98     70.69   84237.31     70.15    1.00
floatIndexInRange   ops/ms  259.00   22585.65     80.07   28296.21      7.98    1.25
floatIndexInRange   ops/ms  512.00   46902.99     51.60   46686.68     66.01    1.00
intIndexInRange     ops/ms    7.00  413411.70     50.59  420684.66    253.55    1.02
intIndexInRange     ops/ms  256.00   84652.41    191.45   86758.74    193.66    1.02
intIndexInRange     ops/ms  259.00   61825.20    291.71   62037.58   2355.43    1.00
intIndexInRange     ops/ms  512.00   46754.89    149.72   46972.06     40.13    1.00
longIndexInRange    ops/ms    7.00  329385.10    3292.7  318538.75   11103.9    0.97
longIndexInRange    ops/ms  256.00   46910.36     53.41   46927.82    138.29    1.00
longIndexInRange    ops/ms  259.00   33126.45   3210.07   32245.59   1347.58    0.97
longIndexInRange    ops/ms  512.00   23931.64    215.55   23805.65    312.39    0.99
shortIndexInRange   ops/ms    7.00  479265.67   1055.89  468452.89    433.15    0.98
shortIndexInRange   ops/ms  256.00  138657.38    317.72  138695.29    505.69    1.00
shortIndexInRange   ops/ms  259.00  113353.87    913.13  108912.75   1125.60    0.96
shortIndexInRange   ops/ms  512.00   84652.74    171.37   84447.01     91.99    1.00
On an AWS Graviton3 (Neoverse-V1) machine with 128-bit SVE1:
Benchmark           Unit      Size     Before     Error      After     Error  Uplift
byteIndexInRange    ops/ms    7.00  320073.86    669.91  318557.87   1285.42    1.00
byteIndexInRange    ops/ms  256.00  119246.71     43.13  120658.01     28.27    1.01
byteIndexInRange    ops/ms  259.00  137664.23   12001.6  150378.59     70.41    1.09
byteIndexInRange    ops/ms  512.00   97187.13     18.60   95356.43     78.60    0.98
doubleIndexInRange  ops/ms    7.00  291076.68    603.08  287383.75    518.59    0.99
doubleIndexInRange  ops/ms  256.00   57473.11    123.34   61559.58    687.21    1.07
doubleIndexInRange  ops/ms  259.00   19396.73     40.03   22046.65      8.66    1.14
doubleIndexInRange  ops/ms  512.00   33619.28     33.58   34715.40    157.72    1.03
floatIndexInRange   ops/ms    7.00  317295.18    627.76  303857.78    465.78    0.96
floatIndexInRange   ops/ms  256.00   91734.27    183.61   91851.31    394.35    1.00
floatIndexInRange   ops/ms  259.00   38103.12    129.44   42237.38     92.17    1.11
floatIndexInRange   ops/ms  512.00   57219.58    366.00   57769.07    264.71    1.01
intIndexInRange     ops/ms    7.00  317063.25    830.81  304289.56    541.12    0.96
intIndexInRange     ops/ms  256.00   91535.60    315.36   98143.40    142.44    1.07
intIndexInRange     ops/ms  259.00   73827.89    472.28   73781.80     21.53    1.00
intIndexInRange     ops/ms  512.00   57552.09     20.19   62348.87     37.45    1.08
longIndexInRange    ops/ms    7.00  301886.14    381.89  301636.82    184.80    1.00
longIndexInRange    ops/ms  256.00   62246.77     69.29   62093.75     88.72    1.00
longIndexInRange    ops/ms  259.00   40642.36    861.47   41566.43    256.04    1.02
longIndexInRange    ops/ms  512.00   34850.70    154.39   34884.42    149.17    1.00
shortIndexInRange   ops/ms    7.00  318133.03    593.20  313469.12    528.73    0.99
shortIndexInRange   ops/ms  256.00  105019.58     21.38  105014.90     21.81    1.00
shortIndexInRange   ops/ms  259.00  116235.93   1985.27  118697.74     48.41    1.02
shortIndexInRange   ops/ms  512.00   91981.84    166.84   91874.82     78.28    1.00
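As a reading aid for the tables above, the Uplift column appears to be the ratio After / Before, which matches the listed rows. Since the unit is ops/ms, a value above 1.00 means higher throughput after the change. The sketch below recomputes two rows; the class and method names are illustrative only.

```java
import java.util.Locale;

public class Uplift {
    // Assumed definition of the Uplift column: throughput after the change
    // divided by throughput before the change.
    public static double uplift(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // doubleIndexInRange, size 259, Neoverse-V2 table
        System.out.println(String.format(Locale.ROOT, "%.2f", uplift(11550.02, 15333.50)));
        // doubleIndexInRange, size 259, Neoverse-V1 table
        System.out.println(String.format(Locale.ROOT, "%.2f", uplift(19396.73, 22046.65)));
    }
}
```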
[1] https://developer.arm.com/documentation/ddi0602/2025-06/SVE-Instructions/CPY--immediate--zeroing---Copy-signed-integer-immediate-to-vector-elements--zeroing--?lang=en
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/CPY--immediate--merging---Copy-signed-integer-immediate-to-vector-elements--merging--?lang=en
-------------
Commit messages:
- 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
Changes: https://git.openjdk.org/jdk/pull/29359/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29359&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8374349
Stats: 193 lines in 6 files changed: 171 ins; 7 del; 15 mod
Patch: https://git.openjdk.org/jdk/pull/29359.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/29359/head:pull/29359
PR: https://git.openjdk.org/jdk/pull/29359