RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v4]
Eric Fang
erfang at openjdk.org
Tue Feb 24 06:12:53 UTC 2026
On Tue, 24 Feb 2026 03:15:14 GMT, Eric Fang <erfang at openjdk.org> wrote:
>> When optimizing some VectorMask-related APIs, we found an optimization opportunity related to the `cpy (immediate, zeroing)` instruction [1]. Implementing the functionality of this instruction with the `cpy (immediate, merging)` instruction [2] leads to better performance.
>>
>> Currently, the `cpy (imm, zeroing)` instruction is used in code generated by `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization potentially benefits all vector APIs that generate these two IR nodes, such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
>>
>> Microbenchmarks show that this change brings a performance uplift ranging from **11%** to **33%**, depending on the specific operation and data types.
>>
>> The specific changes in this PR:
>> 1. Implement the functionality of the `cpy (imm, zeroing)` instruction with a `movi` followed by a `cpy (imm, merging)` instruction in the assembler:
>>
>> cpy z17.d, p1/z, #1 =>
>>
>> movi v17.2d, #0 // this instruction is zero cost
>> cpy z17.d, p1/m, #1
>>
>>
>> 2. Add a new option `PreferSVEMergingModeCPY` to control whether this optimization is applied.
>> - This option belongs to the Arch product category.
>> - The default value is true on Neoverse-V1/V2 where the improvement has been confirmed, false on others.
>> - When its value is true, the change is applied.
>>
>> 3. Add a jtreg test to verify the behavior of this option.
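The rewrite in item 1 can be modelled in plain Java to show why the two forms produce the same register contents. This is only a rough sketch, not the HotSpot implementation: the class and method names are hypothetical, lanes are modelled as `long` array elements, and `movi` is assumed to clear every lane of the destination register.

```java
import java.util.Arrays;

// Hypothetical model of the two SVE CPY forms; not HotSpot code.
final class SveCpySketch {
    // Models `cpy zd, pg/z, #imm`: active lanes get imm, inactive lanes become zero.
    static long[] cpyZeroing(boolean[] pg, long imm) {
        long[] zd = new long[pg.length];
        for (int i = 0; i < pg.length; i++) {
            zd[i] = pg[i] ? imm : 0L;
        }
        return zd;
    }

    // Models `cpy zd, pg/m, #imm`: active lanes get imm, inactive lanes keep their old value.
    static void cpyMerging(long[] zd, boolean[] pg, long imm) {
        for (int i = 0; i < pg.length; i++) {
            if (pg[i]) {
                zd[i] = imm;
            }
        }
    }

    // Models the replacement sequence: clear the register first, then do a merging copy.
    static long[] moviThenCpyMerging(long[] staleZd, boolean[] pg, long imm) {
        long[] zd = staleZd.clone();
        Arrays.fill(zd, 0L);        // stands in for `movi v.2d, #0` (assumed to clear all lanes)
        cpyMerging(zd, pg, imm);    // stands in for `cpy z.d, pg/m, #imm`
        return zd;
    }

    public static void main(String[] args) {
        boolean[] pg = {true, false};
        long[] stale = {0xdeadL, 0xbeefL};  // arbitrary leftover register contents
        System.out.println(Arrays.equals(cpyZeroing(pg, 1L), moviThenCpyMerging(stale, pg, 1L)));
    }
}
```

Because the register is cleared before the merging copy, the stale values in inactive lanes never survive, which is what makes the two-instruction sequence a drop-in replacement in this model.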
>>
>> This PR was tested on aarch64 and x86 machines with different configurations, and all tests passed.
>>
>> JMH benchmarks:
>>
>> On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:
>>
>> Benchmark           Unit        size     Before     Error      After    Error  Uplift
>> byteIndexInRange    ops/ms      7.00  471816.15   1125.96  473237.77  1593.92    1.00
>> byteIndexInRange    ops/ms    256.00  149654.21    416.57  149259.95   116.59    1.00
>> byteIndexInRange    ops/ms    259.00  177850.31    991.13  179785.19  1110.07    1.01
>> byteIndexInRange    ops/ms    512.00  133393.26    167.26  133484.61   281.83    1.00
>> doubleIndexInRange  ops/ms      7.00  302176.39   12848.8  299813.02    37.76    0.99
>> doubleIndexInRange  ops/ms    256.00   47831.93     56.70   46708.70    56.11    0.98
>> doubleIndexInRange  ops/ms    259.00   11550.02     27.95   15333.50    10.40    1.33
>> doubleIndexInRange  ops/ms    512.00   23687.76     61.65   23996.08    69.52    1.01
>> floatIndexInRange   ops/ms      7.00  412195.79    124.71  411770.23    78.73    1.00
>> floatIndexInRange   ops/ms    256.00   84479.98     70.69   84237.31    70.15    1.00
>> floatIndexInRange   ops/ms    259.00   22585.65     80.07   28296.21     7.98    1.25
>> floatIndexInRange   ops/ms    512.00   46902.99     51.60   46686.68    66.01    1.00
>> intInd...
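For readers checking the numbers, the Uplift column appears to be the After score divided by the Before score, rounded to two decimals. The sketch below recomputes it for two rows quoted above; the class and method names are hypothetical, and the formula is an inference from the table rather than something the PR states.

```java
import java.util.Locale;

// Hypothetical helper: recomputes the "Uplift" column as After / Before.
final class UpliftCheck {
    static double uplift(double before, double after) {
        return after / before;
    }

    // Formats the ratio with two decimals, matching the table's precision.
    static String format(double before, double after) {
        return String.format(Locale.ROOT, "%.2f", uplift(before, after));
    }

    public static void main(String[] args) {
        // doubleIndexInRange, size 259
        System.out.println(format(11550.02, 15333.50));
        // floatIndexInRange, size 259
        System.out.println(format(22585.65, 28296.21));
    }
}
```

Since the unit is ops/ms (throughput), a ratio above 1.00 means the patched build is faster.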
>
> Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
>
> - Revert renaming sve_cpy as sve_cpy_optimized
> - Merge branch 'master' into JDK-8374349-sve-cpy-opt
> - Refine the code comments
> - Move the implementation into C2_MacroAssembler
> - 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
>
The test failure (java/lang/ProcessBuilder/PipelineLeaksFD.java) should be unrelated to this PR. This test fails intermittently, as can be seen both in this PR and in https://github.com/bradfordwetmore/jdk/actions/runs/22332773501/job/64620481585
-------------
PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3949395060