RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v4]

Tue Feb 24 06:12:53 UTC 2026

On Tue, 24 Feb 2026 03:15:14 GMT, Eric Fang <erfang at openjdk.org> wrote:

>> When optimizing some VectorMask-related APIs, we found an optimization opportunity related to the `cpy (immediate, zeroing)` instruction [1]. Implementing the functionality of this instruction using the `cpy (immediate, merging)` instruction [2] leads to better performance.
>> 
>> Currently the `cpy (imm, zeroing)` instruction is used in code generated by `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization potentially benefits all vector APIs that generate these two IR nodes, such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
>> 
>> Microbenchmarks show this change brings performance uplift ranging from **11%** to **33%**, depending on the specific operation and data types.
>> 
>> The specific changes in this PR:
>> 1. Achieve the functionality of the `cpy (imm, zeroing)` instruction with the `movi + cpy (imm, merging)` instructions in assembler:
>> 
>> cpy  z17.d, p1/z, #1 =>
>> 
>> movi v17.2d, #0       // this instruction is zero cost
>> cpy  z17.d, p1/m, #1
>> 
>> 
>> 2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to apply this optimization or not.
>> - This option belongs to the Arch product category.
>> - The default value is true on Neoverse-V1/V2 where the improvement has been confirmed, false on others.
>> - When its value is true, the change is applied.
>> 
>> 3. Add a jtreg test to verify the behavior of this option.
>> 
>> This PR was tested on aarch64 and x86 machines with different configurations, and all tests passed.
>> 
>> JMH benchmarks:
>> 
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>> 
>> Benchmark	        	Unit	size	Before		Error	After		Error	Uplift
>> byteIndexInRange		ops/ms	7.00	471816.15	1125.96	473237.77	1593.92	1.00
>> byteIndexInRange		ops/ms	256.00	149654.21	416.57	149259.95	116.59	1.00
>> byteIndexInRange		ops/ms	259.00	177850.31	991.13	179785.19	1110.07	1.01
>> byteIndexInRange		ops/ms	512.00	133393.26	167.26	133484.61	281.83	1.00
>> doubleIndexInRange		ops/ms	7.00	302176.39	12848.8	299813.02	37.76	0.99
>> doubleIndexInRange		ops/ms	256.00	47831.93	56.70	46708.70	56.11	0.98
>> doubleIndexInRange		ops/ms	259.00	11550.02	27.95	15333.50	10.40	1.33
>> doubleIndexInRange		ops/ms	512.00	23687.76	61.65	23996.08	69.52	1.01
>> floatIndexInRange		ops/ms	7.00	412195.79	124.71	411770.23	78.73	1.00
>> floatIndexInRange		ops/ms	256.00	84479.98	70.69	84237.31	70.15	1.00
>> floatIndexInRange		ops/ms	259.00	22585.65	80.07	28296.21	7.98	1.25
>> floatIndexInRange		ops/ms	512.00	46902.99	51.60	46686.68	66.01	1.00
>> intInd...
>
> Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
> 
>  - Revert renaming sve_cpy as sve_cpy_optimized
>  - Merge branch 'master' into JDK-8374349-sve-cpy-opt
>  - Refine the code comments
>  - Move the implementation into C2_MacroAssembler
>  - 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
>    
>    When optimizing some VectorMask-related APIs, we found an optimization
>    opportunity related to the `cpy (immediate, zeroing)` instruction [1].
>    Implementing the functionality of this instruction using `cpy (immediate,
>    merging)` instruction [2] leads to better performance.
>    
>    Currently the `cpy (imm, zeroing)` instruction is used in code generated
>    by `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization
>    potentially benefits all vector APIs that generate these two IR nodes,
>    such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
>    
>    Microbenchmarks show this change brings performance uplift ranging from
>    **11%** to **33%**, depending on the specific operation and data types.
>    
>    The specific changes in this PR:
>    1. Achieve the functionality of the `cpy (imm, zeroing)` instruction
>    with the `movi + cpy (imm, merging)` instructions in assembler:
>    ```
>    cpy  z17.d, p1/z, #1 =>
>    
>    movi v17.2d, #0       // this instruction is zero cost
>    cpy  z17.d, p1/m, #1
>    ```
>    
>    2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to
>    apply this optimization or not.
>    - This option belongs to the Arch product category.
>    - The default value is true on Neoverse-V1/V2 where the improvement
>      has been confirmed, false on others.
>    - When its value is true, the change is applied.
>    
>    3. Add a jtreg test to verify the behavior of this option.
>    
>    This PR was tested on aarch64 and x86 machines with different
>    configurations, and all tests passed.
>    
>    JMH benchmarks:
>    
>    On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>    ```
>    Benchmark	        	Unit	size	Before		Error	After		Error	Uplift
>    byteIndexInRange		ops/ms	7.00	471816.15	1125.96	473237.77	1593.92	1.00
>    byteIndexInRange		ops/ms	256.00	149654.21	416.57	149259.95	116.59	1.00
>    byteIndexInRange		ops/ms	259.00	177850.31	991.13	179785.19	1110.07	1.01
>    byteIndexInRange		ops/ms	512.00	133393.26	167.26	133484.61	281.83	1.00
>    doubleIndexInRange		ops/ms	7.00	302176.39	12848.8	299813.02	37.76	0.99
>    doubleIndexInRange		ops/ms	...

The test failure (java/lang/ProcessBuilder/PipelineLeaksFD.java) should be unrelated to this PR. This test fails intermittently, as can be seen both in this PR and in https://github.com/bradfordwetmore/jdk/actions/runs/22332773501/job/64620481585
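
For readers skimming the quoted description, the equivalence behind the `movi + cpy (imm, merging)` rewrite can be illustrated with a small scalar model in plain Java. This is only a sketch under stated assumptions: the class and method names (`CpyModes`, `cpyZeroing`, `cpyMerging`, `moviThenCpyMerging`) are hypothetical, each vector lane is modeled as one `long`, the predicate is a `boolean[]`, and the `movi` is modeled as clearing every lane. It is not the HotSpot implementation and says nothing about performance.

```java
import java.util.Arrays;

// Hypothetical scalar model of the two SVE CPY (immediate) forms.
final class CpyModes {
    // Models `cpy zd, pg/z, #imm`: active lanes get imm, inactive lanes become 0.
    static long[] cpyZeroing(boolean[] pg, long imm) {
        long[] zd = new long[pg.length];
        for (int i = 0; i < pg.length; i++) {
            zd[i] = pg[i] ? imm : 0L;
        }
        return zd;
    }

    // Models `cpy zd, pg/m, #imm`: active lanes get imm, inactive lanes keep
    // whatever value zd held before the instruction.
    static long[] cpyMerging(long[] zd, boolean[] pg, long imm) {
        long[] out = zd.clone();
        for (int i = 0; i < pg.length; i++) {
            if (pg[i]) {
                out[i] = imm;
            }
        }
        return out;
    }

    // Models the replacement sequence: clear the destination first (the movi),
    // then apply the merging form. The stale contents are ignored on purpose.
    static long[] moviThenCpyMerging(long[] stale, boolean[] pg, long imm) {
        long[] zeroed = new long[stale.length]; // Java arrays start out all zero
        return cpyMerging(zeroed, pg, imm);
    }

    public static void main(String[] args) {
        boolean[] pg = {true, false, true, false};
        long[] stale = {7L, 7L, 7L, 7L};
        long[] a = cpyZeroing(pg, 1L);
        long[] b = moviThenCpyMerging(stale, pg, 1L);
        System.out.println(Arrays.toString(a) + " equals " + Arrays.toString(b)
                + ": " + Arrays.equals(a, b));
    }
}
```

In this model the merging form alone would leave the stale value 7 in the inactive lanes, which is why the destination has to be cleared first to reproduce the zeroing behavior.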

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3949395060