RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v4]

Wed Feb 25 11:08:06 UTC 2026

On Tue, 24 Feb 2026 10:10:58 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
>> 
>>  - Revert renaming sve_cpy as sve_cpy_optimized
>>  - Merge branch 'master' into JDK-8374349-sve-cpy-opt
>>  - Refine the code comments
>>  - Move the implementation into C2_MacroAssembler
>>  - 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
>>    
>>    When optimizing some VectorMask-related APIs, we found an optimization
>>    opportunity related to the `cpy (immediate, zeroing)` instruction [1].
>>    Implementing the functionality of this instruction using `cpy (immediate,
>>    merging)` instruction [2] leads to better performance.
>>    
>>    Currently the `cpy (imm, zeroing)` instruction is used in code generated
>>    by `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization
>>    potentially benefits all vector APIs that generate these two IR nodes,
>>    such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
>>    
>>    Microbenchmarks show this change brings performance uplift ranging from
>>    **11%** to **33%**, depending on the specific operation and data types.
>>    
>>    The specific changes in this PR:
>>    1. Achieve the functionality of the `cpy (imm, zeroing)` instruction
>>    with the `movi + cpy (imm, merging)` instructions in assembler:
>>    ```
>>    cpy  z17.d, p1/z, #1 =>
>>    
>>    movi v17.2d, #0       // this instruction is zero cost
>>    cpy  z17.d, p1/m, #1
>>    ```
>>    
>>    2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to
>>    apply this optimization or not.
>>    - This option belongs to the Arch product category.
>>    - The default value is true on Neoverse-V1/V2 where the improvement
>>      has been confirmed, false on others.
>>    - When its value is true, the change is applied.
>>    
>>    3. Add a jtreg test to verify the behavior of this option.
>>    
>>    This PR was tested on aarch64 and x86 machines with different
>>    configurations, and all tests passed.
>>    
>>    JMH benchmarks:
>>    
>>    On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:
>>    ```
>>    Benchmark	        	Unit	size	Before		Error	After		Error	Uplift
>>    byteIndexInRange		ops/ms	7.00	471816.15	1125.96	473237.77	1593.92	1.00
>>    byteIndexInRange		ops/ms	256.00	149654.21	416.57	149259.95	116.59	1.00
>>    byteIndexInRange		ops/ms	259.00	177850.31	991.13	179785.19	1110.07	1.01
>>    byteIndexInRange		ops/ms	512.00	133393.26	167.26	133484.61	281.83	1.00
> ...
>
> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3846:
> 
>> 3844:   // SVE copy floating-point immediate to vector elements (predicated)
>> 3845:   void sve_cpy(FloatRegister Zd, SIMD_RegVariant T, PRegister Pg, double d) {
>> 3846:     _sve_cpy(Zd, T, Pg, checked_cast<uint8_t>(pack(d)), /*isMerge*/true, /*isFloat*/true);
> 
> I can't see any purpose in this renaming.

> Hi [@theRealAph](https://github.com/theRealAph), it's a private method,

It isn't. The `public:` declaration is right there.
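
To illustrate the language rule behind this, here is a minimal sketch. The class `DemoAssembler` and its members are hypothetical stand-ins, not the actual declarations in `assembler_aarch64.hpp`. An access specifier applies to every member that follows it until the next specifier, so a leading underscore in a name does not make a method private.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an assembler class; names are illustrative only.
class DemoAssembler {
  // Members declared before any access specifier in a class are private.
  int _emitted = 0;

public:
  // Everything from here to the next access specifier is public,
  // regardless of the leading underscore in the name.
  void _sve_cpy_demo(uint8_t imm8, bool is_merge) {
    (void)imm8;
    (void)is_merge;
    ++_emitted;  // count one "emitted" instruction
  }

  // Public wrapper that forwards to the underscore-prefixed helper.
  void sve_cpy_demo(uint8_t imm8) { _sve_cpy_demo(imm8, /*is_merge*/ true); }

  int emitted() const { return _emitted; }

private:
  // Only members after this specifier are private again.
  void reset() { _emitted = 0; }
};

// Calls both methods from outside the class; this compiles only because
// both are declared in the public section.
inline int demo_usage() {
  DemoAssembler a;
  a._sve_cpy_demo(1, /*is_merge*/ false);
  a.sve_cpy_demo(2);
  assert(a.emitted() == 2);
  return a.emitted();
}
```

By contrast, calling `a.reset()` from `demo_usage()` would be rejected by the compiler, since `reset` follows the `private:` specifier.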

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29359#discussion_r2852159061