RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v4]
Andrew Haley
aph at openjdk.org
Wed Feb 25 11:08:06 UTC 2026
On Tue, 24 Feb 2026 10:10:58 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
>>
>> - Revert renaming sve_cpy as sve_cpy_optimized
>> - Merge branch 'master' into JDK-8374349-sve-cpy-opt
>> - Refine the code comments
>> - Move the implementation into C2_MacroAssembler
>> - 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
>>
>> When optimizing some VectorMask-related APIs, we found an optimization
>> opportunity related to the `cpy (immediate, zeroing)` instruction [1].
>> Implementing the functionality of this instruction using `cpy (immediate,
>> merging)` instruction [2] leads to better performance.
>>
>> Currently, the `cpy (imm, zeroing)` instruction is used in code generated
>> by `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization
>> potentially benefits all vector APIs that generate these two IR nodes,
>> such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
>>
>> Microbenchmarks show this change brings a performance uplift ranging from
>> **11%** to **33%**, depending on the specific operation and data types.
>>
>> The specific changes in this PR:
>> 1. Achieve the functionality of the `cpy (imm, zeroing)` instruction
>> with the `movi + cpy (imm, merging)` instructions in assembler:
>> ```
>> cpy z17.d, p1/z, #1 =>
>>
>> movi v17.2d, #0 // this instruction is zero cost
>> cpy z17.d, p1/m, #1
>> ```
>>
>> 2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to
>> apply this optimization or not.
>> - This option belongs to the Arch product category.
>> - The default value is true on Neoverse-V1/V2, where the improvement
>> has been confirmed, and false on other CPUs.
>> - When its value is true, the change is applied.
>>
>> 3. Add a jtreg test to verify the behavior of this option.
>>
>> This PR was tested on aarch64 and x86 machines with different
>> configurations, and all tests passed.
>>
>> JMH benchmarks:
>>
>> On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:
>> ```
>> Benchmark         Unit    size    Before     Error    After      Error    Uplift
>> byteIndexInRange  ops/ms  7.00    471816.15  1125.96  473237.77  1593.92  1.00
>> byteIndexInRange  ops/ms  256.00  149654.21  416.57   149259.95  116.59   1.00
>> byteIndexInRange  ops/ms  259.00  177850.31  991.13   179785.19  1110.07  1.01
>> byteIndexInRange  ops/ms  512.00  133393.26  167.26   133484.61  281.83   1.00
> ...
>
> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3846:
>
>> 3844: // SVE copy floating-point immediate to vector elements (predicated)
>> 3845: void sve_cpy(FloatRegister Zd, SIMD_RegVariant T, PRegister Pg, double d) {
>> 3846: _sve_cpy(Zd, T, Pg, checked_cast<uint8_t>(pack(d)), /*isMerge*/true, /*isFloat*/true);
>
> I can't see any purpose in this renaming.
> Hi [@theRealAph](https://github.com/theRealAph), it's a private method,

It isn't. The `public:` declaration is right there.
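
For reference, the transformation in the quoted PR description can be sketched as a scalar model. This is only an illustration under stated assumptions: two 64-bit lanes (a 128-bit vector), and the names `Vec`, `Pred`, `cpy_zeroing`, `cpy_merging`, `movi_zero`, `cpy_zeroing_via_merging` and `forms_agree` are all hypothetical, not HotSpot code. It models lane semantics only and says nothing about performance.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model: a 128-bit vector viewed as two 64-bit (.d) lanes.
constexpr std::size_t kLanes = 2;
using Vec  = std::array<std::int64_t, kLanes>;
using Pred = std::array<bool, kLanes>;

// Models cpy (immediate, zeroing): active lanes get imm, inactive lanes become 0.
Vec cpy_zeroing(const Pred& pg, std::int64_t imm) {
    Vec zd{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        zd[i] = pg[i] ? imm : 0;
    }
    return zd;
}

// Models cpy (immediate, merging): active lanes get imm, inactive lanes keep
// whatever the destination already held.
Vec cpy_merging(Vec zd, const Pred& pg, std::int64_t imm) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (pg[i]) {
            zd[i] = imm;
        }
    }
    return zd;
}

// Models movi with a zero immediate: every lane of the destination is cleared.
Vec movi_zero() {
    return Vec{};
}

// The two-instruction sequence from the PR description: movi, then merging cpy.
Vec cpy_zeroing_via_merging(const Pred& pg, std::int64_t imm) {
    return cpy_merging(movi_zero(), pg, imm);
}

// Compares both forms over every possible predicate value for kLanes lanes.
bool forms_agree(std::int64_t imm) {
    for (unsigned bits = 0; bits < (1u << kLanes); ++bits) {
        Pred pg{};
        for (std::size_t i = 0; i < kLanes; ++i) {
            pg[i] = ((bits >> i) & 1u) != 0;
        }
        if (cpy_zeroing(pg, imm) != cpy_zeroing_via_merging(pg, imm)) {
            return false;
        }
    }
    return true;
}
```

In this model, calling `forms_agree(1)` checks all four predicate values for the `#1` immediate used in the quoted example; the merging form matches the zeroing form because the destination is cleared first.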
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/29359#discussion_r2852159061