RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction

Fri Jan 23 10:31:45 UTC 2026

On Fri, 23 Jan 2026 09:24:31 GMT, Andrew Haley <aph at openjdk.org> wrote:

>>> Can we do without `PreferSVEMergingModeCPY`?
>> 
>> Thanks, @theRealAph. That’s a fair question – in general, fewer options are definitely preferable.
>> 
>> For this change, the main reason I introduced `PreferSVEMergingModeCPY` as an Arch-level flag is that the benefit and trade-offs of using the merging-mode sequence can be quite **microarchitecture-dependent**. At the moment, I’ve only been able to systematically evaluate it on `Neoverse V1/V2`, where the optimization is consistently neutral or positive in our VectorMask-related benchmarks, so it feels safe to keep it enabled by default there. On other AArch64 SVE implementations, we don’t yet have the same level of data. Keeping this knob gives us two advantages:
>> 
>> 1. It provides a simple, low-friction escape hatch if some future core shows an unexpected regression with the merging-mode sequence.
>> 2. It allows us (or downstream distributions) to selectively enable/disable the optimization per platform without needing to change the generated code shape again.
>> 
>> From a user’s point of view, the default behaviour should still be sensible: the option is enabled by default on `Neoverse V1/V2`, where we’ve confirmed the improvement, and disabled elsewhere. If, over time, we gain enough confidence that the merging-mode sequence is strictly preferable across a wider range of hardware, I’m happy to follow up with a separate change to hard-wire the behaviour and drop the flag.
>
>> > Can we do without `PreferSVEMergingModeCPY`?
>> 
>> Thanks, @theRealAph. That’s a fair question – in general, fewer options are definitely preferable.
>> 
>> For this change, the main reason I introduced `PreferSVEMergingModeCPY` as an Arch-level flag is that the benefit and trade-offs of using the merging-mode sequence can be quite **microarchitecture-dependent**. 
> 
> The question is not whether something may possibly be better or worse, but is it significantly so? If you're going to add another flag, you have to provide evidence that the difference matters. The burden is on you to show that it does.
> 
>> From a user’s point of view, the default behaviour should still be sensible: the option is enabled by default on `Neoverse V1/V2`, where we’ve confirmed the improvement, and disabled elsewhere. If, over time, we gain enough confidence that the merging-mode sequence is strictly preferable across a wider range of hardware, I’m happy to follow up with a separate change to hard-wire the behaviour and drop the flag.
> 
> That's the wrong way to think about it. There are thousands of tiny decisions we've made in the AArch64 port, and the number of possible tweaks is almost infinite. If we need a flag in the future we can add one, but then we'll do a better job once we understand what we really need.

Thanks, @theRealAph. I don't have test data from other AArch64 microarchitectures showing the difference with and without this optimization. I'm asking my Arm partners whether they have more AArch64 environments and can help with testing.

Adding a new flag for this optimization might not be appropriate. I'm thinking of handling it this way: **during JVM initialization (in `vm_version.cpp`), check the microarchitecture, set a global variable, and then use it in the assembler to determine whether to apply the optimization.**
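
To make the idea concrete, here is a rough, standalone sketch. It is not HotSpot code: every name in it (`init_cpu_features`, `g_prefer_sve_merging_cpy`, `emit_mask_to_vector`) is hypothetical, the part numbers are the values I assume for Neoverse V1/V2, and the instruction strings only illustrate the two code shapes (a single zeroing-mode `cpy` versus a zeroing move followed by a merging-mode `cpy`).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch only; none of these names exist in HotSpot.
// Assumed MIDR part numbers for Neoverse V1 and V2.
constexpr uint32_t kNeoverseV1 = 0xd40;
constexpr uint32_t kNeoverseV2 = 0xd4f;

// Global decision, set once during (simulated) VM initialization.
static bool g_prefer_sve_merging_cpy = false;

// Stand-in for the CPU detection step done at JVM startup.
void init_cpu_features(uint32_t cpu_part) {
  g_prefer_sve_merging_cpy =
      (cpu_part == kNeoverseV1 || cpu_part == kNeoverseV2);
}

// Stand-in for the assembler: returns the chosen sequence as text.
std::vector<std::string> emit_mask_to_vector() {
  if (g_prefer_sve_merging_cpy) {
    // Merging-mode shape: zero the destination, then merge under the predicate.
    return {"mov z0.b, #0", "cpy z0.b, p0/m, #1"};
  }
  // Default shape: a single zeroing-mode instruction.
  return {"cpy z0.b, p0/z, #1"};
}
```

With this shape there is no user-visible option: calling `init_cpu_features` with a V1/V2 part number selects the two-instruction sequence, and any other core keeps the single instruction.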

Keeping the microarchitecture-specific handling might still be worthwhile, since it leaves some flexibility for different or future microarchitectures. Because we don't know how future microarchitectures will behave, we generally prefer a single instruction over multiple instructions by default, from both a performance and a code size perspective. Is that fine with you?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3789562233