RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction

Fei Gao fgao at openjdk.org
Wed Jan 28 04:08:08 UTC 2026


On Fri, 23 Jan 2026 10:28:46 GMT, Eric Fang <erfang at openjdk.org> wrote:

>>> > Can we do without `PreferSVEMergingModeCPY`?
>>> 
>>> Thanks, @theRealAph. That’s a fair question – in general, fewer options are definitely preferable.
>>> 
>>> For this change, the main reason I introduced `PreferSVEMergingModeCPY` as an Arch-level flag is that the benefit and trade-offs of using the merging-mode sequence can be quite **microarchitecture-dependent**. 
>> 
>> The question is not whether something may possibly be better or worse, but is it significantly so? If you're going to add another flag, you have to provide evidence that the difference matters. The burden is on you to show that it does.
>> 
>>> From a user’s point of view, the default behaviour should still be sensible: the option is enabled by default on `Neoverse V1/V2`, where we’ve confirmed the improvement, and disabled elsewhere. If, over time, we gain enough confidence that the merging-mode sequence is strictly preferable across a wider range of hardware, I’m happy to follow up with a separate change to hard-wire the behaviour and drop the flag.
>> 
>> That's the wrong way to think about it. There are thousands of tiny decisions we've made in the AArch64 port, and the number of possible tweaks is almost infinite.  If we need a flag in the future we can add one, but then we'll do a better job once we understand what we really need.
>
> Thanks @theRealAph. I don't have test data on other aarch64 microarchitectures to show the difference with and without this optimization. I'm asking my Arm partners for help with testing, to see if they have more AArch64 environments.
> 
> Adding a new flag for this optimization might not be appropriate. I'm thinking of handling it this way: **during JVM initialization (in `vm_version.cpp`), check the microarchitecture, set a global variable, and then use it in the assembler to determine whether to apply the optimization.**
> 
> Maintaining microarchitecture-specific handling might be worthwhile. It provides some flexibility for different and future microarchitectures. Since we're unsure how future microarchitectures will behave, from a performance and code-size perspective we generally prefer a single instruction over multiple instructions. Does that sound fine to you?

Hi @erifan ,

I ran some tests on a `Neoverse N2` platform, which supports `128-bit SVE2`.

I executed `IndexInRangeBenchmark.java` with `@Warmup(iterations = 10, time = 1)` and `@Measurement(iterations = 10, time = 2)`, comparing runs with `PreferSVEMergingModeCPY` enabled and disabled in your patch.

Benchmark                                 (size)    Mode    Cnt   Unit         true/false
IndexInRangeBenchmark.byteIndexInRange      7      thrpt    10    ops/ms        0.95
IndexInRangeBenchmark.byteIndexInRange      256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.byteIndexInRange      259    thrpt    10    ops/ms        1.02
IndexInRangeBenchmark.byteIndexInRange      512    thrpt    10    ops/ms        1.01
IndexInRangeBenchmark.doubleIndexInRange    7      thrpt    10    ops/ms        0.96
IndexInRangeBenchmark.doubleIndexInRange    256    thrpt    10    ops/ms        1.02
IndexInRangeBenchmark.doubleIndexInRange    259    thrpt    10    ops/ms        1.10
IndexInRangeBenchmark.doubleIndexInRange    512    thrpt    10    ops/ms        1.01
IndexInRangeBenchmark.floatIndexInRange     7      thrpt    10    ops/ms        0.96
IndexInRangeBenchmark.floatIndexInRange     256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.floatIndexInRange     259    thrpt    10    ops/ms        1.06
IndexInRangeBenchmark.floatIndexInRange     512    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.intIndexInRange       7      thrpt    10    ops/ms        0.96
IndexInRangeBenchmark.intIndexInRange       256    thrpt    10    ops/ms        0.98
IndexInRangeBenchmark.intIndexInRange       259    thrpt    10    ops/ms        1.04
IndexInRangeBenchmark.intIndexInRange       512    thrpt    10    ops/ms        0.98
IndexInRangeBenchmark.longIndexInRange      7      thrpt    10    ops/ms        1.01
IndexInRangeBenchmark.longIndexInRange      256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.longIndexInRange      259    thrpt    10    ops/ms        0.92
IndexInRangeBenchmark.longIndexInRange      512    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.shortIndexInRange     7      thrpt    10    ops/ms        0.95
IndexInRangeBenchmark.shortIndexInRange     256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.shortIndexInRange     259    thrpt    10    ops/ms        0.96
IndexInRangeBenchmark.shortIndexInRange     512    thrpt    10    ops/ms        1.00

The uplift in `doubleIndexInRange` and `floatIndexInRange` appears significant. For the other data types, however, the JMH results are quite noisy.

To better isolate JMH noise, I also ran some simple C microbenchmark loops directly.
For the `zeroing immediate` mode, the assembly generated from the C loop looks like:

.L3:
        ld1d    z1.d, p0/z, [x3]
        add     w0, w0, 1
        cmpne   p1.d, p0/z, z1.d, #0
        mov     z0.d, p1/z, #1
        add     z0.d, z0.d, z1.d
        st1d    z0.d, p0, [x2]
        cmp     w1, w0
        bne     .L3

For `merging immediate` mode, the generated code is:

.L12:
        ld1d    z1.d, p0/z, [x3]
        mov     z0.d, x0
        cmpne   p1.d, p0/z, z1.d, #0
        add     x0, x0, 1
        mov     z0.d, p1/m, #1
        add     z0.d, z0.d, z1.d
        st1d    z0.d, p0, [x2]
        cmp     w1, w0
        bgt     .L12

The `mov + merging immediate` variant performs noticeably better than the `zeroing immediate` version. Based on this, I believe your change should also benefit the `Neoverse N2` platform.
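
For context, the two sequences above correspond to a scalar C loop of roughly the following shape. This is a sketch I reconstructed from the assembly, not the exact microbenchmark source, and the function name is made up:

```c
#include <stdint.h>

// Sketch of the loop behind the SVE sequences above: each element gets
// +1 added iff it is non-zero. The predicated select of the immediate #1
// is what vectorizes into a cmpne + CPY (mov z, p/z or mov z, p/m),
// followed by a vector add and a predicated store.
void kernel(int64_t *out, const int64_t *in, int n) {
  for (int i = 0; i < n; i++) {
    out[i] = in[i] + (in[i] != 0 ? 1 : 0);
  }
}
```

Depending on the compiler and options (e.g. `-O3` with SVE enabled), either the zeroing or the merging form of the CPY may be emitted for the predicated select.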

We suspect that the minor regressions observed in some JMH cases (e.g., `longIndexInRange`) may be due to code layout or alignment effects, since enabling the optimization introduces an extra `mov`. See Section 4.9 (Branch instruction alignment) of the Neoverse N2 Software Optimization Guide: https://developer.arm.com/documentation/109914/0500/?lang=en.

To test this hypothesis, I made a small experimental modification to your patch by adding an extra `nop` to the existing `zeroing` mode:

diff --git a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
index 4dd19574f30..295a763f260 100644
--- a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
+++ b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
@@ -3846,6 +3846,8 @@ template<typename R, typename... Rx>
       // According to the Arm Software Optimization Guide, `movi` is zero cost.
       movi(Zd, T2D, 0);
       isMerge = true;
+    } else if (!isMerge) {
+      nop();
     }
     sve_cpy(Zd, T, Pg, imm8, isMerge, /*isFloat*/false);
   }

After rerunning the JMH benchmarks, the results appeared more stable:

Benchmark                                 (size)    Mode    Cnt    Unit        true/false
IndexInRangeBenchmark.byteIndexInRange      7      thrpt    10    ops/ms        0.98
IndexInRangeBenchmark.byteIndexInRange      256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.byteIndexInRange      259    thrpt    10    ops/ms        0.99
IndexInRangeBenchmark.byteIndexInRange      512    thrpt    10    ops/ms        0.99
IndexInRangeBenchmark.doubleIndexInRange    7      thrpt    10    ops/ms        0.97
IndexInRangeBenchmark.doubleIndexInRange    256    thrpt    10    ops/ms        1.01
IndexInRangeBenchmark.doubleIndexInRange    259    thrpt    10    ops/ms        1.13
IndexInRangeBenchmark.doubleIndexInRange    512    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.floatIndexInRange     7      thrpt    10    ops/ms        0.99
IndexInRangeBenchmark.floatIndexInRange     256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.floatIndexInRange     259    thrpt    10    ops/ms        1.07
IndexInRangeBenchmark.floatIndexInRange     512    thrpt    10    ops/ms        1.01
IndexInRangeBenchmark.intIndexInRange       7      thrpt    10    ops/ms        1.04
IndexInRangeBenchmark.intIndexInRange       256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.intIndexInRange       259    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.intIndexInRange       512    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.longIndexInRange      7      thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.longIndexInRange      256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.longIndexInRange      259    thrpt    10    ops/ms        1.04
IndexInRangeBenchmark.longIndexInRange      512    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.shortIndexInRange     7      thrpt    10    ops/ms        0.99
IndexInRangeBenchmark.shortIndexInRange     256    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.shortIndexInRange     259    thrpt    10    ops/ms        1.00
IndexInRangeBenchmark.shortIndexInRange     512    thrpt    10    ops/ms        1.00

Overall, these results suggest that the patch should also be beneficial on `Neoverse N2`.
Thanks for your work!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3804576835