RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
Fei Gao
fgao at openjdk.org
Wed Jan 28 04:08:08 UTC 2026
On Fri, 23 Jan 2026 10:28:46 GMT, Eric Fang <erfang at openjdk.org> wrote:
>>> > Can we do without `PreferSVEMergingModeCPY`?
>>>
>>> Thanks, @theRealAph. That’s a fair question – in general, fewer options are definitely preferable.
>>>
>>> For this change, the main reason I introduced `PreferSVEMergingModeCPY` as an Arch-level flag is that the benefit and trade-offs of using the merging-mode sequence can be quite **microarchitecture-dependent**.
>>
>> The question is not whether something may possibly be better or worse, but is it significantly so? If you're going to add another flag, you have to provide evidence that the difference matters. The burden is on you to show that it does.
>>
>>> From a user’s point of view, the default behaviour should still be sensible: the option is enabled by default on `Neoverse V1/V2`, where we’ve confirmed the improvement, and disabled elsewhere. If, over time, we gain enough confidence that the merging-mode sequence is strictly preferable across a wider range of hardware, I’m happy to follow up with a separate change to hard-wire the behaviour and drop the flag.
>>
>> That's the wrong way to think about it. There are thousands of tiny decisions we've made in the AArch64 port, and the number of possible tweaks is almost infinite. If we need a flag in the future we can add one, but then we'll do a better job once we understand what we really need.
>
> Thanks @theRealAph. I don't have test data on other aarch64 microarchitectures to show the difference with and without this optimization. I'm asking my Arm partners for help testing to see if they have more AArch64 environments.
>
> Adding a new flag for this optimization might not be appropriate. I'm thinking of handling it this way: **during JVM initialization (in `vm_version.cpp`), check the microarchitecture, set a global variable, and then use it in the assembler to determine whether to apply the optimization.**
>
> Maintaining microarchitecture-specific handling might be worthwhile, as it provides some flexibility for different and future microarchitectures. Since we can't know how future microarchitectures will behave, from a performance and code-size perspective we generally prefer a single instruction over multiple instructions. Is that fine with you?
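For concreteness, the initialization-time check described above could be sketched as follows. All names here (`MicroArch`, `CpuFeatures`, `prefer_sve_merging_cpy`, `classify`) are illustrative only, not actual HotSpot identifiers:

```cpp
#include <cstdint>

// Illustrative sketch of the vm_version.cpp idea: classify the CPU once at
// startup from its MIDR_EL1-style implementer/part identifiers, and cache a
// single flag that the assembler can later consult.
enum class MicroArch { NeoverseV1, NeoverseV2, NeoverseN2, Other };

struct CpuFeatures {
  bool prefer_sve_merging_cpy = false;  // set once during VM initialization
};

static CpuFeatures g_cpu;  // global consulted later by the assembler

// Map implementer/part numbers (as reported in MIDR_EL1) to a microarchitecture.
MicroArch classify(uint8_t implementer, uint16_t part) {
  if (implementer == 0x41) {            // Arm Ltd.
    switch (part) {
      case 0xd40: return MicroArch::NeoverseV1;
      case 0xd4f: return MicroArch::NeoverseV2;
      case 0xd49: return MicroArch::NeoverseN2;
    }
  }
  return MicroArch::Other;
}

void init_cpu_features(uint8_t implementer, uint16_t part) {
  MicroArch ua = classify(implementer, part);
  // Enable the merging-mode CPY sequence only where it is known to help.
  g_cpu.prefer_sve_merging_cpy =
      (ua == MicroArch::NeoverseV1 || ua == MicroArch::NeoverseV2);
}
```

The point of the sketch is that the decision is made once at startup, so no flag needs to be exposed to users.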
Hi @erifan,
I ran some tests on a `Neoverse N2` platform, which supports `128-bit SVE2`.
I executed `IndexInRangeBenchmark.java` with `@Warmup(iterations = 10, time = 1)` and `@Measurement(iterations = 10, time = 2)`, comparing runs with `PreferSVEMergingModeCPY` enabled and disabled in your patch. The `true/false` column below is the throughput ratio (enabled / disabled), so values above 1.00 favor the optimization.
Benchmark                                 (size)  Mode   Cnt  Unit    true/false
IndexInRangeBenchmark.byteIndexInRange         7  thrpt   10  ops/ms  0.95
IndexInRangeBenchmark.byteIndexInRange       256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.byteIndexInRange       259  thrpt   10  ops/ms  1.02
IndexInRangeBenchmark.byteIndexInRange       512  thrpt   10  ops/ms  1.01
IndexInRangeBenchmark.doubleIndexInRange       7  thrpt   10  ops/ms  0.96
IndexInRangeBenchmark.doubleIndexInRange     256  thrpt   10  ops/ms  1.02
IndexInRangeBenchmark.doubleIndexInRange     259  thrpt   10  ops/ms  1.10
IndexInRangeBenchmark.doubleIndexInRange     512  thrpt   10  ops/ms  1.01
IndexInRangeBenchmark.floatIndexInRange        7  thrpt   10  ops/ms  0.96
IndexInRangeBenchmark.floatIndexInRange      256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.floatIndexInRange      259  thrpt   10  ops/ms  1.06
IndexInRangeBenchmark.floatIndexInRange      512  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.intIndexInRange          7  thrpt   10  ops/ms  0.96
IndexInRangeBenchmark.intIndexInRange        256  thrpt   10  ops/ms  0.98
IndexInRangeBenchmark.intIndexInRange        259  thrpt   10  ops/ms  1.04
IndexInRangeBenchmark.intIndexInRange        512  thrpt   10  ops/ms  0.98
IndexInRangeBenchmark.longIndexInRange         7  thrpt   10  ops/ms  1.01
IndexInRangeBenchmark.longIndexInRange       256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.longIndexInRange       259  thrpt   10  ops/ms  0.92
IndexInRangeBenchmark.longIndexInRange       512  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.shortIndexInRange        7  thrpt   10  ops/ms  0.95
IndexInRangeBenchmark.shortIndexInRange      256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.shortIndexInRange      259  thrpt   10  ops/ms  0.96
IndexInRangeBenchmark.shortIndexInRange      512  thrpt   10  ops/ms  1.00
The uplift in `doubleIndexInRange` and `floatIndexInRange` appears significant, but for the other data types the JMH results are quite noisy. To factor out that noise, I also ran some simple C microbenchmark loops directly.
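The loops were of roughly this shape (a reconstruction for illustration, not the exact benchmark source): each lane conditionally receives an immediate 1, which under SVE autovectorization becomes a `cmpne` followed by a predicated `cpy` of the immediate.

```cpp
#include <cstdint>

// Reconstruction of the microbenchmark kernel: the conditional "+ 1" forces
// the compiler to emit a compare (cmpne) plus a predicated copy of the
// immediate 1 (sve cpy), followed by a vector add.
void index_in_range_like(const int64_t* a, int64_t* b, int n) {
  for (int i = 0; i < n; i++) {
    b[i] = a[i] + (a[i] != 0 ? 1 : 0);
  }
}
```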
For `zeroing immediate` mode, the assembly generated from the C loop looks like:
.L3:
ld1d z1.d, p0/z, [x3]
add w0, w0, 1
cmpne p1.d, p0/z, z1.d, #0
mov z0.d, p1/z, #1
add z0.d, z0.d, z1.d
st1d z0.d, p0, [x2]
cmp w1, w0
bne .L3
For `merging immediate` mode, the generated code is:
.L12:
ld1d z1.d, p0/z, [x3]
mov z0.d, x0
cmpne p1.d, p0/z, z1.d, #0
add x0, x0, 1
mov z0.d, p1/m, #1
add z0.d, z0.d, z1.d
st1d z0.d, p0, [x2]
cmp w1, w0
bgt .L12
The `mov + merging immediate` variant performs noticeably better than the `zeroing immediate` version. Based on this, I believe your change should also benefit the `Neoverse N2` platform.
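As background on what the two encodings do: in zeroing form (`/z`) inactive lanes are cleared to zero, while in merging form (`/m`) inactive lanes keep the destination's previous contents. A scalar model of the semantics (illustrative only, with a fixed 4-lane vector):

```cpp
#include <array>
#include <cstdint>

// Scalar model of SVE CPY (copy an immediate to active lanes), showing the
// difference between zeroing (/z) and merging (/m) predication.
// A fixed 4-lane vector stands in for a real scalable register.
using Vec  = std::array<int64_t, 4>;
using Pred = std::array<bool, 4>;

Vec cpy(Vec zd, Pred pg, int64_t imm, bool merge) {
  for (int lane = 0; lane < 4; lane++) {
    if (pg[lane]) {
      zd[lane] = imm;   // active lanes always receive the immediate
    } else if (!merge) {
      zd[lane] = 0;     // zeroing mode clears inactive lanes...
    }                   // ...merging mode preserves their old values
  }
  return zd;
}
```

The merging form therefore needs the destination to hold defined contents first, which is exactly why the patch's sequence pairs it with a zero-cost `movi`.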
We suspect that the minor regressions observed in some JMH cases (e.g., `longIndexInRange`) may be due to code layout or alignment effects, since enabling the optimization introduces an extra `mov`. See Section 4.9, "Branch instruction alignment", of the Neoverse N2 Software Optimization Guide: https://developer.arm.com/documentation/109914/0500/?lang=en.
To test this hypothesis, I made a small experimental modification to your patch, adding an extra `nop` on the existing `zeroing` mode path so that both paths emit the same number of instructions:
diff --git a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
index 4dd19574f30..295a763f260 100644
--- a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
+++ b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
@@ -3846,6 +3846,8 @@ template<typename R, typename... Rx>
     // According to the Arm Software Optimization Guide, `movi` is zero cost.
     movi(Zd, T2D, 0);
     isMerge = true;
+  } else if (!isMerge) {
+    nop();
   }
   sve_cpy(Zd, T, Pg, imm8, isMerge, /*isFloat*/false);
 }
After rerunning the JMH benchmarks, the results appeared more stable:
Benchmark                                 (size)  Mode   Cnt  Unit    true/false
IndexInRangeBenchmark.byteIndexInRange         7  thrpt   10  ops/ms  0.98
IndexInRangeBenchmark.byteIndexInRange       256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.byteIndexInRange       259  thrpt   10  ops/ms  0.99
IndexInRangeBenchmark.byteIndexInRange       512  thrpt   10  ops/ms  0.99
IndexInRangeBenchmark.doubleIndexInRange       7  thrpt   10  ops/ms  0.97
IndexInRangeBenchmark.doubleIndexInRange     256  thrpt   10  ops/ms  1.01
IndexInRangeBenchmark.doubleIndexInRange     259  thrpt   10  ops/ms  1.13
IndexInRangeBenchmark.doubleIndexInRange     512  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.floatIndexInRange        7  thrpt   10  ops/ms  0.99
IndexInRangeBenchmark.floatIndexInRange      256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.floatIndexInRange      259  thrpt   10  ops/ms  1.07
IndexInRangeBenchmark.floatIndexInRange      512  thrpt   10  ops/ms  1.01
IndexInRangeBenchmark.intIndexInRange          7  thrpt   10  ops/ms  1.04
IndexInRangeBenchmark.intIndexInRange        256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.intIndexInRange        259  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.intIndexInRange        512  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.longIndexInRange         7  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.longIndexInRange       256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.longIndexInRange       259  thrpt   10  ops/ms  1.04
IndexInRangeBenchmark.longIndexInRange       512  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.shortIndexInRange        7  thrpt   10  ops/ms  0.99
IndexInRangeBenchmark.shortIndexInRange      256  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.shortIndexInRange      259  thrpt   10  ops/ms  1.00
IndexInRangeBenchmark.shortIndexInRange      512  thrpt   10  ops/ms  1.00
Overall, these results suggest that the patch should also be beneficial on `Neoverse N2`.
Thanks for your work!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3804576835