RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
Eric Fang
erfang at openjdk.org
Wed Jan 28 04:08:09 UTC 2026
On Tue, 27 Jan 2026 11:15:55 GMT, Fei Gao <fgao at openjdk.org> wrote:
>> Thanks @theRealAph. I don't have test data on other AArch64 microarchitectures to show the difference with and without this optimization. I'm asking my Arm partners for help with testing, to see whether they have access to more AArch64 environments.
>>
>> Adding a new flag for this optimization might not be appropriate. I'm thinking of handling it this way: **during JVM initialization (in `vm_version.cpp`), check the microarchitecture, set a global variable, and then use it in the assembler to determine whether to apply the optimization.**
>>
>> Maintaining microarchitecture-specific handling might be worthwhile, as it provides some flexibility for different and future microarchitectures. Since we're unsure how future microarchitectures will behave, from a performance and code-size perspective we generally prefer a single instruction over multiple instructions. Is that fine with you?
>
> Hi @erifan ,
>
> I ran some tests on a `Neoverse N2` platform, which supports `128-bit SVE2`.
>
> I executed `IndexInRangeBenchmark.java` with `@Warmup(iterations = 10, time = 1)` and `@Measurement(iterations = 10, time = 2)`, comparing runs with `PreferSVEMergingModeCPY` enabled and disabled in your patch.
>
> Benchmark                                  (size)   Mode  Cnt    Unit  true/false
> IndexInRangeBenchmark.byteIndexInRange          7  thrpt   10  ops/ms        0.95
> IndexInRangeBenchmark.byteIndexInRange        256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.byteIndexInRange        259  thrpt   10  ops/ms        1.02
> IndexInRangeBenchmark.byteIndexInRange        512  thrpt   10  ops/ms        1.01
> IndexInRangeBenchmark.doubleIndexInRange        7  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.doubleIndexInRange      256  thrpt   10  ops/ms        1.02
> IndexInRangeBenchmark.doubleIndexInRange      259  thrpt   10  ops/ms        1.10
> IndexInRangeBenchmark.doubleIndexInRange      512  thrpt   10  ops/ms        1.01
> IndexInRangeBenchmark.floatIndexInRange         7  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.floatIndexInRange       256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.floatIndexInRange       259  thrpt   10  ops/ms        1.06
> IndexInRangeBenchmark.floatIndexInRange       512  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.intIndexInRange           7  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.intIndexInRange         256  thrpt   10  ops/ms        0.98
> IndexInRangeBenchmark.intIndexInRange         259  thrpt   10  ops/ms        1.04
> IndexInRangeBenchmark.intIndexInRange         512  thrpt   10  ops/ms        0.98
> IndexInRangeBenchmark.longIndexInRange          7  thrpt   10  ops/ms        1.01
> IndexInRangeBenchmark.longIndexInRange        256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.longIndexInRange        259  thrpt   10  ops/ms        0.92
> IndexInRangeBenchmark.longIndexInRange        512  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.shortIndexInRange         7  thrpt   10  ops/ms        0.95
> IndexInRangeBenchmark.shortIndexInRange       256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.shortIndexInRange       259  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.shortIndexInRange       512  thrpt   10  ops/ms        1.00
>
> The uplift in `doubleIndexInRange` and `floatIndexInRange` appear...
@fg1417 Thanks for your help, this is really helpful!

You've also noticed slight regressions in a few cases, which is expected. The effect of this optimization is influenced by multiple factors, such as the alignment you mentioned on N2, as well as code generation and register allocation. The underlying principle is that the latency of the `cpy(imm, zeroing)` instruction seems to be quite high, while the `movi + cpy(imm, merging)` combination improves the instruction-level parallelism of the program. In some cases, a `mov` or another instruction with the same effect is already generated before the `cpy(imm, zeroing)` instruction, so those cases already get the benefit of the `movi + cpy(imm, merging)` combination; there, the extra `movi` is pure overhead, and the slight regression is expected. However, in cases where this optimization does apply, the improvement can be much larger. For example, on Neoverse-V2 I even saw a **2x** improvement with the benchmark shown after the sketch below.
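To make the two code sequences concrete, here is an assembler-level sketch. This is illustrative pseudocode, not the patch itself: the helper names only approximate HotSpot's AArch64 assembler API, and the register and operand choices are mine.

```c++
// Sketch: emit a predicated splat of an immediate into dst, governed by pg.
// T is the element size; prefer_merging selects the two-instruction form.
if (prefer_merging) {
  // An Advanced SIMD write zeroes the upper bits of the Z register, so this
  // movi clears all of dst. It has no dependency on pg, so it can issue in
  // parallel with whatever computes the predicate.
  __ movi(dst, __ T16B, 0);                         // dst = all zeroes
  __ sve_cpy(dst, T, pg, imm, /* isMerge */ true);  // active lanes = imm
} else {
  // One instruction, but it depends on pg, and its latency appears to be
  // high on Neoverse-V2.
  __ sve_cpy(dst, T, pg, imm, /* isMerge */ false); // inactive lanes zeroed
}
```

The benchmark case: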
```java
@Param({"128"})
private int loop_iteration;

private static final VectorSpecies<Integer> ispecies =
        VectorSpecies.ofLargestShape(int.class);

private boolean[] mask_arr;

@Setup(Level.Trial)
public void BmSetup() {
    int array_size = loop_iteration * ispecies.length();
    mask_arr = new boolean[array_size];
    Random r = new Random();
    for (int i = 0; i < array_size; i++) {
        mask_arr[i] = r.nextBoolean();
    }
}

@CompilerControl(CompilerControl.Mode.INLINE)
private <E> long testIndexInRangeToLongKernel(VectorSpecies<E> species) {
    long sum = 0;
    VectorMask<E> m = VectorMask.fromArray(species, mask_arr, 0);
    for (int i = 0; i < loop_iteration; i++) {
        sum += m.indexInRange(i & (m.length() - 1), m.length()).toLong();
    }
    return sum;
}

@Benchmark
public long indexInRangeToLongInt() {
    return testIndexInRangeToLongKernel(ispecies);
}
```
Therefore, when you test this change with the case above, you should see a significant performance improvement.
> I see 2% uplift on these numbers.
@theRealAph I think this also answers your question about these numbers.
> One thing you can do is add a flag to control this minor optimization, but make it constexpr bool = true until we know what other SVE implementations might do.
> In general:
> Dead code in HotSpot is dangerous. Over time it rots, leading to hard-to-diagnose bugs if it ever does get exercised. If we don't know that we need something, we shouldn't do it. Don't speculatively add code. We don't need the AArch64 port to become (even more of) a jungle of microarchitecture-specific tweaks.
Yeah, I agree with you: we shouldn't add dead code that makes the project increasingly difficult to maintain. I tried adding a user-invisible **constexpr** global variable, `constexpr bool _prefer_merging_mode_sve_cpy = true`, to the `VM_Version` class, like [_max_supported_sve_vector_length](https://github.com/openjdk/jdk/blob/fa1b1d677ac492dfdd3110b9303a4c2b009046c8/src/hotspot/cpu/aarch64/vm_version_aarch64.hpp#L53). But then we would have to delete the assembler instruction tests for [cpy(imm, zeroing)](https://github.com/openjdk/jdk/blob/fa1b1d677ac492dfdd3110b9303a4c2b009046c8/test/hotspot/gtest/aarch64/aarch64-asmtest.py#L1958), because the `cpy(imm, zeroing)` instruction would now always be emitted as the two-instruction sequence `movi + cpy(imm, merging)`. So I'm inclined to make this global variable **non-constexpr**, so we can set it to false while running that test and restore it afterwards. Is that fine with you?
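For illustration, a minimal sketch of what the non-constexpr variant could look like. The field name follows this discussion, but the exact declaration, accessor, and initialization are my assumptions, modeled on how `_max_supported_sve_vector_length` is exposed:

```c++
// vm_version_aarch64.hpp (sketch, not a committed patch)
class VM_Version : public Abstract_VM_Version {
  // Non-constexpr, so the asmtest/gtest harness can temporarily set it to
  // false, verify the raw cpy(imm, zeroing) encoding, and restore it.
  // It could also be set per microarchitecture during VM initialization if
  // other SVE implementations turn out to behave differently.
  static bool _prefer_merging_mode_sve_cpy;

 public:
  static bool prefer_merging_mode_sve_cpy() {
    return _prefer_merging_mode_sve_cpy;
  }
};

// vm_version_aarch64.cpp (sketch)
bool VM_Version::_prefer_merging_mode_sve_cpy = true;
```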
-------------
PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3808829111