RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction
Eric Fang
erfang at openjdk.org
Wed Jan 28 04:08:09 UTC 2026
On Tue, 27 Jan 2026 11:15:55 GMT, Fei Gao <fgao at openjdk.org> wrote:
>> Thanks @theRealAph. I don't have test data on other AArch64 microarchitectures to show the difference with and without this optimization. I'm asking my Arm partners for help with testing, to see whether they have access to more AArch64 environments.
>>
>> Adding a new flag for this optimization might not be appropriate. I'm thinking of handling it this way: **during JVM initialization (in `vm_version.cpp`), check the microarchitecture, set a global variable, and then use it in the assembler to determine whether to apply the optimization.**
>>
>> Maintaining microarchitecture-specific handling might be worthwhile, as it provides some flexibility for different and future microarchitectures. Since we're unsure how future microarchitectures will behave, from a performance and code-size perspective we generally prefer a single instruction over multiple instructions. Is that fine with you?
>
> Hi @erifan ,
>
> I ran some tests on a `Neoverse N2` platform, which supports `128-bit SVE2`.
>
> I executed `IndexInRangeBenchmark.java` with `@Warmup(iterations = 10, time = 1)` and `@Measurement(iterations = 10, time = 2)`, comparing runs with `PreferSVEMergingModeCPY` enabled and disabled in your patch.
>
> Benchmark                                  (size)   Mode  Cnt    Unit  true/false
> IndexInRangeBenchmark.byteIndexInRange          7  thrpt   10  ops/ms        0.95
> IndexInRangeBenchmark.byteIndexInRange        256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.byteIndexInRange        259  thrpt   10  ops/ms        1.02
> IndexInRangeBenchmark.byteIndexInRange        512  thrpt   10  ops/ms        1.01
> IndexInRangeBenchmark.doubleIndexInRange        7  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.doubleIndexInRange      256  thrpt   10  ops/ms        1.02
> IndexInRangeBenchmark.doubleIndexInRange      259  thrpt   10  ops/ms        1.10
> IndexInRangeBenchmark.doubleIndexInRange      512  thrpt   10  ops/ms        1.01
> IndexInRangeBenchmark.floatIndexInRange         7  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.floatIndexInRange       256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.floatIndexInRange       259  thrpt   10  ops/ms        1.06
> IndexInRangeBenchmark.floatIndexInRange       512  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.intIndexInRange           7  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.intIndexInRange         256  thrpt   10  ops/ms        0.98
> IndexInRangeBenchmark.intIndexInRange         259  thrpt   10  ops/ms        1.04
> IndexInRangeBenchmark.intIndexInRange         512  thrpt   10  ops/ms        0.98
> IndexInRangeBenchmark.longIndexInRange          7  thrpt   10  ops/ms        1.01
> IndexInRangeBenchmark.longIndexInRange        256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.longIndexInRange        259  thrpt   10  ops/ms        0.92
> IndexInRangeBenchmark.longIndexInRange        512  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.shortIndexInRange         7  thrpt   10  ops/ms        0.95
> IndexInRangeBenchmark.shortIndexInRange       256  thrpt   10  ops/ms        1.00
> IndexInRangeBenchmark.shortIndexInRange       259  thrpt   10  ops/ms        0.96
> IndexInRangeBenchmark.shortIndexInRange       512  thrpt   10  ops/ms        1.00
>
> The uplift in `doubleIndexInRange` and `floatIndexInRange` appear...
@fg1417 Thanks for your help, this is really helpful!

You've also noticed slight regressions in a few cases, which is expected. The effect of this optimization is influenced by multiple factors, such as the alignment you mentioned on N2, as well as code generation and register allocation. The underlying principle is that the latency of the `cpy(imm, zeroing)` instruction seems to be quite high, while the `movi + cpy(imm, merging)` combination improves the instruction-level parallelism of the program. In some cases, a `mov` or another instruction with the same effect is already generated before the `cpy(imm, zeroing)` instruction, so those cases already get the benefit of the `movi + cpy(imm, merging)` combination; there, the extra `movi` is pure overhead, and the slight regression is expected. However, in cases where this optimization does apply, the improvement can be much larger. For example, on Neoverse-V2 I even saw a **2x** improvement with the benchmark shown after the sketch below.
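To make the two code sequences concrete, here is an assembler-level sketch. This is illustrative pseudocode, not the patch itself: the helper names only approximate HotSpot's AArch64 assembler API, and the register and operand choices are mine.

```c++
// Sketch: emit a predicated splat of an immediate into dst, governed by pg.
// T is the element size; prefer_merging selects the two-instruction form.
if (prefer_merging) {
  // An Advanced SIMD write zeroes the upper bits of the Z register, so this
  // movi clears all of dst. It has no dependency on pg, so it can issue in
  // parallel with whatever computes the predicate.
  __ movi(dst, __ T16B, 0);                         // dst = all zeroes
  __ sve_cpy(dst, T, pg, imm, /* isMerge */ true);  // active lanes = imm
} else {
  // One instruction, but it depends on pg, and its latency appears to be
  // high on Neoverse-V2.
  __ sve_cpy(dst, T, pg, imm, /* isMerge */ false); // inactive lanes zeroed
}
```

The benchmark case: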
```java
@Param({"128"})
private int loop_iteration;

private static final VectorSpecies<Integer> ispecies =
        VectorSpecies.ofLargestShape(int.class);

private boolean[] mask_arr;

@Setup(Level.Trial)
public void BmSetup() {
    int array_size = loop_iteration * ispecies.length();
    mask_arr = new boolean[array_size];
    Random r = new Random();
    for (int i = 0; i < array_size; i++) {
        mask_arr[i] = r.nextBoolean();
    }
}

@CompilerControl(CompilerControl.Mode.INLINE)
private <E> long testIndexInRangeToLongKernel(VectorSpecies<E> species) {
    long sum = 0;
    VectorMask<E> m = VectorMask.fromArray(species, mask_arr, 0);
    for (int i = 0; i < loop_iteration; i++) {
        sum += m.indexInRange(i & (m.length() - 1), m.length()).toLong();
    }
    return sum;
}

@Benchmark
public long indexInRangeToLongInt() {
    return testIndexInRangeToLongKernel(ispecies);
}
```
Therefore, when you test this change with the case above, you should see a significant performance improvement.
> I see 2% uplift on these numbers.
@theRealAph I think this also answers your question about these numbers.
> One thing you can do is add a flag to control this minor optimization, but make it constexpr bool = true until we know what other SVE implementations might do.
> In general:
> Dead code in HotSpot is dangerous. Over time it rots, leading to hard-to-diagnose bugs if it ever does get exercised. If we don't know that we need something, we shouldn't do it. Don't speculatively add code. We don't need the AArch64 port to become (even more of) a jungle of microarchitecture-specific tweaks.
Yeah, I agree with you: we shouldn't add dead code that makes the project increasingly difficult to maintain. I tried adding a user-invisible **constexpr** global variable, `constexpr bool _prefer_merging_mode_sve_cpy = true`, to the `VM_Version` class, like [_max_supported_sve_vector_length](https://github.com/openjdk/jdk/blob/fa1b1d677ac492dfdd3110b9303a4c2b009046c8/src/hotspot/cpu/aarch64/vm_version_aarch64.hpp#L53). But then we would have to delete the assembler instruction tests for [cpy(imm, zeroing)](https://github.com/openjdk/jdk/blob/fa1b1d677ac492dfdd3110b9303a4c2b009046c8/test/hotspot/gtest/aarch64/aarch64-asmtest.py#L1958), because the `cpy(imm, zeroing)` instruction would now always be emitted as the two-instruction sequence `movi + cpy(imm, merging)`. So I'm inclined to make this global variable **non-constexpr**, so we can set it to false while running that test and restore it afterwards. Is that fine with you?
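For illustration, a minimal sketch of what the non-constexpr variant could look like. The field name follows this discussion, but the exact declaration, accessor, and initialization are my assumptions, modeled on how `_max_supported_sve_vector_length` is exposed:

```c++
// vm_version_aarch64.hpp (sketch, not a committed patch)
class VM_Version : public Abstract_VM_Version {
  // Non-constexpr, so the asmtest/gtest harness can temporarily set it to
  // false, verify the raw cpy(imm, zeroing) encoding, and restore it.
  // It could also be set per microarchitecture during VM initialization if
  // other SVE implementations turn out to behave differently.
  static bool _prefer_merging_mode_sve_cpy;

 public:
  static bool prefer_merging_mode_sve_cpy() {
    return _prefer_merging_mode_sve_cpy;
  }
};

// vm_version_aarch64.cpp (sketch)
bool VM_Version::_prefer_merging_mode_sve_cpy = true;
```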
-------------
PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3808829111