RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction

Eric Fang erfang at openjdk.org
Tue Feb 3 08:33:21 UTC 2026


On Wed, 28 Jan 2026 10:17:30 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> @fg1417 thanks for your help, this is really helpful! 
>> 
>> You've also noticed a slight regression in a few cases, which is expected. The effect of this optimization depends on multiple factors, such as the alignment you mentioned on N2, as well as code generation and register allocation. The underlying principle is that the `cpy(imm, zeroing)` instruction appears to have a relatively high latency, while the `movi + cpy(imm, merging)` combination improves instruction-level parallelism. In some cases a `mov` (or another instruction with the same effect) is already generated before the `cpy(imm, zeroing)`, so those cases already get the benefit of the `movi + cpy(imm, merging)` combination, and the extra `movi` emitted by this change causes a slight regression there. However, in the cases where this optimization does apply, the improvement can be much larger. For example, in the following case I even saw a **2x** performance improvement on Neoverse-V2.
>> 
>>     @Param({"128"})
>>     private int loop_iteration;
>>     private static final VectorSpecies<Integer> ispecies = VectorSpecies.ofLargestShape(int.class);
>>     private boolean[] mask_arr;
>> 
>>     @Setup(Level.Trial)
>>     public void BmSetup() {
>>         int array_size = loop_iteration * ispecies.length();
>>         mask_arr = new boolean[array_size];
>>         Random r = new Random();
>>         for (int i = 0; i < array_size; i++) {
>>             mask_arr[i] = r.nextBoolean();
>>         }
>>     }
>> 
>>     @CompilerControl(CompilerControl.Mode.INLINE)
>>     private <E> long testIndexInRangeToLongKernel(VectorSpecies<E> species) {
>>         long sum = 0;
>>         VectorMask<E> m = VectorMask.fromArray(species, mask_arr, 0);
>>         for (int i = 0; i < loop_iteration; i++) {
>>             sum += m.indexInRange(i & (m.length() - 1), m.length()).toLong();
>>         }
>>         return sum;
>>     }
>> 
>>     @Benchmark
>>     public long indexInRangeToLongInt() {
>>         return testIndexInRangeToLongKernel(ispecies);
>>     }
>> 
>> 
>> Therefore, when you test this change using the C case, you will see a significant performance improvement.
>>> I see 2% uplift on these numbers.
>> 
>> @theRealAph And I think this also explains your question on these numbers.
>> 
>>> One thing you can do is add a flag to control this minor optimization, but make it constexpr bool = true until we know what other SVE implementations might do.
>> In general:
>> Dea...
>
>> Therefore, when you test this change using the C case, you will see a significant performance improvement.
>> 
>> > I see 2% uplift on these numbers.
>> 
>> @theRealAph And I think this also explains your question on these numbers.
> 
> Not at all.
> 
> The performance claim above was:
> 
>> Microbenchmarks show this change brings performance uplift ranging from 11% to 33%, depending on the specific operation and data types.
> 
> But the real performance uplift, as measured in Java microbenchmarks, is 2%.

Hi @theRealAph, I’ve updated the code comments based on your suggestions. Thank you for your patient review!
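
For anyone following the thread, the instruction-level difference under discussion can be sketched roughly as below. This is illustrative AArch64 SVE assembly only, not the exact code C2 emits; the registers, element size, and immediate are made up for the example. (Note that a NEON write such as `movi` zeroes the upper bits of the corresponding Z register, so the destination is fully cleared before the merging `cpy`.)

    // Before: zeroing-predicate form. One instruction, but the
    // zeroing CPY appears to have a relatively high latency.
    cpy   z0.s, p0/z, #1      // z0[i] = active(p0[i]) ? 1 : 0

    // After: clear the destination with an independent movi, then
    // use the merging form; the movi has no dependency on p0 and
    // can execute in parallel with earlier instructions.
    movi  v0.16b, #0          // zeroes all of z0 (SVE upper bits too)
    cpy   z0.s, p0/m, #1      // z0[i] = active(p0[i]) ? 1 : z0[i]

When a suitable zeroing instruction already exists before the `cpy`, the second sequence is effectively what executes today, which is consistent with the cases above where the patch shows little or no change.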

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29359#issuecomment-3839845869


More information about the hotspot-dev mailing list