RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v2]

Mon Feb 2 09:59:19 UTC 2026

On Mon, 2 Feb 2026 09:04:21 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Eric Fang has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Move the implementation into C2_MacroAssembler
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2846:
> 
>> 2844: void C2_MacroAssembler::sve_cpy_optimized(FloatRegister dst, SIMD_RegVariant T,
>> 2845:                                           PRegister pg, int imm8, bool isMerge) {
>> 2846:   // When prefer_sve_merging_mode_cpy is enabled, optimize the SVE `cpy
> 
> This comment says nothing that is not obvious from the code.

I'd like to briefly document the main idea of this method. How about adding a short comment before the method, such as `Provide an optimized implementation for cpy (imm, zeroing) instruction`? Or do you think it would be better to remove the comment entirely?

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2855:
> 
>> 2853:     // Z<dst> above 128, so this `movi` instruction effectively zeroes the
>> 2854:     // entire Z<dst> register. According to the Arm Software Optimization
>> 2855:     // Guide, `movi` is zero cost.
> 
> I don't think it says that exactly. movi is handled early during renaming, but still occupies a decode slot.

Yeah, you are right: the `movi` uop is only eliminated shortly downstream of the decoder, so it still occupies a decode slot. I should say `zero latency` rather than `zero cost`.
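
For reference, the equivalence the method relies on can be sketched as a small standalone model. This is not the HotSpot code: the 8-lane register, the `ZReg`/`PReg` types and all function names below are hypothetical and only illustrate the idea that zeroing the whole register with `movi` and then using the merging form of `cpy` should give the same result as the zeroing form of `cpy`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: one Z register as 8 x 32-bit lanes (assumes a 256-bit
// vector length) and a predicate with one flag per lane.
constexpr std::size_t kLanes = 8;
using ZReg = std::array<int32_t, kLanes>;
using PReg = std::array<bool, kLanes>;

// Models cpy (imm, zeroing): active lanes receive the immediate,
// inactive lanes are set to zero.
ZReg cpy_zeroing(const PReg& pg, int8_t imm) {
  ZReg dst{};
  for (std::size_t i = 0; i < kLanes; i++) {
    dst[i] = pg[i] ? imm : 0;
  }
  return dst;
}

// Models cpy (imm, merging): active lanes receive the immediate,
// inactive lanes keep their previous contents.
ZReg cpy_merging(ZReg dst, const PReg& pg, int8_t imm) {
  for (std::size_t i = 0; i < kLanes; i++) {
    if (pg[i]) {
      dst[i] = imm;
    }
  }
  return dst;
}

// Models movi with a zero immediate: writing the 128-bit V register also
// clears the bits of Z<dst> above 128, so the whole register becomes zero.
ZReg movi_zero() {
  return ZReg{};
}

// The rewritten sequence: zero the register first, then merge the immediate.
ZReg cpy_zeroing_via_merging(const PReg& pg, int8_t imm) {
  return cpy_merging(movi_zero(), pg, imm);
}

// Checks that both sequences agree for one predicate and immediate.
void check_equivalence(const PReg& pg, int8_t imm) {
  assert(cpy_zeroing(pg, imm) == cpy_zeroing_via_merging(pg, imm));
}
```

Calling `check_equivalence` with any predicate and any 8-bit immediate should pass in this model, since the inactive lanes are zero in both sequences.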

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29359#discussion_r2753482758
PR Review Comment: https://git.openjdk.org/jdk/pull/29359#discussion_r2753500143