RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v2]
Eric Fang
erfang at openjdk.org
Mon Feb 2 09:59:19 UTC 2026
On Mon, 2 Feb 2026 09:04:21 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Eric Fang has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Move the implementation into C2_MacroAssembler
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2846:
>
>> 2844: void C2_MacroAssembler::sve_cpy_optimized(FloatRegister dst, SIMD_RegVariant T,
>> 2845: PRegister pg, int imm8, bool isMerge) {
>> 2846: // When prefer_sve_merging_mode_cpy is enabled, optimize the SVE `cpy
>
> This comment says nothing that is not obvious from the code.
I’d like to briefly document the main idea of this method. How about adding a short comment before the method, such as `Provide an optimized implementation for cpy (imm, zeroing) instruction`? Or do you think it would be better to remove the comment entirely?
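To illustrate the idea such a comment would summarize, here is a minimal sketch of the selection logic. It models the method as a function that returns assembly text instead of driving the real assembler. `emit_sve_cpy_imm` is hypothetical, the `prefer_sve_merging_mode_cpy` flag is passed as a plain parameter, and the `movi v<dst>.2d, #0` form used for zeroing is an assumption. The actual `sve_cpy_optimized` may apply further conditions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the cpy (imm) selection logic: returns the
// assembly lines that would be emitted for one call.
std::vector<std::string> emit_sve_cpy_imm(int dst, char size, int pg, int imm8,
                                          bool is_merge,
                                          bool prefer_sve_merging_mode_cpy) {
  // SVE CPY (immediate) takes a signed 8-bit immediate.
  assert(imm8 >= -128 && imm8 <= 127);
  std::string z = "z" + std::to_string(dst) + "." + size;
  std::string p = "p" + std::to_string(pg);
  std::string imm = "#" + std::to_string(imm8);
  if (is_merge || !prefer_sve_merging_mode_cpy) {
    // Emit the requested form unchanged.
    return {"cpy " + z + ", " + p + (is_merge ? "/m, " : "/z, ") + imm};
  }
  // Zeroing form requested but merging is preferred: clear the register
  // first (assumed movi of zero on the 128-bit V view), then merge.
  return {"movi v" + std::to_string(dst) + ".2d, #0",
          "cpy " + z + ", " + p + "/m, " + imm};
}
```

For example, `emit_sve_cpy_imm(0, 'b', 1, 5, false, true)` yields the two lines `movi v0.2d, #0` and `cpy z0.b, p1/m, #5`, while the same call with the flag off yields the single line `cpy z0.b, p1/z, #5`.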
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2855:
>
>> 2853: // Z<dst> above 128, so this `movi` instruction effectively zeroes the
>> 2854: // entire Z<dst> register. According to the Arm Software Optimization
>> 2855: // Guide, `movi` is zero cost.
>
> I don't think it says that exactly. movi is handled early during renaming, but still occupies a decode slot.
Yeah, you are right: the movi uop is eliminated shortly downstream of the decoder. I should say `zero latency` instead.
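The correctness argument in the quoted comment, that `movi` clears the whole Z register so a merging `cpy` reproduces the zeroing form, can be checked with a small lane-level model. This is a toy sketch, assuming a 256-bit vector length and byte-sized lanes; all names are hypothetical, and the model says nothing about latency or decode slots.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model, assuming a 256-bit SVE vector length and byte lanes
// (real SVE vector lengths range from 128 to 2048 bits).
constexpr int kVLBytes = 32;
using ZReg = std::array<uint8_t, kVLBytes>;
using PReg = std::array<bool, kVLBytes>;  // one predicate bit per byte lane

// cpy zd.b, pg/z, #imm: active lanes get imm, inactive lanes become 0.
ZReg cpy_zeroing(const PReg& pg, uint8_t imm) {
  ZReg out{};
  for (int i = 0; i < kVLBytes; ++i) out[i] = pg[i] ? imm : uint8_t{0};
  return out;
}

// cpy zd.b, pg/m, #imm: active lanes get imm, inactive lanes keep zd.
ZReg cpy_merging(ZReg zd, const PReg& pg, uint8_t imm) {
  for (int i = 0; i < kVLBytes; ++i)
    if (pg[i]) zd[i] = imm;
  return zd;
}

// movi vd.2d, #0: writes zero to the low 128 bits; an AdvSIMD write also
// zeroes the Z register bits above 128, so every byte ends up zero.
ZReg movi_zero(ZReg zd) {
  for (int i = 0; i < 16; ++i) zd[i] = 0;         // V<dst> part
  for (int i = 16; i < kVLBytes; ++i) zd[i] = 0;  // upper Z<dst> part
  return zd;
}

// The rewrite under review: movi followed by merging cpy.
ZReg cpy_zeroing_via_merge(ZReg zd, const PReg& pg, uint8_t imm) {
  return cpy_merging(movi_zero(zd), pg, imm);
}
```

Because the model is pure, `cpy_zeroing_via_merge(old, pg, imm)` can be compared with `cpy_zeroing(pg, imm)` for arbitrary `old` contents. They agree, whereas `cpy_merging` alone leaves stale bytes in the inactive lanes.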
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/29359#discussion_r2753482758
PR Review Comment: https://git.openjdk.org/jdk/pull/29359#discussion_r2753500143
More information about the hotspot-dev mailing list