RFR: 8362504: AArch64: Replace MOVZ+MOVK+MOVK with ADRP+ADD
Fei Gao
fgao at openjdk.org
Thu Aug 14 15:59:10 UTC 2025
On Fri, 8 Aug 2025 11:37:29 GMT, Andrew Haley <aph at openjdk.org> wrote:
>>> > I've done some modelling using llvm-mca and it looks like `adrp; add` is a win on recent Apple processors as well as on Arm processors, so go ahead with making this the default.
>>>
>>> Thanks for testing that — really great to hear! I’ll update the patch with a constraint for AOT cache shortly.
>>
>> Correction: I'm afraid that the llvm-mca results are nonsense. It says that this sequence
>>
>>
>> movk w0, #0x1234, lsl 16
>> movk w1, #0x1234, lsl 16
>> movk w2, #0x1234, lsl 16
>> movk w3, #0x1234, lsl 16
>> movk w4, #0x1234, lsl 16
>> movk w5, #0x1234, lsl 16
>> movk w6, #0x1234, lsl 16
>> movk w7, #0x1234, lsl 16
>>
>> takes 2 clock cycles on Apple M1, but Dougall Johnson measured real hardware executing this at 1 clock cycle.
>>
>> I'm not going to believe any more without numbers we can trust.
>
>> I'm not going to believe any more without numbers we can trust.
>
> Sorry, it's 6 movz/movk per cycle, which I just confirmed by measuring it myself.
Hi @theRealAph, I ran some performance tests in C on several platforms to compare `adrp + add` pairs with `movz + movk + movk` triples.
The test loop looks like this:
clock_t start, end;
double cpu_time_used;
long iteration = 200000000;
start = clock();
for (long i = 0; i < iteration; i++) {
    test();
}
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("Test %s time: %f\n", name(NAME), cpu_time_used);
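For anyone who wants to reproduce the measurement, the loop above can be wrapped in a small helper. This is only a sketch under assumptions: `time_loop` and `empty_test` are hypothetical names that do not appear in the original test, and `empty_test` is a stand-in for the real `test()` body, which holds the instruction sequence being measured.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for test(); the real one contains the
 * adrp + add pairs or the mov triples. */
static void empty_test(void) {
#if defined(__GNUC__) || defined(__clang__)
    /* Empty asm statement so the compiler does not discard the call. */
    __asm__ volatile("" ::: "memory");
#endif
}

/* Call fn() `iterations` times and return the CPU time used, in seconds. */
static double time_loop(void (*fn)(void), long iterations) {
    assert(fn != NULL && iterations > 0);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        fn();
    }
    clock_t end = clock();
    return ((double)(end - start)) / CLOCKS_PER_SEC;
}
```

Timing `empty_test` with the same iteration count gives a rough idea of how much of each measurement is call and loop overhead rather than the instructions under test.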
`test()` Function Contents:
`adrp + add` pattern (48 pairs using different registers, `x0`-`x18`):
adrp x0, .LFB0
add x0, x0, #0xf00
adrp x1, .LFB0
add x1, x1, #0xf00
adrp x2, .LFB0
add x2, x2, #0xf00
...
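As background for the pattern above: `adrp` produces a 4 KiB page base relative to the PC, and the following `add` supplies the offset within that page (the test uses a fixed `#0xf00`). The sketch below models that arithmetic in C. The names `adrp_add_imm`, `split_address` and `materialize` are made up for illustration and are not HotSpot code.

```c
#include <assert.h>
#include <stdint.h>

/* Immediates for an adrp + add pair that forms `target` when run at `pc`. */
typedef struct {
    int64_t  page_delta;  /* adrp immediate, counted in 4 KiB pages */
    uint32_t low12;       /* add immediate: offset within the page */
} adrp_add_imm;

static adrp_add_imm split_address(uint64_t pc, uint64_t target) {
    adrp_add_imm r;
    /* adrp ignores the low 12 bits of the PC, so only page numbers matter. */
    r.page_delta = (int64_t)(target >> 12) - (int64_t)(pc >> 12);
    r.low12 = (uint32_t)(target & 0xfff);
    /* The signed 21-bit page immediate limits the reach to about +/-4 GiB. */
    assert(r.page_delta >= -(INT64_C(1) << 20) &&
           r.page_delta < (INT64_C(1) << 20));
    return r;
}

/* Value the pair leaves in the destination register. */
static uint64_t materialize(uint64_t pc, adrp_add_imm imm) {
    uint64_t page = (pc & ~UINT64_C(0xfff)) + ((uint64_t)imm.page_delta << 12);
    return page + imm.low12;
}
```

The range check is the reason the pair only works for targets within roughly 4 GiB of the instruction, whereas the `mov` triple can load any 48-bit value.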
`mov` triple pattern (48 sets using different registers, `x0`-`x18`):
mov x0, #0x1218
movk x0, #0xc801, lsl #16
movk x0, #0xeee4, lsl #32
mov x1, #0x1218
movk x1, #0xc801, lsl #16
movk x1, #0xeee4, lsl #32
mov x2, #0x1218
movk x2, #0xc801, lsl #16
movk x2, #0xeee4, lsl #32
...
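Each triple above loads one 48-bit constant, 16 bits at a time: the `mov` sets bits 0-15 and the two `movk` instructions fill bits 16-31 and 32-47. A small sketch of that split follows; `chunk16` and `print_mov_triple` are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The 16-bit chunk of `value` that a mov/movk with `lsl #shift` supplies. */
static uint16_t chunk16(uint64_t value, unsigned shift) {
    assert(shift % 16 == 0 && shift < 64);
    return (uint16_t)(value >> shift);
}

/* Print a mov + movk + movk triple that loads a 48-bit `value` into x<reg>. */
static void print_mov_triple(unsigned reg, uint64_t value) {
    assert(value < (UINT64_C(1) << 48));
    printf("mov x%u, #0x%x\n", reg, (unsigned)chunk16(value, 0));
    printf("movk x%u, #0x%x, lsl #16\n", reg, (unsigned)chunk16(value, 16));
    printf("movk x%u, #0x%x, lsl #32\n", reg, (unsigned)chunk16(value, 32));
}
```

Calling `print_mov_triple(0, UINT64_C(0xeee4c8011218))` reproduces the three `x0` lines of the pattern, since the constant used in the test splits into `0x1218`, `0xc801` and `0xeee4`.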
Here are the results on different platforms (CPU time in seconds):
platform   Test adrp time   Test movks time
n1         2.245888         3.335474
v1         1.953309         2.870085
v2         1.561889         2.271725
m2         0.965305         1.264298
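To make the comparison easier to read, the ratio of the two timings can be computed per platform. The sketch below uses only the numbers reported above; the `result` struct and the `slowdown` and `print_ratios` helpers are illustrative names, not part of the original test. The ratios come out between roughly 1.3 (m2) and 1.5 (n1).

```c
#include <assert.h>
#include <stdio.h>

/* Timings copied from the results above, in seconds. */
struct result {
    const char *platform;
    double adrp;   /* "Test adrp time" */
    double movks;  /* "Test movks time" */
};

static const struct result results[] = {
    {"n1", 2.245888, 3.335474},
    {"v1", 1.953309, 2.870085},
    {"v2", 1.561889, 2.271725},
    {"m2", 0.965305, 1.264298},
};

/* How many times longer the mov triple run took than the adrp + add run. */
static double slowdown(const struct result *r) {
    assert(r->adrp > 0.0);
    return r->movks / r->adrp;
}

static void print_ratios(void) {
    for (size_t i = 0; i < sizeof results / sizeof results[0]; i++) {
        printf("%s: movks/adrp = %.2f\n",
               results[i].platform, slowdown(&results[i]));
    }
}
```

These ratios include the loop and call overhead of the harness, so they likely understate the difference between the instruction sequences themselves.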
I also tested sequences that reuse a single register, `x8`, and `adrp + add` was still faster on all platforms.
Do you think these numbers are trustworthy? Thanks!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26653#issuecomment-3188968224