RFR: 8362504: AArch64: Replace MOVZ+MOVK+MOVK with ADRP+ADD

Thu Aug 14 15:59:10 UTC 2025

On Fri, 8 Aug 2025 11:37:29 GMT, Andrew Haley <aph at openjdk.org> wrote:

>>> > I've done some modelling using llvm-mca and it looks like `adrp; add` is a win on recent Apple processors as well as on Arm processors, so go ahead with making this the default.
>>> 
>>> Thanks for testing that — really great to hear! I’ll update the patch with a constraint for AOT cache shortly.
>> 
>> Correction: I'm afraid that the llvm-mca results are nonsense. It says that this sequence
>> 
>> 
>>   movk w0, #0x1234, lsl 16
>>   movk w1, #0x1234, lsl 16
>>   movk w2, #0x1234, lsl 16
>>   movk w3, #0x1234, lsl 16
>>   movk w4, #0x1234, lsl 16
>>   movk w5, #0x1234, lsl 16
>>   movk w6, #0x1234, lsl 16
>>   movk w7, #0x1234, lsl 16
>> 
>> takes 2 clock cycles on Apple M1, but Dougall Johnson measured real hardware executing this at 1 clock cycle.
>> 
>> I'm not going to believe any more without numbers we can trust.
>
>> I'm not going to believe any more without numbers we can trust.
> 
> Sorry, it's 6 movz/movk per cycle, which I just confirmed by measuring it myself.

Hi @theRealAph, I ran some performance tests in C on several platforms, comparing `adrp + add` pairs with `movz + movk + movk` triples.

The test loop looks like this:

  clock_t start, end;
  double cpu_time_used;
  long iteration = 200000000;
  start = clock();

  for (long i = 0; i < iteration; i++) {
    test();
  }

  end = clock();
  cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
  printf("Test %s time: %f\n", name(NAME), cpu_time_used);

Contents of `test()`:

`adrp + add` pattern (48 pairs, cycling through registers `x0`-`x18`):

    adrp   x0, .LFB0
    add    x0, x0, #0xf00
    adrp   x1, .LFB0
    add    x1, x1, #0xf00
    adrp   x2, .LFB0
    add    x2, x2, #0xf00
    ...

`mov` triple pattern (48 triples, cycling through registers `x0`-`x18`):

    mov     x0, #0x1218
    movk    x0, #0xc801, lsl #16
    movk    x0, #0xeee4, lsl #32
    mov     x1, #0x1218
    movk    x1, #0xc801, lsl #16
    movk    x1, #0xeee4, lsl #32
    mov     x2, #0x1218
    movk    x2, #0xc801, lsl #16
    movk    x2, #0xeee4, lsl #32
    ...

Here are the results on the different platforms (CPU time in seconds for 200000000 calls to `test()`):

    Platform   adrp + add   movz + movk + movk
    n1         2.245888     3.335474
    v1         1.953309     2.870085
    v2         1.561889     2.271725
    m2         0.965305     1.264298

I also tested sequences that reuse a single register, `x8`, and `adrp + add` was still faster on every platform.

Do you think these numbers are trustworthy? Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26653#issuecomment-3188968224