RFR: 8343789: Move mutable nmethod data out of CodeCache [v9]
Boris Ulasevich
bulasevich at openjdk.org
Fri Feb 7 18:33:23 UTC 2025
On Mon, 3 Feb 2025 14:16:41 GMT, Andrew Dinn <adinn at openjdk.org> wrote:
>> src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1422:
>>
>>> 1420: bool force_movk = true; // movk is important if the target can be more than 4GB away
>>> 1421: adrp(dest, const_addr, offset, force_movk);
>>> 1422: ldr(dest, Address(dest, offset));
>>
>> I wonder if this really is the best way to do it. It's not clear to me that there is any advantage of using `adrp` in this case rather than a simple `mov(scratch, const_addr); ldr(dest, Address(scratch));`. The `mov` would produce `movz; movk; movk`, which almost certainly execute in a single cycle, then a load without an offset, which is a single micro-op rather than two micro-ops for load+offset. All we've gained for this complication is a small reduction in code size rather than a performance improvement. I'd go with simplicity.
>
> Yes, I agree.
> It's not clear to me that there is any advantage of using adrp
I think ADRP+MOVK is better in terms of both performance and code density.
<details>
<summary>Simple asm experiment shows that ADRP+MOVK performs better on my machine</summary>
$ gcc test.cpp && ./a.out
Allocated at: 0x10ffff0000
Elapsed time (movz-movk-movk): 3145591
Elapsed time (adrp-movk): 2739354
============================================================
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <time.h>

int main() {
  // Map one page at a fixed address above 4GB.
  void *desired_addr = (void *)0x10ffff0000;
  size_t size = 4096;
  int32_t* data = (int32_t*)mmap(desired_addr, size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
  if (data == MAP_FAILED) {
    perror("mmap failed");
    return 1;
  }
  data[0] = 13; data[1] = 14; data[2] = 15; data[3] = 16;
  printf("Allocated at: %p\n", data);

  int32_t* ptr = (int32_t*)&main;
  int aa = 1;
  int bb = 2;
  int cc = 3;
  int dd = 4;
  clock_t start;

  // Variant 1: build the full address with movz+movk+movk, then load without an offset.
  start = clock();
  for (int i=0; i<1000*1000*1000; i++) {
    asm (
      "movz %0, 0x10, lsl #32; movk %0, 0xffff, lsl #16; movk %0, 0; ldr %0, [%0]; "
      "movz %1, 0x10, lsl #32; movk %1, 0xffff, lsl #16; movk %1, 4; ldr %1, [%1]; "
      "movz %2, 0x10, lsl #32; movk %2, 0xffff, lsl #16; movk %2, 8; ldr %2, [%2]; "
      "movz %3, 0x10, lsl #32; movk %3, 0xffff, lsl #16; movk %3, 12;ldr %3, [%3]; "
      : "=r" (aa), "=r" (bb), "=r" (cc), "=r" (dd) /* Output operands */
      :
      : "cc");
  }
  // clock() reports CPU time in units of 1/CLOCKS_PER_SEC seconds.
  printf("Elapsed time (movz-movk-movk): %li\n", (clock() - start));

  // Variant 2: adrp+movk, then load with an immediate offset.
  // adrp yields the 4KB page address of main; movk overwrites bits 32..47 with 0.
  start = clock();
  for (int i=0; i<1000*1000*1000; i++) {
    asm (
      "adrp %0, main; movk %0, 0, lsl #32; ldr %0, [%0, 0x0];"
      "adrp %1, main; movk %1, 0, lsl #32; ldr %1, [%1, 0x4];"
      "adrp %2, main; movk %2, 0, lsl #32; ldr %2, [%2, 0x8];"
      "adrp %3, main; movk %3, 0, lsl #32; ldr %3, [%3, 0xc];"
      : "=r" (aa), "=r" (bb), "=r" (cc), "=r" (dd) /* Output operands */
      :
      : "cc");
  }
  printf("Elapsed time (adrp-movk): %li\n", (clock() - start));

  munmap(data, size);
  return 0;
}
</details>
The results are consistent with llvm-mca analysis:
<details>
<summary>llvm-mca analysis suggests that ADRP+MOVK performs better on Neoverse-N1</summary>
Neoverse-N1: ADRP-MOVK wins over MOVZ-MOVK-MOVK
- Fewer instructions over 100 iterations (300 vs 400)
- Fewer total cycles (75 vs 109) - Faster execution
- Lower block reciprocal throughput (0.7 vs 1.0) - More efficient execution
- Less resource pressure - Less risk of pipeline stalls
===================================================================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-n1 asm1.S
Iterations: 100
Instructions: 300
Total Cycles: 75
Total uOps: 300
Dispatch Width: 8
uOps Per Cycle: 4.00
IPC: 4.00
Block RThroughput: 0.7
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        adrp x0, target
 1      1     0.33                        movk x0, #39612, lsl #32
 1      4     0.50    *                   ldr x1, [x0]
Resources:
[0] - N1UnitB
[1.0] - N1UnitD
[1.1] - N1UnitD
[2.0] - N1UnitL
[2.1] - N1UnitL
[3] - N1UnitM
[4.0] - N1UnitS
[4.1] - N1UnitS
[5] - N1UnitV0
[6] - N1UnitV1
Resource pressure per iteration:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]
 -      -      -     0.50   0.50   0.66   0.67   0.67    -      -
Resource pressure by instruction:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]    Instructions:
 -      -      -      -      -     0.33   0.33   0.34    -      -     adrp x0, target
 -      -      -      -      -     0.33   0.34   0.33    -      -     movk x0, #39612, lsl #32
 -      -      -     0.50   0.50    -      -      -      -      -     ldr x1, [x0]
===================================================================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-n1 asm2.S
Iterations: 100
Instructions: 400
Total Cycles: 109
Total uOps: 400
Dispatch Width: 8
uOps Per Cycle: 3.67
IPC: 3.67
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov x0, #20014547599360
 1      1     0.33                        movk x0, #22136, lsl #16
 1      1     0.33                        movk x0, #39612
 1      4     0.50    *                   ldr x1, [x0]
Resources:
[0] - N1UnitB
[1.0] - N1UnitD
[1.1] - N1UnitD
[2.0] - N1UnitL
[2.1] - N1UnitL
[3] - N1UnitM
[4.0] - N1UnitS
[4.1] - N1UnitS
[5] - N1UnitV0
[6] - N1UnitV1
Resource pressure per iteration:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]
 -      -      -     0.50   0.50   0.99   1.00   1.01    -      -
Resource pressure by instruction:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]    Instructions:
 -      -      -      -      -      -     0.66   0.34    -      -     mov x0, #20014547599360
 -      -      -      -      -     0.33   0.34   0.33    -      -     movk x0, #22136, lsl #16
 -      -      -      -      -     0.66    -     0.34    -      -     movk x0, #39612
 -      -      -     0.50   0.50    -      -      -      -      -     ldr x1, [x0]
</details>
<details>
<summary>llvm-mca analysis suggests that ADRP+MOVK performs better on Neoverse-V2</summary>
Neoverse-V2: ADRP-MOVK wins over MOVZ-MOVK-MOVK
- Fewer instructions over 100 iterations (300 vs 400)
- Fewer total cycles (42 vs 59) - Faster execution
- Lower block reciprocal throughput (0.3 vs 0.5) - More efficient execution
- Less resource pressure - Less risk of pipeline stalls
===================================================================================================================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-v2 asm1.S
Iterations: 100
Instructions: 300
Total Cycles: 42
Total uOps: 300
Dispatch Width: 16
uOps Per Cycle: 7.14
IPC: 7.14
Block RThroughput: 0.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        adrp x0, target
 1      1     0.17                        movk x0, #39612, lsl #32
 1      4     0.33    *                   ldr x1, [x0]
Resources:
[0.0] - V2UnitB
[0.1] - V2UnitB
[1.0] - V2UnitD
[1.1] - V2UnitD
[2] - V2UnitL2
[3.0] - V2UnitL01
[3.1] - V2UnitL01
[4] - V2UnitM0
[5] - V2UnitM1
[6] - V2UnitS0
[7] - V2UnitS1
[8] - V2UnitS2
[9] - V2UnitS3
[10] - V2UnitV0
[11] - V2UnitV1
[12] - V2UnitV2
[13] - V2UnitV3
Resource pressure per iteration:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 -      -      -      -     0.33   0.33   0.34   0.33   0.33   0.34   0.34   0.33   0.33    -      -      -      -
Resource pressure by instruction:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 -      -      -      -      -      -      -     0.33   0.33   0.17   0.17    -      -      -      -      -      -     adrp x0, target
 -      -      -      -      -      -      -      -      -     0.17   0.17   0.33   0.33    -      -      -      -     movk x0, #39612, lsl #32
 -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -     ldr x1, [x0]
===================================================================================================================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-v2 asm2.S
Iterations: 100
Instructions: 400
Total Cycles: 59
Total uOps: 400
Dispatch Width: 16
uOps Per Cycle: 6.78
IPC: 6.78
Block RThroughput: 0.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.17                        mov x0, #20014547599360
 1      1     0.17                        movk x0, #22136, lsl #16
 1      1     0.17                        movk x0, #39612
 1      4     0.33    *                   ldr x1, [x0]
Resources:
[0.0] - V2UnitB
[0.1] - V2UnitB
[1.0] - V2UnitD
[1.1] - V2UnitD
[2] - V2UnitL2
[3.0] - V2UnitL01
[3.1] - V2UnitL01
[4] - V2UnitM0
[5] - V2UnitM1
[6] - V2UnitS0
[7] - V2UnitS1
[8] - V2UnitS2
[9] - V2UnitS3
[10] - V2UnitV0
[11] - V2UnitV1
[12] - V2UnitV2
[13] - V2UnitV3
Resource pressure per iteration:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 -      -      -      -     0.33   0.33   0.34   0.50   0.50   0.50   0.50   0.50   0.50    -      -      -      -
Resource pressure by instruction:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 -      -      -      -      -      -      -      -      -     0.33   0.33   0.17   0.17    -      -      -      -     mov x0, #20014547599360
 -      -      -      -      -      -      -     0.17   0.17   0.16   0.16   0.17   0.17    -      -      -      -     movk x0, #22136, lsl #16
 -      -      -      -      -      -      -     0.33   0.33   0.01   0.01   0.16   0.16    -      -      -      -     movk x0, #39612
 -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -     ldr x1, [x0]
===================================================================================================================================================
</details>
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1946989009