RFR: 8343789: Move mutable nmethod data out of CodeCache [v9]

Fri Feb 7 18:33:23 UTC 2025

On Mon, 3 Feb 2025 14:16:41 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1422:
>> 
>>> 1420:     bool force_movk = true; // movk is important if the target can be more than 4GB away
>>> 1421:     adrp(dest, const_addr, offset, force_movk);
>>> 1422:     ldr(dest, Address(dest, offset));
>> 
>> I wonder if this really is the best way to do it. It's not clear to me that there is any advantage of using `adrp` in this case rather than a simple `mov(scratch, const_adr); ldr(dest, Address(scratch);`. The `mov` would produce `movz; movk; movk` which almost certainly execute in a single cycle, then a load without an offset, which is a single micro-op rather than two micro-ops for load+offset. All we've gained for this complication is a small reduction in code density rather than a performance improvement. I'd go with simplicity.
>
> Yes, I agree.

> It's not clear to me that there is any advantage of using adrp

I think ADRP+MOVK is better both in terms of performance and code density.

<details>
<summary>Simple asm experiment shows that ADRP+MOVK performs better on my machine</summary>

    $ gcc test.cpp && ./a.out

    Allocated at: 0x10ffff0000
    Elapsed time (movz-movk-movk): 3145591
    Elapsed time (adrp-movk): 2739354

    ============================================================
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <time.h>

    int main() {
        void *desired_addr = (void *)0x10ffff0000;
        size_t size = 4096;

        int32_t* data = (int32_t*)mmap(desired_addr, size, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        if (data == MAP_FAILED) {
            perror("mmap failed");
            return 1;
        }
        data[0] = 13; data[1] = 14; data[2] = 15; data[3] = 16;

        printf("Allocated at: %p\n", data);
        int32_t* ptr = (int32_t*)&main;

        int aa = 1;
        int bb = 2;
        int cc = 3;
        int dd = 4;
        clock_t start;

        start = clock();
        for (int i=0; i<1000*1000*1000; i++) {
          asm (
          "movz %0, 0x10, lsl #32; movk %0, 0xffff, lsl #16; movk %0, 0; ldr %0, [%0]; "
          "movz %1, 0x10, lsl #32; movk %1, 0xffff, lsl #16; movk %1, 4; ldr %1, [%1]; "
          "movz %2, 0x10, lsl #32; movk %2, 0xffff, lsl #16; movk %2, 8; ldr %2, [%2]; "
          "movz %3, 0x10, lsl #32; movk %3, 0xffff, lsl #16; movk %3, 12;ldr %3, [%3]; "
          : "=r" (aa), "=r" (bb), "=r" (cc), "=r" (dd)  /* Output operands */
          :
          : "cc");
        }
        printf("Elapsed time (movz-movk-movk): %li\n", (clock() - start));

        start = clock();
        for (int i=0; i<1000*1000*1000; i++) {
          asm (
          "adrp %0, main; movk %0, 0, lsl #32; ldr %0, [%0, 0x0];"
          "adrp %1, main; movk %1, 0, lsl #32; ldr %1, [%1, 0x4];"
          "adrp %2, main; movk %2, 0, lsl #32; ldr %2, [%2, 0x8];"
          "adrp %3, main; movk %3, 0, lsl #32; ldr %3, [%3, 0xc];"
          : "=r" (aa), "=r" (bb), "=r" (cc), "=r" (dd)  /* Output operands */
          :
          : "cc");
        }
        printf("Elapsed time (adrp-movk): %li\n", (clock() - start));

        munmap(data, size);
        return 0;
    }

</details>

The results are consistent with llvm-mca analysis:
<details>
<summary>llvm-mca analysis suggests that ADRP+MOVK performs better on Neoverse-N1</summary>

    Neoverse-N1: ADRP-MOVK wins over MOVZ-MOVK-MOVK

    - Fewer instructions (300 vs 400)
    - Fewer total cycles (75 vs 109) - Faster execution
    - Lower block reciprocal throughput (Block RThroughput: 0.7 vs 1.0) - More efficient execution
    - Less resource pressure - Less risk of pipeline stalls

    ===================================================================================================

    $ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-n1 asm1.S

    Iterations:        100
    Instructions:      300
    Total Cycles:      75
    Total uOps:        300

    Dispatch Width:    8
    uOps Per Cycle:    4.00
    IPC:               4.00
    Block RThroughput: 0.7

    Instruction Info:
    [1]: #uOps
    [2]: Latency
    [3]: RThroughput
    [4]: MayLoad
    [5]: MayStore
    [6]: HasSideEffects (U)

    [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
     1      1     0.33                        adrp  x0, target
     1      1     0.33                        movk  x0, #39612, lsl #32
     1      4     0.50    *                   ldr   x1, [x0]

    Resources:
    [0]   - N1UnitB
    [1.0] - N1UnitD
    [1.1] - N1UnitD
    [2.0] - N1UnitL
    [2.1] - N1UnitL
    [3]   - N1UnitM
    [4.0] - N1UnitS
    [4.1] - N1UnitS
    [5]   - N1UnitV0
    [6]   - N1UnitV1

    Resource pressure per iteration:
    [0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]
     -      -      -     0.50   0.50   0.66   0.67   0.67    -      -

    Resource pressure by instruction:
    [0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]    Instructions:
     -      -      -      -      -     0.33   0.33   0.34    -      -     adrp      x0, target
     -      -      -      -      -     0.33   0.34   0.33    -      -     movk      x0, #39612, lsl #32
     -      -      -     0.50   0.50    -      -      -      -      -     ldr       x1, [x0]

    ===================================================================================================

    $ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-n1 asm2.S

    Iterations:        100
    Instructions:      400
    Total Cycles:      109
    Total uOps:        400

    Dispatch Width:    8
    uOps Per Cycle:    3.67
    IPC:               3.67
    Block RThroughput: 1.0

    Instruction Info:
    [1]: #uOps
    [2]: Latency
    [3]: RThroughput
    [4]: MayLoad
    [5]: MayStore
    [6]: HasSideEffects (U)

    [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
     1      1     0.33                        mov   x0, #20014547599360
     1      1     0.33                        movk  x0, #22136, lsl #16
     1      1     0.33                        movk  x0, #39612
     1      4     0.50    *                   ldr   x1, [x0]

    Resources:
    [0]   - N1UnitB
    [1.0] - N1UnitD
    [1.1] - N1UnitD
    [2.0] - N1UnitL
    [2.1] - N1UnitL
    [3]   - N1UnitM
    [4.0] - N1UnitS
    [4.1] - N1UnitS
    [5]   - N1UnitV0
    [6]   - N1UnitV1

    Resource pressure per iteration:
    [0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]
     -      -      -     0.50   0.50   0.99   1.00   1.01    -      -

    Resource pressure by instruction:
    [0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]    Instructions:
     -      -      -      -      -      -     0.66   0.34    -      -     mov       x0, #20014547599360
     -      -      -      -      -     0.33   0.34   0.33    -      -     movk      x0, #22136, lsl #16
     -      -      -      -      -     0.66    -     0.34    -      -     movk      x0, #39612
     -      -      -     0.50   0.50    -      -      -      -      -     ldr       x1, [x0]

</details>

<details>
<summary>llvm-mca analysis suggests that ADRP+MOVK performs better on Neoverse-V2</summary>

    - Fewer instructions (300 vs 400)
    - Fewer total cycles (42 vs 59) - Faster execution
    - Lower block reciprocal throughput (Block RThroughput: 0.3 vs 0.5) - More efficient execution
    - Less resource pressure - Less risk of pipeline stalls

    ===================================================================================================================================================

    $ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-v2 asm1.S

    Iterations:        100
    Instructions:      300
    Total Cycles:      42
    Total uOps:        300

    Dispatch Width:    16
    uOps Per Cycle:    7.14
    IPC:               7.14
    Block RThroughput: 0.3

    Instruction Info:
    [1]: #uOps
    [2]: Latency
    [3]: RThroughput
    [4]: MayLoad
    [5]: MayStore
    [6]: HasSideEffects (U)

    [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
     1      1     0.25                        adrp  x0, target
     1      1     0.17                        movk  x0, #39612, lsl #32
     1      4     0.33    *                   ldr   x1, [x0]

    Resources:
    [0.0] - V2UnitB
    [0.1] - V2UnitB
    [1.0] - V2UnitD
    [1.1] - V2UnitD
    [2]   - V2UnitL2
    [3.0] - V2UnitL01
    [3.1] - V2UnitL01
    [4]   - V2UnitM0
    [5]   - V2UnitM1
    [6]   - V2UnitS0
    [7]   - V2UnitS1
    [8]   - V2UnitS2
    [9]   - V2UnitS3
    [10]  - V2UnitV0
    [11]  - V2UnitV1
    [12]  - V2UnitV2
    [13]  - V2UnitV3

    Resource pressure per iteration:
    [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
     -      -      -      -     0.33   0.33   0.34   0.33   0.33   0.34   0.34   0.33   0.33    -      -      -      -

    Resource pressure by instruction:
    [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
     -      -      -      -      -      -      -     0.33   0.33   0.17   0.17    -      -      -      -      -      -     adrp     x0, target
     -      -      -      -      -      -      -      -      -     0.17   0.17   0.33   0.33    -      -      -      -     movk     x0, #39612, lsl #32
     -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -     ldr      x1, [x0]

    ===================================================================================================================================================

    $ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-v2 asm2.S

    Iterations:        100
    Instructions:      400
    Total Cycles:      59
    Total uOps:        400

    Dispatch Width:    16
    uOps Per Cycle:    6.78
    IPC:               6.78
    Block RThroughput: 0.5

    Instruction Info:
    [1]: #uOps
    [2]: Latency
    [3]: RThroughput
    [4]: MayLoad
    [5]: MayStore
    [6]: HasSideEffects (U)

    [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
     1      1     0.17                        mov   x0, #20014547599360
     1      1     0.17                        movk  x0, #22136, lsl #16
     1      1     0.17                        movk  x0, #39612
     1      4     0.33    *                   ldr   x1, [x0]

    Resources:
    [0.0] - V2UnitB
    [0.1] - V2UnitB
    [1.0] - V2UnitD
    [1.1] - V2UnitD
    [2]   - V2UnitL2
    [3.0] - V2UnitL01
    [3.1] - V2UnitL01
    [4]   - V2UnitM0
    [5]   - V2UnitM1
    [6]   - V2UnitS0
    [7]   - V2UnitS1
    [8]   - V2UnitS2
    [9]   - V2UnitS3
    [10]  - V2UnitV0
    [11]  - V2UnitV1
    [12]  - V2UnitV2
    [13]  - V2UnitV3

    Resource pressure per iteration:
    [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
     -      -      -      -     0.33   0.33   0.34   0.50   0.50   0.50   0.50   0.50   0.50    -      -      -      -

    Resource pressure by instruction:
    [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
     -      -      -      -      -      -      -      -      -     0.33   0.33   0.17   0.17    -      -      -      -     mov      x0, #20014547599360
     -      -      -      -      -      -      -     0.17   0.17   0.16   0.16   0.17   0.17    -      -      -      -     movk     x0, #22136, lsl #16
     -      -      -      -      -      -      -     0.33   0.33   0.01   0.01   0.16   0.16    -      -      -      -     movk     x0, #39612
     -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -     ldr      x1, [x0]

    ===================================================================================================================================================

</details>

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1946989009