RFR: 8363620: AArch64: reimplement emit_static_call_stub()
Fei Gao
fgao at openjdk.org
Tue Aug 5 11:48:08 UTC 2025
On Tue, 5 Aug 2025 10:30:13 GMT, Fei Gao <fgao at openjdk.org> wrote:
> In the existing implementation, the static call stub typically emits a sequence like:
> `isb; movz; movk; movk; movz; movk; movk; br`.
>
> This patch reimplements it using a more compact and patch-friendly sequence:
>
> ldr x12, Label_data
> ldr x8, Label_entry
> br x8
> Label_data:
> 0x00000000
> 0x00000000
> Label_entry:
> 0x00000000
> 0x00000000
>
> The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an `isb`[1], which are both relatively expensive operations.
>
> While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
>
> A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%.
>
>
> Benchmark                        (length)  Mode  Cnt    Master     Patch  Units
> StaticCallStubFar.callCompiled       1000  avgt    5    39.346    22.474  us/op
> StaticCallStubFar.callCompiled      10000  avgt    5    390.05   218.478  us/op
> StaticCallStubFar.callCompiled     100000  avgt    5  3869.264  2174.001  us/op
> StaticCallStubNear.callCompiled      1000  avgt    5    39.093    22.582  us/op
> StaticCallStubNear.callCompiled     10000  avgt    5   387.319   217.398  us/op
> StaticCallStubNear.callCompiled    100000  avgt    5  3855.825  2206.923  us/op
>
>
> All tests in Tier1 to Tier3, under both release and debug builds, have passed.
>
> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 989:
> 987: ldr(rscratch1, far_jump_entry);
> 988: br(rscratch1);
> 989: bind(far_jump_metadata);
I’m considering whether the data here should be 8-byte aligned, similar to what we did for the trampoline stubs: https://github.com/openjdk/jdk/blob/743c821289a6562972364b5dcce8dd29a786264a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L950
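For illustration, here is a rough sketch of what an aligned variant could look like. The label names follow the snippet above, the registers mirror the disassembly below (x12 is rmethod, x8 is rscratch1), and the align() call is the change under discussion, not something in the current patch:

    ldr(rmethod, far_jump_metadata);   // load the Metadata* from the data slot
    ldr(rscratch1, far_jump_entry);    // load the branch target
    br(rscratch1);
    align(wordSize);                   // hypothetical: the three instructions above
                                       // occupy 12 bytes, so this would emit one NOP
                                       // to make the 8-byte data slots aligned
    bind(far_jump_metadata);
    emit_int64(0);                     // patched later; no instruction rewrite needed
    bind(far_jump_entry);
    emit_int64(0);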
To measure the impact of alignment, I tried a small case:
@Param({"1000", "10000", "100000"})
public int length;

static int val0;
static int val1;
static int val2;

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted0(int i) {
    val0 = i;
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted1(int i) {
    val1 = i;
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted2(int i) {
    val2 = i;
}

@Benchmark
public void callCompiled() {
    for (int i = 0; i < length; i++) {
        callInterpreted0(i); // Make sure these stay excluded from compilation
        callInterpreted1(i);
        callInterpreted2(i);
    }
}
where the static stubs are laid out as follows:
0x0000eee528201dd0: ldr x8, 0x0000eee528201dd8 ; {trampoline_stub}
0x0000eee528201dd4: br x8
0x0000eee528201dd8: .inst 0x2f69fe40 ; undefined
0x0000eee528201ddc: .inst 0x0000eee5 ; undefined
0x0000eee528201de0: ldr x8, 0x0000eee528201de8 ; {trampoline_stub}
0x0000eee528201de4: br x8
0x0000eee528201de8: .inst 0x2f69fe40 ; undefined
0x0000eee528201dec: .inst 0x0000eee5 ; undefined
0x0000eee528201df0: ldr x8, 0x0000eee528201df8 ; {trampoline_stub}
0x0000eee528201df4: br x8
0x0000eee528201df8: .inst 0x2f69fe40 ; undefined
0x0000eee528201dfc: .inst 0x0000eee5 ; undefined
0x0000eee528201e00: ldr x12, 0x0000eee528201e0c ; {static_stub}
0x0000eee528201e04: ldr x8, 0x0000eee528201e14
0x0000eee528201e08: br x8
0x0000eee528201e0c: .inst 0x00000000 ; undefined
0x0000eee528201e10: .inst 0x00000000 ; undefined
0x0000eee528201e14: .inst 0x00000000 ; undefined
0x0000eee528201e18: .inst 0x00000000 ; undefined
0x0000eee528201e1c: ldr x12, 0x0000eee528201e28 ; {static_stub}
0x0000eee528201e20: ldr x8, 0x0000eee528201e30
0x0000eee528201e24: br x8
0x0000eee528201e28: .inst 0x00000000 ; undefined
0x0000eee528201e2c: .inst 0x00000000 ; undefined
0x0000eee528201e30: .inst 0x00000000 ; undefined
0x0000eee528201e34: .inst 0x00000000 ; undefined
0x0000eee528201e38: ldr x12, 0x0000eee528201e44 ; {static_stub}
0x0000eee528201e3c: ldr x8, 0x0000eee528201e4c
0x0000eee528201e40: br x8
0x0000eee528201e44: .inst 0x00000000 ; undefined
0x0000eee528201e48: .inst 0x00000000 ; undefined
0x0000eee528201e4c: .inst 0x00000000 ; undefined
0x0000eee528201e50: .inst 0x00000000 ; undefined
Here are the performance results:
Benchmark                                      (length)  Mode  Cnt     Master    Aligned  Unaligned  Units
StaticCallStub.StaticCallStubFar.callCompiled      1000  avgt    5    114.794     63.117     64.346  us/op
StaticCallStub.StaticCallStubFar.callCompiled     10000  avgt    5   1136.016    618.576    619.629  us/op
StaticCallStub.StaticCallStubFar.callCompiled    100000  avgt    5  11323.945   6191.452   6277.813  us/op
StaticCallStub.StaticCallStubNear.callCompiled     1000  avgt    5    114.335     63.142     64.091  us/op
StaticCallStub.StaticCallStubNear.callCompiled    10000  avgt    5   1140.667    618.653    619.861  us/op
StaticCallStub.StaticCallStubNear.callCompiled   100000  avgt    5  11351.394   6194.946   6195.255  us/op
We have several aspects to consider:
- 8-byte alignment brings a minor performance gain, but it’s not significant compared to the overall improvement achieved by reimplementing the static stubs.
- Unaligned memory accesses may be non-atomic, although in this case no other thread is modifying the data.
- The alignment requirement for trampoline stubs doesn’t always introduce extra NOPs: padding is only needed when trampoline and static stubs are interleaved. In contrast, enforcing 8-byte alignment for static stubs would almost always introduce padding, since the three leading instructions occupy 12 bytes, leaving the data slots only 4-byte aligned (see the layout sketch after this list).
- We should carefully balance the trade-off between code size increase and code hotness (i.e., how frequently the stub code is executed).
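To make the padding point concrete, here is a layout sketch, assuming the stub itself starts on an 8-byte boundary (offsets are illustrative):

// Illustrative static stub layout, assuming the stub starts 8-byte aligned:
//   +0   ldr x12, Label_data    // 4-byte instruction
//   +4   ldr x8,  Label_entry   // 4-byte instruction
//   +8   br  x8                 // 4-byte instruction
//   +12  Label_data             // 12 % 8 == 4: both 8-byte data slots are
//   +20  Label_entry            // only 4-byte aligned as emitted
// With one padding NOP after the br, the slots move to +16 and +24; aligned
// 8-byte accesses are single-copy atomic on AArch64, so a patching store
// could then never be observed torn.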
Any feedback or suggestions would be greatly appreciated.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26638#discussion_r2254100081