RFR: 8363620: AArch64: reimplement emit_static_call_stub()
Fei Gao
fgao at openjdk.org
Mon Oct 27 10:58:04 UTC 2025
On Mon, 27 Oct 2025 10:40:46 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> In the existing implementation, the static call stub typically emits a sequence like:
>> `isb; movk; movz; movz; movk; movz; movz; br`.
>>
>> This patch reimplements it using a more compact and patch-friendly sequence:
>>
>> ldr x12, Label_data
>> ldr x8, Label_entry
>> br x8
>> Label_data:
>> 0x00000000
>> 0x00000000
>> Label_entry:
>> 0x00000000
>> 0x00000000
>>
>> The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an `isb`[1], which are both relatively expensive operations.
>>
>> While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
>>
>> A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%.
>>
>>
>> Benchmark (length) Mode Cnt Master Patch Units
>> StaticCallStubFar.callCompiled 1000 avgt 5 39.346 22.474 us/op
>> StaticCallStubFar.callCompiled 10000 avgt 5 390.05 218.478 us/op
>> StaticCallStubFar.callCompiled 100000 avgt 5 3869.264 2174.001 us/op
>> StaticCallStubNear.callCompiled 1000 avgt 5 39.093 22.582 us/op
>> StaticCallStubNear.callCompiled 10000 avgt 5 387.319 217.398 us/op
>> StaticCallStubNear.callCompiled 100000 avgt 5 3855.825 2206.923 us/op
>>
>>
>> All tests in Tier1 to Tier3, under both release and debug builds, have passed.
>>
>> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
>
>> /open
>
> Why? Do you have any reasonable expectation of a better way to do it?
Hi @theRealAph, I’ve been thinking more about this topic and want to share a few updated observations.
In the current implementation of the `static call stub`, the executing thread runs the following instructions:
```
[main code]
L0:
  bl trampoline_stub
  ... {post call}

trampoline_stub:
  ldr x8, callee_address
  br x8
callee_address:
  0x12345678
  0x12345678

static_stub:
  isb
  mov x12, #0x0
  movk x12, #0x0, lsl #16
  movk x12, #0x0, lsl #32
  mov x8, #0x0
  movk x8, #0x0, lsl #16
  movk x8, #0x0, lsl #32
  br x8
```
The writing thread performs the following steps:
1. Updates the `MOV` instructions in `static_stub` with new values.
2. Calls `ICache::invalidate_range()`.
3. Writes a `bl static_stub` instruction at `L0`.
(Note: both `static_stub` and `trampoline_stub` reside in the stub section, which is directly reachable by `bl`.)
In the existing implementation, when the writing thread writes the `bl static_stub` instruction at `L0`, it also updates the
`callee_address` in `trampoline_stub` to the address of `static_stub`. See https://github.com/openjdk/jdk/blob/cc9483b4da1a0f65f8773d0c7f35f2e6a7e1bd4f/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L92
If we modify the code slightly as follows:
```c++
// Patch the call.
if (reachable) {
  set_destination(dest);
} else {
  // Patch the constant in the call's trampoline stub.
  address trampoline_stub_addr = get_trampoline();
  if (trampoline_stub_addr != nullptr) {
    assert(!is_NativeCallTrampolineStub_at(dest), "chained trampolines");
    nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(dest);
  }
  assert(trampoline_stub_addr != nullptr, "we need a trampoline");
  set_destination(trampoline_stub_addr);
}
```
Then the `callee_address` in `trampoline_stub` would no longer be expected to point to `static_stub`.
According to an older version of the Arm ARM (https://developer.arm.com/documentation/ddi0487/ia),
section A2.2.2, there is an architectural guarantee called Prefetch Speculation Protection (PSP).
From the Arm community blog: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads:
> Prefetch speculation protection is an architectural guarantee
> that makes some minimally-synchronized code updates possible.
>
> “Prefetch speculation protection” is a name from old editions
> of the Arm ARM (such as [DDI0487 I.a](https://developer.arm.com/documentation/ddi0487/ia)).
> In recent versions, the required behaviours are covered by the
> formal concurrency model (as in section B2.3 of DDI0487 [Latest](https://developer.arm.com/documentation/ddi0487/latest/)),
> but this specific set of properties no longer has its own name.
Essentially, when a writing thread rewrites a direct branch with an updated direct branch, and another thread is concurrently executing that modified code, PSP ensures that the executing thread does not fetch stale instructions.
According to section B2.3.9, Ordering of instruction fetches, of the [specification](https://developer.arm.com/documentation/ddi0487/ia):

If we update the `MOV`s in `static_stub`, and ensure coherence between the data writes and instruction fetches within the same shareability domain before writing `bl static_stub` at `L0`, then for any observer:
1. An instruction fetched from `L0` appears in program order before an instruction fetched from `static_stub`.
2. If the instruction fetched from `L0` is the updated direct branch, then the subsequent fetch from `static_stub` observes the updated `MOV` values.

These two properties imply that even without the `isb` in `static_stub`, if the executing thread fetches the updated direct branch at `L0` that jumps to `static_stub`, PSP guarantees it won’t execute stale `MOV`s from `static_stub`.
A straightforward way to confirm this behavior is to run a litmus test in [herd7](https://diy.inria.fr/www/).
The author of the blog post, @jacobbramley, also shared an example litmus test:
```
AArch64 PrefetchSpeculationProtection
(* Copyright 2025 Arm Limited *)
(*
  A canonical example of Prefetch Speculation Protection. P0 writes (and
  flushes) some code at P1's `new`, then rewrites P1's BL to point to it.
  If P1 observes the new BL before it executes, then PSP guarantees that it
  also executes the replaced `new` code.
  If P1 doesn't observe the new BL, it calls `old`, which hasn't changed.
*)
{
  0:X0=instr:"MOV w0, #2";
  0:X1=instr:"BL .+16";
  0:X10=P1:new;
  0:X11=P1:L0;
}
 P0            | P1         ;
 STR W0, [X10] |L0:         ;
 DC CVAU, X10  | BL old     ;
 DSB ISH       | B end      ;
 IC IVAU, X10  |old:        ;
 DSB ISH       | MOV w0, #0 ;
               | RET        ;
 STR W1, [X11] |new:        ;
               | MOV w0, #1 ;
               | RET        ;
               |end:        ;
exists(1:X0=0 \/ 1:X0=2)
```
Running it with herd7 produces the following output:
```
Test PrefetchSpeculationProtection Allowed
States 2
1:X0=0;
1:X0=2;
Ok
Witnesses
Positive: 2 Negative: 0
Flag Assuming-common-inner-shareable-domain
Flag Assuming-no-two-modified-instructions-are-on-the-same-cache-line
Condition exists (1:X0=0 \/ 1:X0=2)
Observation PrefetchSpeculationProtection Always 2 0
```
This result confirms that Prefetch Speculation Protection still applies under the formal model: if the executing thread fetches the updated `BL` at `L0`, it also executes the updated `MOV` at `new`, never the stale one. By the same reasoning, even without an `isb` in `static_stub`, if the executing thread observes the updated `bl static_stub` at `L0` first, PSP ensures that it will also fetch and execute the updated `MOV` instructions in `static_stub`.
What do you think? Really appreciate any feedback! Thanks!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26638#issuecomment-3450683093
More information about the hotspot-dev
mailing list