RFR: 8363620: AArch64: reimplement emit_static_call_stub()
Fei Gao
fgao at openjdk.org
Mon Oct 27 10:58:04 UTC 2025
On Mon, 27 Oct 2025 10:40:46 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> In the existing implementation, the static call stub typically emits a sequence like:
>> `isb; movk; movz; movz; movk; movz; movz; br`.
>>
>> This patch reimplements it using a more compact and patch-friendly sequence:
>>
>> ldr x12, Label_data
>> ldr x8, Label_entry
>> br x8
>> Label_data:
>> 0x00000000
>> 0x00000000
>> Label_entry:
>> 0x00000000
>> 0x00000000
>>
>> The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an `isb`[1], which are both relatively expensive operations.
>>
>> While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
>>
>> A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%.
>>
>>
>> Benchmark (length) Mode Cnt Master Patch Units
>> StaticCallStubFar.callCompiled 1000 avgt 5 39.346 22.474 us/op
>> StaticCallStubFar.callCompiled 10000 avgt 5 390.05 218.478 us/op
>> StaticCallStubFar.callCompiled 100000 avgt 5 3869.264 2174.001 us/op
>> StaticCallStubNear.callCompiled 1000 avgt 5 39.093 22.582 us/op
>> StaticCallStubNear.callCompiled 10000 avgt 5 387.319 217.398 us/op
>> StaticCallStubNear.callCompiled 100000 avgt 5 3855.825 2206.923 us/op
>>
>>
>> All tests in Tier1 to Tier3, under both release and debug builds, have passed.
>>
>> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
>
>> /open
>
> Why? Do you have any reasonable expectation of a better way to do it?
Hi @theRealAph, I’ve been thinking more about this topic and want to share a few updated observations.
In the current implementation of the `static call stub`, the executing thread runs the following instructions:
```
[main code]
L0:
  bl trampoline_stub
  ... {post call}

trampoline_stub:
  ldr x8, callee_address
  br x8
callee_address:
  0x12345678
  0x12345678

static_stub:
  isb
  mov x12, #0x0
  movk x12, #0x0, lsl #16
  movk x12, #0x0, lsl #32
  mov x8, #0x0
  movk x8, #0x0, lsl #16
  movk x8, #0x0, lsl #32
  br x8
```
The writing thread performs the following steps:
1. Updates the `MOV` instructions in `static_stub` with new values.
2. Calls `ICache::invalidate_range()`.
3. Writes a `bl static_stub` instruction at `L0`.
(Note: both `static_stub` and `trampoline_stub` reside in the stub section, which is directly reachable by `bl`.)
In the existing implementation, when the writing thread writes the `bl static_stub` instruction at `L0`, it also updates the
`callee_address` in `trampoline_stub` to the address of `static_stub`. See https://github.com/openjdk/jdk/blob/cc9483b4da1a0f65f8773d0c7f35f2e6a7e1bd4f/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L92
If we modify the code slightly as follows:
```c++
// Patch the call.
if (reachable) {
  set_destination(dest);
} else {
  // Patch the constant in the call's trampoline stub.
  address trampoline_stub_addr = get_trampoline();
  if (trampoline_stub_addr != nullptr) {
    assert(!is_NativeCallTrampolineStub_at(dest), "chained trampolines");
    nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(dest);
  }
  assert(trampoline_stub_addr != nullptr, "we need a trampoline");
  set_destination(trampoline_stub_addr);
}
```
Then the `callee_address` in `trampoline_stub` would no longer be expected to point to `static_stub`.
According to an older version of the Arm ARM (https://developer.arm.com/documentation/ddi0487/ia),
section A2.2.2, there is an architectural guarantee called Prefetch Speculation Protection (PSP).
From the Arm community blog: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads:
> Prefetch speculation protection is an architectural guarantee
> that makes some minimally-synchronized code updates possible.
>
> “Prefetch speculation protection” is a name from old editions
> of the Arm ARM (such as [DDI0487 I.a](https://developer.arm.com/documentation/ddi0487/ia)).
> In recent versions, the required behaviours are covered by the
> formal concurrency model (as in section B2.3 of DDI0487 [Latest](https://developer.arm.com/documentation/ddi0487/latest/)),
> but this specific set of properties no longer has its own name.
Essentially, when a writing thread rewrites a direct branch with an updated direct branch, and another thread is concurrently executing that modified code, PSP ensures that the executing thread does not fetch stale instructions.
According to section B2.3.9, Ordering of instruction fetches, of the [specification](https://developer.arm.com/documentation/ddi0487/ia):

If we update the `MOV`s in `static_stub`, and ensure coherence between the data writes and instruction fetches within the same shareability domain before writing `bl static_stub` at `L0`, then for any observer:
1. An instruction fetched from `L0` appears in program order before an instruction fetched from `static_stub`.
2. If the instruction fetched from `L0` is the updated direct branch, then the subsequent fetch from `static_stub` observes the updated `MOV` values.

These two properties imply that even without the `isb` in `static_stub`, if the executing thread fetches the updated direct branch at `L0` that jumps to `static_stub`, PSP guarantees it won’t execute stale `MOV`s from `static_stub`.
A straightforward way to confirm this behavior is to run a litmus test in [herd7](https://diy.inria.fr/www/).
The author of the blog post, @jacobbramley, also shared an example litmus test:
```
AArch64 PrefetchSpeculationProtection
(* Copyright 2025 Arm Limited *)
(*
  A canonical example of Prefetch Speculation Protection. P0 writes (and
  flushes) some code at P1's `new`, then rewrites P1's BL to point to it.
  If P1 observes the new BL before it executes, then PSP guarantees that it
  also executes the replaced `new` code.
  If P1 doesn't observe the new BL, it calls `old`, which hasn't changed.
*)
{
  0:X0=instr:"MOV w0, #2";
  0:X1=instr:"BL .+16";
  0:X10=P1:new;
  0:X11=P1:L0;
}
 P0            | P1         ;
 STR W0, [X10] |L0:         ;
 DC CVAU, X10  | BL old     ;
 DSB ISH       | B end      ;
 IC IVAU, X10  |old:        ;
 DSB ISH       | MOV w0, #0 ;
               | RET        ;
 STR W1, [X11] |new:        ;
               | MOV w0, #1 ;
               | RET        ;
               |end:        ;
exists(1:X0=0 \/ 1:X0=2)
```
Running it with herd7 produces the following output:
```
Test PrefetchSpeculationProtection Allowed
States 2
1:X0=0;
1:X0=2;
Ok
Witnesses
Positive: 2 Negative: 0
Flag Assuming-common-inner-shareable-domain
Flag Assuming-no-two-modified-instructions-are-on-the-same-cache-line
Condition exists (1:X0=0 \/ 1:X0=2)
Observation PrefetchSpeculationProtection Always 2 0
```
This result confirms that Prefetch Speculation Protection still applies under the formal model: if the executing thread fetches the updated `BL` at `L0`, it also executes the updated `MOV` at `new`, never the stale one. By the same reasoning, even without an `isb` in `static_stub`, if the executing thread observes the updated `bl static_stub` at `L0` first, PSP ensures that it will also fetch and execute the updated `MOV` instructions in `static_stub`.
What do you think? Really appreciate any feedback! Thanks!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26638#issuecomment-3450683093
More information about the hotspot-dev
mailing list