RFR: 8363620: AArch64: reimplement emit_static_call_stub() [v2]

Andrew Haley aph at openjdk.org
Sun Nov 30 11:42:51 UTC 2025


On Fri, 28 Nov 2025 10:17:50 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In the existing implementation, the static call stub typically emits a sequence like:
>> `isb; movk; movz; movz; movk; movz; movz; br`.
>> 
>> This patch reimplements it using a more compact and patch-friendly sequence:
>> 
>> ldr x12, Label_data
>> ldr x8, Label_entry
>> br x8
>> Label_data:
>>   0x00000000
>>   0x00000000
>> Label_entry:
>>   0x00000000
>>   0x00000000
>> 
>> The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an `isb`[1], which are both relatively expensive operations.
>> 
>> While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
>> 
>> A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%.
>> 
>> 
>> Benchmark                       (length)   Mode   Cnt Master     Patch      Units
>> StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
>> StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
>> StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
>> StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
>> StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
>> StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op
>> 
>> 
>> All tests in Tier1 to Tier3, under both release and debug builds, have passed.
>> 
>> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Patch 'isb' to 'nop'
>  - Merge branch 'master' into reimplement-static-call-stub
>  - 8363620: AArch64: reimplement emit_static_call_stub()
>    
>    In the existing implementation, the static call stub typically
>    emits a sequence like:
>    `isb; movk; movz; movz; movk; movz; movz; br`.
>    
>    This patch reimplements it using a more compact and patch-friendly
>    sequence:
>    ```
>    ldr x12, Label_data
>    ldr x8, Label_entry
>    br x8
>    Label_data:
>      0x00000000
>      0x00000000
>    Label_entry:
>      0x00000000
>      0x00000000
>    ```
>    The new approach places the target addresses adjacent to the code
>    and loads them dynamically. This allows us to update the call
>    target by modifying only the data in memory, without changing any
>    instructions. This avoids the need for I-cache flushes or
>    issuing an `isb`[1], which are both relatively expensive
>    operations.
>    
>    While emitting direct branches in static stubs for small code
>    caches can save 2 instructions compared to the new implementation,
>    modifying those branches still requires I-cache flushes or an
>    `isb`. This patch unifies the code generation by emitting the
>    same static stubs for both small and large code caches.
>    
>    A microbenchmark (StaticCallStub.java) demonstrates a performance
>    uplift of approximately 43%.
>    
>    Benchmark                       (length)   Mode   Cnt Master     Patch      Units
>    StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
>    StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
>    StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
>    StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
>    StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
>    StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op
>    
>    All tests in Tier1 to Tier3, under both release and debug builds,
>    have passed.
>    
>    [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads

I think I'd do something like this. It does mean that we execute an unnecessary jump to the next instruction when we jump directly to the stub, but it maintains the invariant that the trampoline destination and the call destination are the same, so it does not matter how a call reaches the static call stub. I think this invariant is worth keeping.

Remember that we're jumping from compiled code to the _interpreter_, which executes thousands of jumps! A single extra well-predicted branch won't hurt.
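To make that concrete, here is my reading of the stub once `set_to_interpreted` has run. This is only a sketch: it assumes `NativeJump::instruction_size` is a single 4-byte instruction, so the branch planted at the head of the stub simply falls through to the next word (that branch is the "extra jump" above):

    b .+4                // planted atomically by NativeJump::insert
    ldr x12, Label_data  // load Method* into rmethod
    ldr x8, Label_entry
    br x8
    Label_data:
      <Method* of callee>
    Label_entry:
      <interpreter entry>

The trampoline and the call itself can then both point at the first word of the stub.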


diff --git a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
index 3f3b8d28408..87887bb0a25 100644
--- a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
@@ -168,12 +168,16 @@ void CompiledDirectCall::set_to_interpreted(const methodHandle& callee, address
   //                             |  B end                    ;
   //                             |end:                       ;
   // forall (1:X0=1 / 1:X0=3)

We can't use `Assembler` to do this patching because it's not atomic.

-  CodeBuffer stub_first_instruction(stub, Assembler::instruction_size);
-  Assembler assembler(&stub_first_instruction);
-  assembler.nop();
+
+  NativeJump::insert(stub, stub + NativeJump::instruction_size);
+
+  address trampoline_stub_addr = _call->get_trampoline();
+  if (trampoline_stub_addr != nullptr) {
+    nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(stub);
+  }
 
   // Update jump to call.
-  set_destination_mt_safe(stub);
+  _call->set_destination(stub);
 }
 
 void CompiledDirectCall::set_stub_to_clean(static_stub_Relocation* static_stub) {
diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
index f2003dd9b55..22e7dcc2552 100644
--- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
@@ -238,6 +238,14 @@ void NativeJump::set_jump_destination(address dest) {
   ICache::invalidate_range(instruction_address(), instruction_size);
 };
 
+// Atomic insertion of jump to target.
+void NativeJump::insert(address code_pos, address target) {
+  intptr_t offset = target - code_pos;
+  uint32_t insn = 0b000101 << 26;
+  Instruction_aarch64::spatch((address)&insn, 25, 0, offset >> 2);
+  AtomicAccess::store((volatile uint32_t*)code_pos, insn);
+}
+  
 //-------------------------------------------------------------------
 
 address NativeGeneralJump::jump_destination() const {

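For anyone checking the branch word by hand: below is a standalone sketch in plain C++, not JDK code (`encode_b` is a hypothetical helper), mirroring the layout built in `NativeJump::insert` above: opcode 0b000101 in bits 31..26 and a signed 26-bit word offset in bits 25..0, for a reach of +/-128 MiB. As I read the Arm ARM's rules on concurrent modification and execution of instructions, B (like NOP and ISB) may be patched while another thread is executing it, which is what makes the single `AtomicAccess::store` safe here.

    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical standalone helper mirroring NativeJump::insert:
    // AArch64 `B <imm26>` is 0b000101 in bits 31..26 with a signed
    // word offset (bytes / 4) in bits 25..0.
    static uint32_t encode_b(intptr_t offset_bytes) {
      assert((offset_bytes & 3) == 0);                  // targets are word-aligned
      intptr_t imm26 = offset_bytes >> 2;
      assert(imm26 >= -(1 << 25) && imm26 < (1 << 25)); // +/-128 MiB reach
      return (0b000101u << 26) | ((uint32_t)imm26 & 0x03ffffffu);
    }

    int main() {
      // The degenerate `b .+4` planted at the head of the stub:
      // a branch to the immediately following instruction.
      printf("0x%08x\n", encode_b(4)); // prints 0x14000001
      return 0;
    }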
-------------

PR Comment: https://git.openjdk.org/jdk/pull/26638#issuecomment-3592480392

