RFR: 8363620: AArch64: reimplement emit_static_call_stub() [v2]
Fei Gao
fgao at openjdk.org
Sun Nov 30 13:15:52 UTC 2025
On Sun, 30 Nov 2025 11:40:18 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>>
>> - Patch 'isb' to 'nop'
>> - Merge branch 'master' into reimplement-static-call-stub
>> - 8363620: AArch64: reimplement emit_static_call_stub()
>>
>> In the existing implementation, the static call stub typically
>> emits a sequence like:
>> `isb; movk; movz; movz; movk; movz; movz; br`.
>>
>> This patch reimplements it using a more compact and patch-friendly
>> sequence:
>> ```
>> ldr x12, Label_data
>> ldr x8, Label_entry
>> br x8
>> Label_data:
>> 0x00000000
>> 0x00000000
>> Label_entry:
>> 0x00000000
>> 0x00000000
>> ```
>> The new approach places the target addresses adjacent to the code
>> and loads them dynamically. This allows us to update the call
>> target by modifying only the data in memory, without changing any
>> instructions. This avoids the need for I-cache flushes or
>> issuing an `isb`[1], which are both relatively expensive
>> operations.
>>
>> While emitting direct branches in static stubs for small code
>> caches can save 2 bytes compared to the new implementation,
>> modifying those branches still requires I-cache flushes or an
>> `isb`. This patch unifies the code generation by emitting the
>> same static stubs for both small and large code caches.
>>
>> A microbenchmark (StaticCallStub.java) demonstrates a performance
>> uplift of approximately 43%.
>>
>> Benchmark                         (length)  Mode  Cnt    Master     Patch  Units
>> StaticCallStubFar.callCompiled        1000  avgt    5    39.346    22.474  us/op
>> StaticCallStubFar.callCompiled       10000  avgt    5   390.050   218.478  us/op
>> StaticCallStubFar.callCompiled      100000  avgt    5  3869.264  2174.001  us/op
>> StaticCallStubNear.callCompiled       1000  avgt    5    39.093    22.582  us/op
>> StaticCallStubNear.callCompiled      10000  avgt    5   387.319   217.398  us/op
>> StaticCallStubNear.callCompiled     100000  avgt    5  3855.825  2206.923  us/op
>>
>> All tests in Tier1 to Tier3, under both release and debug builds,
>> have passed.
>>
>> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
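
To make the data-only patching described above concrete, here is a minimal standalone sketch; it is not part of the patch, and `repoint_stub` plus the use of `std::atomic_ref` are stand-ins for HotSpot's own atomic store primitives:

```
// Illustrative sketch only, not HotSpot code: the stub loads its target from
// a 64-bit data slot placed right after the instructions (Label_data above),
// so retargeting the stub is a single aligned 64-bit store to that slot.
// No instruction bytes change, so no I-cache flush or isb is needed when
// patching.
#include <atomic>
#include <cstdint>

// 'stub_data_slot' stands for the address at Label_data; in the real code
// this store would go through HotSpot's atomic access machinery.
static void repoint_stub(uint64_t* stub_data_slot, uint64_t new_target) {
  std::atomic_ref<uint64_t>(*stub_data_slot)
      .store(new_target, std::memory_order_release);
}
```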
>
> I think I'd do something like this. It does mean that we're executing one extra, unnecessary jump when we jump directly to the stub, but it maintains the invariant that the trampoline destination and the call destination are the same, so it does not matter how a call reaches the static call stub. I think this invariant is worth keeping.
>
> Remember that we're jumping from compiled code to the _interpreter_, which does thousands of jumps! A single extra well-predicted branch won't hurt.
>
>
> diff --git a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
> index 3f3b8d28408..87887bb0a25 100644
> --- a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
> +++ b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
> @@ -168,12 +168,16 @@ void CompiledDirectCall::set_to_interpreted(const methodHandle& callee, address
> // | B end ;
> // |end: ;
> // forall (1:X0=1 / 1:X0=3)
>
> We can't use `Assembler` to do this patching because it's not atomic.
>
> - CodeBuffer stub_first_instruction(stub, Assembler::instruction_size);
> - Assembler assembler(&stub_first_instruction);
> - assembler.nop();
> +
> + NativeJump::insert(stub, stub + NativeJump::instruction_size);
> +
> + address trampoline_stub_addr = _call->get_trampoline();
> + if (trampoline_stub_addr != nullptr) {
> + nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(stub);
> + }
>
> // Update jump to call.
> - set_destination_mt_safe(stub);
> + _call->set_destination(stub);
> }
>
> void CompiledDirectCall::set_stub_to_clean(static_stub_Relocation* static_stub) {
> diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
> index f2003dd9b55..22e7dcc2552 100644
> --- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
> +++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
> @@ -238,6 +238,14 @@ void NativeJump::set_jump_destination(address dest) {
> ICache::invalidate_range(instruction_address(), instruction_size);
> };
>
> +// Atomic insertion of jump to target.
> +void NativeJump::insert(address code_pos, address target) {
> + intptr_t offset = target - code_pos;
> + uint32_t insn = 0b000101 << 26;
> + Instruction_aarch64::spatch((address)&insn, 25, 0, offset >> 2);
> + AtomicAccess::store((volatile uint32_t*)code_pos, insn);
> +}
> +
> //-------------------------------------------------------------------
>
> address NativeGeneralJump::jump_destination() const {
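
As a quick sanity check on the `NativeJump::insert` suggestion above, here is a small standalone sketch of the B encoding it builds. The `encode_b` helper, the asserts, and the `main` below are mine rather than HotSpot code, and the example assumes `NativeJump::instruction_size` is 4 (one instruction):

```
// Illustrative sketch only: AArch64 unconditional branch (B) encoding as
// built by the NativeJump::insert in the diff above.
#include <cassert>
#include <cstdint>

static uint32_t encode_b(intptr_t offset_bytes) {
  // B: bits [31:26] = 0b000101, bits [25:0] = signed word offset (imm26).
  assert((offset_bytes & 0x3) == 0 && "branch offset must be 4-byte aligned");
  int64_t imm26 = offset_bytes >> 2;
  assert(imm26 >= -(int64_t(1) << 25) && imm26 < (int64_t(1) << 25) &&
         "offset outside the +/-128 MiB reach of B");
  return (0b000101u << 26) | (static_cast<uint32_t>(imm26) & 0x03ffffffu);
}

int main() {
  // Assuming NativeJump::instruction_size is 4, the suggested
  // NativeJump::insert(stub, stub + NativeJump::instruction_size) emits a
  // branch over a single 4-byte word, i.e. "b .+4" == 0x14000001, written
  // with one atomic 32-bit store.
  assert(encode_b(4) == 0x14000001u);
  return 0;
}
```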
@theRealAph thanks a lot for your explanation! I'll update it soon.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26638#issuecomment-3592540875