RFR: 8363620: AArch64: reimplement emit_static_call_stub() [v2]
Fei Gao
fgao at openjdk.org
Sun Nov 30 13:15:52 UTC 2025
On Sun, 30 Nov 2025 11:40:18 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>>
>> - Patch 'isb' to 'nop'
>> - Merge branch 'master' into reimplement-static-call-stub
>> - 8363620: AArch64: reimplement emit_static_call_stub()
>>
>> In the existing implementation, the static call stub typically
>> emits a sequence like:
>> `isb; movk; movz; movz; movk; movz; movz; br`.
>>
>> This patch reimplements it using a more compact and patch-friendly
>> sequence:
>> ```
>> ldr x12, Label_data
>> ldr x8, Label_entry
>> br x8
>> Label_data:
>> 0x00000000
>> 0x00000000
>> Label_entry:
>> 0x00000000
>> 0x00000000
>> ```
>> The new approach places the target addresses adjacent to the code
>> and loads them dynamically. This allows us to update the call
>> target by modifying only the data in memory, without changing any
>> instructions. This avoids the need for I-cache flushes or
>> issuing an `isb`[1], which are both relatively expensive
>> operations.
>>
>> While emitting direct branches in static stubs for small code
>> caches can save 2 bytes compared to the new implementation,
>> modifying those branches still requires I-cache flushes or an
>> `isb`. This patch unifies the code generation by emitting the
>> same static stubs for both small and large code caches.
>>
>> A microbenchmark (StaticCallStub.java) demonstrates a performance
>> uplift of approximately 43%.
>>
>> Benchmark                         (length)  Mode  Cnt    Master     Patch  Units
>> StaticCallStubFar.callCompiled        1000  avgt    5    39.346    22.474  us/op
>> StaticCallStubFar.callCompiled       10000  avgt    5   390.050   218.478  us/op
>> StaticCallStubFar.callCompiled      100000  avgt    5  3869.264  2174.001  us/op
>> StaticCallStubNear.callCompiled       1000  avgt    5    39.093    22.582  us/op
>> StaticCallStubNear.callCompiled      10000  avgt    5   387.319   217.398  us/op
>> StaticCallStubNear.callCompiled     100000  avgt    5  3855.825  2206.923  us/op
>>
>> All tests in Tier1 to Tier3, under both release and debug builds,
>> have passed.
>>
>> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
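
To make the data-only patching described above concrete, here is a minimal standalone sketch; it is not part of the patch, and `repoint_stub` plus the use of `std::atomic_ref` are stand-ins for HotSpot's own atomic store primitives:

```
// Illustrative sketch only, not HotSpot code: the stub loads its target from
// a 64-bit data slot placed right after the instructions (Label_data above),
// so retargeting the stub is a single aligned 64-bit store to that slot.
// No instruction bytes change, so no I-cache flush or isb is needed when
// patching.
#include <atomic>
#include <cstdint>

// 'stub_data_slot' stands for the address at Label_data; in the real code
// this store would go through HotSpot's atomic access machinery.
static void repoint_stub(uint64_t* stub_data_slot, uint64_t new_target) {
  std::atomic_ref<uint64_t>(*stub_data_slot)
      .store(new_target, std::memory_order_release);
}
```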
>
> I think I'd do something like this. It does mean that we're executing one extra, unnecessary jump when we jump directly to the stub, but it maintains the invariant that the trampoline destination and the call destination are the same, so it does not matter how a call reaches the static call stub. I think this invariant is worth keeping.
>
> Remember that we're jumping from compiled code to the _interpreter_, which does thousands of jumps! A single extra well-predicted branch won't hurt.
>
>
> diff --git a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
> index 3f3b8d28408..87887bb0a25 100644
> --- a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
> +++ b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
> @@ -168,12 +168,16 @@ void CompiledDirectCall::set_to_interpreted(const methodHandle& callee, address
> // | B end ;
> // |end: ;
> // forall (1:X0=1 / 1:X0=3)
>
> We can't use `Assembler` to do this patching because it's not atomic.
>
> - CodeBuffer stub_first_instruction(stub, Assembler::instruction_size);
> - Assembler assembler(&stub_first_instruction);
> - assembler.nop();
> +
> + NativeJump::insert(stub, stub + NativeJump::instruction_size);
> +
> + address trampoline_stub_addr = _call->get_trampoline();
> + if (trampoline_stub_addr != nullptr) {
> + nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(stub);
> + }
>
> // Update jump to call.
> - set_destination_mt_safe(stub);
> + _call->set_destination(stub);
> }
>
> void CompiledDirectCall::set_stub_to_clean(static_stub_Relocation* static_stub) {
> diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
> index f2003dd9b55..22e7dcc2552 100644
> --- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
> +++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
> @@ -238,6 +238,14 @@ void NativeJump::set_jump_destination(address dest) {
> ICache::invalidate_range(instruction_address(), instruction_size);
> };
>
> +// Atomic insertion of jump to target.
> +void NativeJump::insert(address code_pos, address target) {
> + intptr_t offset = target - code_pos;
> + uint32_t insn = 0b000101 << 26;
> + Instruction_aarch64::spatch((address)&insn, 25, 0, offset >> 2);
> + AtomicAccess::store((volatile uint32_t*)code_pos, insn);
> +}
> +
> //-------------------------------------------------------------------
>
> address NativeGeneralJump::jump_destination() const {
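
As a quick sanity check on the `NativeJump::insert` suggestion above, here is a small standalone sketch of the B encoding it builds. The `encode_b` helper, the asserts, and the `main` below are mine rather than HotSpot code, and the example assumes `NativeJump::instruction_size` is 4 (one instruction):

```
// Illustrative sketch only: AArch64 unconditional branch (B) encoding as
// built by the NativeJump::insert in the diff above.
#include <cassert>
#include <cstdint>

static uint32_t encode_b(intptr_t offset_bytes) {
  // B: bits [31:26] = 0b000101, bits [25:0] = signed word offset (imm26).
  assert((offset_bytes & 0x3) == 0 && "branch offset must be 4-byte aligned");
  int64_t imm26 = offset_bytes >> 2;
  assert(imm26 >= -(int64_t(1) << 25) && imm26 < (int64_t(1) << 25) &&
         "offset outside the +/-128 MiB reach of B");
  return (0b000101u << 26) | (static_cast<uint32_t>(imm26) & 0x03ffffffu);
}

int main() {
  // Assuming NativeJump::instruction_size is 4, the suggested
  // NativeJump::insert(stub, stub + NativeJump::instruction_size) emits a
  // branch over a single 4-byte word, i.e. "b .+4" == 0x14000001, written
  // with one atomic 32-bit store.
  assert(encode_b(4) == 0x14000001u);
  return 0;
}
```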
@theRealAph thanks a lot for your explanation! I'll update it soon.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26638#issuecomment-3592540875