RFR: 8363620: AArch64: reimplement emit_static_call_stub()
Fei Gao
fgao at openjdk.org
Tue Aug 5 11:48:08 UTC 2025
On Tue, 5 Aug 2025 10:30:13 GMT, Fei Gao <fgao at openjdk.org> wrote:
> In the existing implementation, the static call stub typically emits a sequence like:
> `isb; movz; movk; movk; movz; movk; movk; br`.
>
> This patch reimplements it using a more compact and patch-friendly sequence:
>
> ldr x12, Label_data
> ldr x8, Label_entry
> br x8
> Label_data:
> 0x00000000
> 0x00000000
> Label_entry:
> 0x00000000
> 0x00000000
>
> The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an `isb`[1], which are both relatively expensive operations.
>
> While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
>
> A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%.
>
>
> Benchmark                        (length)  Mode  Cnt    Master     Patch  Units
> StaticCallStubFar.callCompiled       1000  avgt    5    39.346    22.474  us/op
> StaticCallStubFar.callCompiled      10000  avgt    5    390.05   218.478  us/op
> StaticCallStubFar.callCompiled     100000  avgt    5  3869.264  2174.001  us/op
> StaticCallStubNear.callCompiled      1000  avgt    5    39.093    22.582  us/op
> StaticCallStubNear.callCompiled     10000  avgt    5   387.319   217.398  us/op
> StaticCallStubNear.callCompiled    100000  avgt    5  3855.825  2206.923  us/op
>
>
> All tests in Tier1 to Tier3, under both release and debug builds, have passed.
>
> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 989:
> 987: ldr(rscratch1, far_jump_entry);
> 988: br(rscratch1);
> 989: bind(far_jump_metadata);
I’m considering whether the data here should be 8-byte aligned, similar to what we did for the trampoline stubs: https://github.com/openjdk/jdk/blob/743c821289a6562972364b5dcce8dd29a786264a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L950
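For illustration, here is a rough sketch of what an aligned variant could look like. The label names follow the snippet above, the registers mirror the disassembly below (x12 is rmethod, x8 is rscratch1), and the align() call is the change under discussion, not something in the current patch:

    ldr(rmethod, far_jump_metadata);   // load the Metadata* from the data slot
    ldr(rscratch1, far_jump_entry);    // load the branch target
    br(rscratch1);
    align(wordSize);                   // hypothetical: the three instructions above
                                       // occupy 12 bytes, so this would emit one NOP
                                       // to make the 8-byte data slots aligned
    bind(far_jump_metadata);
    emit_int64(0);                     // patched later; no instruction rewrite needed
    bind(far_jump_entry);
    emit_int64(0);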
To measure the impact of alignment, I tried a small case:
@Param({"1000", "10000", "100000"})
public int length;

static int val0;
static int val1;
static int val2;

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted0(int i) {
    val0 = i;
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted1(int i) {
    val1 = i;
}

@CompilerControl(CompilerControl.Mode.EXCLUDE)
public static void callInterpreted2(int i) {
    val2 = i;
}

@Benchmark
public void callCompiled() {
    for (int i = 0; i < length; i++) {
        callInterpreted0(i); // Make sure these stay excluded from compilation
        callInterpreted1(i);
        callInterpreted2(i);
    }
}
where the static stubs are laid out as follows:
0x0000eee528201dd0: ldr x8, 0x0000eee528201dd8 ; {trampoline_stub}
0x0000eee528201dd4: br x8
0x0000eee528201dd8: .inst 0x2f69fe40 ; undefined
0x0000eee528201ddc: .inst 0x0000eee5 ; undefined
0x0000eee528201de0: ldr x8, 0x0000eee528201de8 ; {trampoline_stub}
0x0000eee528201de4: br x8
0x0000eee528201de8: .inst 0x2f69fe40 ; undefined
0x0000eee528201dec: .inst 0x0000eee5 ; undefined
0x0000eee528201df0: ldr x8, 0x0000eee528201df8 ; {trampoline_stub}
0x0000eee528201df4: br x8
0x0000eee528201df8: .inst 0x2f69fe40 ; undefined
0x0000eee528201dfc: .inst 0x0000eee5 ; undefined
0x0000eee528201e00: ldr x12, 0x0000eee528201e0c ; {static_stub}
0x0000eee528201e04: ldr x8, 0x0000eee528201e14
0x0000eee528201e08: br x8
0x0000eee528201e0c: .inst 0x00000000 ; undefined
0x0000eee528201e10: .inst 0x00000000 ; undefined
0x0000eee528201e14: .inst 0x00000000 ; undefined
0x0000eee528201e18: .inst 0x00000000 ; undefined
0x0000eee528201e1c: ldr x12, 0x0000eee528201e28 ; {static_stub}
0x0000eee528201e20: ldr x8, 0x0000eee528201e30
0x0000eee528201e24: br x8
0x0000eee528201e28: .inst 0x00000000 ; undefined
0x0000eee528201e2c: .inst 0x00000000 ; undefined
0x0000eee528201e30: .inst 0x00000000 ; undefined
0x0000eee528201e34: .inst 0x00000000 ; undefined
0x0000eee528201e38: ldr x12, 0x0000eee528201e44 ; {static_stub}
0x0000eee528201e3c: ldr x8, 0x0000eee528201e4c
0x0000eee528201e40: br x8
0x0000eee528201e44: .inst 0x00000000 ; undefined
0x0000eee528201e48: .inst 0x00000000 ; undefined
0x0000eee528201e4c: .inst 0x00000000 ; undefined
0x0000eee528201e50: .inst 0x00000000 ; undefined
Here are the performance results:
Benchmark                                      (length)  Mode  Cnt     Master    Aligned  Unaligned  Units
StaticCallStub.StaticCallStubFar.callCompiled      1000  avgt    5    114.794     63.117     64.346  us/op
StaticCallStub.StaticCallStubFar.callCompiled     10000  avgt    5   1136.016    618.576    619.629  us/op
StaticCallStub.StaticCallStubFar.callCompiled    100000  avgt    5  11323.945   6191.452   6277.813  us/op
StaticCallStub.StaticCallStubNear.callCompiled     1000  avgt    5    114.335     63.142     64.091  us/op
StaticCallStub.StaticCallStubNear.callCompiled    10000  avgt    5   1140.667    618.653    619.861  us/op
StaticCallStub.StaticCallStubNear.callCompiled   100000  avgt    5  11351.394   6194.946   6195.255  us/op
We have several aspects to consider:
- 8-byte alignment brings a minor performance gain, but it’s not significant compared to the overall improvement achieved by reimplementing the static stubs.
- Unaligned memory accesses may be non-atomic, although in this case no other thread is modifying the data.
- The alignment requirement for trampoline stubs doesn’t always introduce extra NOPs: padding is only needed when trampoline and static stubs are interleaved. In contrast, enforcing 8-byte alignment for static stubs would almost always introduce padding, since the three leading instructions occupy 12 bytes, leaving the data slots only 4-byte aligned (see the layout sketch after this list).
- We should carefully balance the trade-off between code size increase and code hotness (i.e., how frequently the stub code is executed).
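To make the padding point concrete, here is a layout sketch, assuming the stub itself starts on an 8-byte boundary (offsets are illustrative):

// Illustrative static stub layout, assuming the stub starts 8-byte aligned:
//   +0   ldr x12, Label_data    // 4-byte instruction
//   +4   ldr x8,  Label_entry   // 4-byte instruction
//   +8   br  x8                 // 4-byte instruction
//   +12  Label_data             // 12 % 8 == 4: both 8-byte data slots are
//   +20  Label_entry            // only 4-byte aligned as emitted
// With one padding NOP after the br, the slots move to +16 and +24; aligned
// 8-byte accesses are single-copy atomic on AArch64, so a patching store
// could then never be observed torn.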
Any feedback or suggestions would be greatly appreciated.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26638#discussion_r2254100081