RFR: 8363620: AArch64: reimplement emit_static_call_stub()

Thu Aug 7 16:47:19 UTC 2025

On Wed, 6 Aug 2025 09:30:07 GMT, Andrew Haley <aph at openjdk.org> wrote:

>>> Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.
>> 
>> I thought other threads are modifying the data, which is the reason why we needed the ISB.
>
>> > Unaligned memory access may be non-atomic, although in this case, other threads aren’t modifying the data.
>> 
>> I thought other threads are modifying the data, which is the reason why we needed the ISB.
> 
> Yes, exactly. Other threads are executing this nmethod while we are modifying its static call stub.
> 
> We hold the lock while we're doing this, so any thread racing with us to _modify_ the call stub would be blocked. However, another thread could execute the call we just patched but not fully observe the changes we made to the stub, which is why we need the ISB.
> 
> We have a similar situation with this PR, but with data rather than instruction memory. We need to make sure that any racing thread observes changes to the stub before it observes the patched call to the stub. There is no ordering between instruction and data memory, which is why the current stub uses only instruction memory, keeping everything right. The ISB ensures that the executing thread observes the changed stub: there is a happens-before from the `ICache::invalidate_range` after patching the stub (but before patching the call) to the `isb` at the start of the stub.
> 
> For this PR to be correctly synchronized, we'd need a similar happens-before from patching the constants used in the stub to executing the stub. I can think of a way to do that, but it's fiddly and may not be worth it.

Thanks for your reviewing @theRealAph @adinn @dean-long .

Also thanks a lot for the explanation — it’s really helpful and clarified why the `isb` is necessary in the current implementation.

This patch is inspired by the trampoline stub design, and I’m trying to understand why that approach works as it does. Happy to be corrected if I’ve got anything wrong.

Initially, we generate a trampoline stub to redirect control flow from the caller to the static call resolver. Once the static stub is patched, we also update the trampoline stub, though in some cases the call may bypass the trampoline entirely and jump directly from the caller to the static stub. When the callee is eventually compiled, we reset the trampoline to once again direct the call back to the resolver. After resolution completes, we then update the trampoline to point to the compiled callee.

My question is: Why isn’t any explicit memory synchronization required before executing the trampoline stub? How do we ensure that any data loaded by the trampoline stub is fully synchronized across multiple threads? I’d appreciate any insights.

Thank you very much!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26638#discussion_r2260874295