RFR: 8325937: runtime/handshake/HandshakeDirectTest.java causes "monitor end should be strictly below the frame pointer" assertion failure on AArch64

Erik Österlund eosterlund at openjdk.org
Tue Oct 1 22:30:34 UTC 2024


On Tue, 1 Oct 2024 18:38:33 GMT, Coleen Phillimore <coleenp at openjdk.org> wrote:

> Add missing StoreLoad fences in handshaking code to match safepoint code.  Thanks to @pchilano for finding this bug.
> 
> Tested with tier1-4 and tier8 which has Kitchensink in it.

Let me try to recall how we ended up where we did...

Back when the first version of thread-local handshakes was shipped, we checked the thread state in a way that allowed false positives, but then took a semaphore and reloaded the thread state. We relied on the semaphore having a fence in its implementation. This ensured there was a fence between arming and reading the state. (cf. https://hg.openjdk.org/jdk/hs/rev/0ce0ac68ace7#l51.346)
Back then, UseSysMemBarrier did not exist, but its predecessor UseMemBar worked fairly similarly, and we ensured that a corresponding big global synchronization hammer would run between arming and reading the thread states (cf. https://hg.openjdk.org/jdk/hs/rev/0ce0ac68ace7#l51.135).
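
To make the shape of that original scheme concrete, here is a minimal sketch in portable C++ (illustrative names only, not the actual HotSpot code of that era): the requester arms the target, does a racy pre-check of the state, and only trusts a reload performed after acquiring a semaphore whose implementation is assumed to contain a full fence.

    #include <atomic>
    #include <semaphore>  // C++20

    enum ThreadState { _thread_in_java, _thread_blocked };

    std::atomic<int>         _poll_word{0};            // armed by the requester
    std::atomic<ThreadState> _state{_thread_in_java};  // published by the target
    std::binary_semaphore    _processing_sem{1};

    bool try_process_remotely() {
      _poll_word.store(1, std::memory_order_relaxed);  // arm the target

      // Racy pre-check: a false positive is harmless here, it only decides
      // whether taking the semaphore is worth it at all.
      if (_state.load(std::memory_order_relaxed) != _thread_blocked) {
        return false;
      }

      _processing_sem.acquire();  // relied on to imply a full fence
      // Reload after the fence; only this read is ordered against the
      // arming store above.
      bool safe = (_state.load(std::memory_order_relaxed) == _thread_blocked);
      if (safe) {
        // ... process the handshake operation on the target's behalf ...
      }
      _processing_sem.release();
      return safe;
    }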

The idea of using global synchronization to read thread states worked okay for cooperative safepointing, which was itself a global synchronization mechanism. But one of the biggest reasons we removed UseMemBar was that we did not want global synchronization for direct handshakes with a single thread.

All this code has been shuffled around significantly since then, but the implicit fence embedded in the locking has remained crucial.

Translating this to what we have today: sys_membarrier is issued instead of page serialization, serving the same purpose. We still use the same fence elision trick in HandshakeState::try_process, where we first read the thread state just to check whether there is any point in taking a mutex. Then we take said mutex, which we also rely on having a fence() in its implementation, after which we reload the thread state to see whether we can remotely process the target thread.
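
Sketched in the same style (made-up names, not the real HandshakeState::try_process), today's shape looks roughly like this. The line of interest is the one after lock(): in the C++ memory model, lock() is only an acquire operation, so if the platform lock implementation does not happen to contain a full fence, an explicit StoreLoad fence is needed before the reload -- conceptually, that explicit fence is what the proposed patch adds.

    #include <atomic>
    #include <mutex>

    std::mutex        _lock;
    std::atomic<int>  _poll_word{0};      // armed by the requester
    std::atomic<bool> _safe_state{false}; // published by the target

    bool try_process() {
      _poll_word.store(1, std::memory_order_relaxed);  // arm

      // Fence-elision trick: a cheap racy read to see whether taking the
      // mutex has any point at all.
      if (!_safe_state.load(std::memory_order_relaxed)) {
        return false;
      }

      std::lock_guard<std::mutex> guard(_lock);
      // If lock() contains a full fence, nothing more is needed here. If it
      // only has acquire semantics, the StoreLoad ordering between the
      // arming store and the reload below must be restored explicitly:
      std::atomic_thread_fence(std::memory_order_seq_cst);

      if (!_safe_state.load(std::memory_order_relaxed)) {
        return false;  // target is running; it will process the poll itself
      }
      // ... remote processing of the target thread ...
      return true;
    }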

In other words, it looks to me like this has been subtle but seemingly correct since its inception, provided that locking the mutex has a fence() in its implementation.

We make the same assumption in many places: surely locking a mutex causes a full fence, so we don't need to fence() right next to a lock(). If that is no longer true on AArch64 because of some silly optimization, it makes me wonder whether we have more "interesting problems" of a similar nature waiting to bite us. If, on the other hand, lock() does indeed fence(), then the proposed patch does not seem to solve the bug being chased down. Both outcomes seem concerning to me.
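
For what the "full fence next to a lock()" question boils down to: an acquire-only lock (e.g. one built from ldaxr/stlxr on AArch64 with no dmb) does not order a prior store against a later load. The bug shape is the classic Dekker/store-buffering litmus test, sketched below in portable C++ as an illustration; with both seq_cst fences in place, the outcome where both threads read 0 is forbidden, and removing either fence makes that outcome observable on AArch64.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> poll_armed{0};  // requester's store
    std::atomic<int> state_safe{0};  // target's store
    int r_requester, r_target;

    void requester() {
      poll_armed.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad
      r_requester = state_safe.load(std::memory_order_relaxed);
    }

    void target() {
      state_safe.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad
      r_target = poll_armed.load(std::memory_order_relaxed);
    }

    int main() {
      std::thread a(requester), b(target);
      a.join();
      b.join();
      // With both fences, at least one side must see the other's store;
      // r_requester == 0 && r_target == 0 would mean a missed handshake.
      assert(r_requester == 1 || r_target == 1);
      return 0;
    }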

So a lot comes down to what lock() actually does on AArch64. We should probably check that to find out which of the two problems we have.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21295#issuecomment-2387195853

