RFR: 8325937: runtime/handshake/HandshakeDirectTest.java causes "monitor end should be strictly below the frame pointer" assertion failure on AArch64

Wed Oct 2 04:38:34 UTC 2024

On Wed, 2 Oct 2024 03:16:56 GMT, Patricio Chilano Mateo <pchilanomate at openjdk.org> wrote:

>> Add missing StoreLoad fences in handshaking code to match safepoint code.  Thanks to @pchilano for finding this bug.
>> 
>> Tested with tier1-4 and tier8 which has Kitchensink in it.
>
> I stepped through the code on one of the Neoverse-N1 machines we have and I don't see a full fence when grabbing the mutex, just acquire semantics through the casa instruction (I skipped the first instructions):
> 
> 
>    0xfffff7eaa330 <pthread_mutex_trylock+80>:	mov	x2, x20
>    0xfffff7eaa334 <pthread_mutex_trylock+84>:	mov	w1, #0x1                   	 // #1
>    0xfffff7eaa338 <pthread_mutex_trylock+88>:	mov	w0, #0x0                   	// #0
>    0xfffff7eaa33c <pthread_mutex_trylock+92>:	mov	w19, #0x0                   	// #0
>    0xfffff7eaa340 <pthread_mutex_trylock+96>:	bl	0xfffff7eb4b80 <__aarch64_cas4_acq>
>    0xfffff7eaa344 <pthread_mutex_trylock+100>:	cbnz	w0, 0xfffff7eaa730 <pthread_mutex_trylock+1104>
>    0xfffff7eaa348 <pthread_mutex_trylock+104>:	mov	w0, w19
>    0xfffff7eaa34c <pthread_mutex_trylock+108>:	ldp	x19, x20, [sp, #16]
>    0xfffff7eaa350 <pthread_mutex_trylock+112>:	ldp	x21, x22, [sp, #32]
>    0xfffff7eaa354 <pthread_mutex_trylock+116>:	ldp	x23, x24, [sp, #48]
>    0xfffff7eaa358 <pthread_mutex_trylock+120>:	ldp	x29, x30, [sp], #96
>    0xfffff7eaa35c <pthread_mutex_trylock+124>:	ret
> 
>    0xfffff7eb4b80 <__aarch64_cas4_acq>:	adrp	x16, 0xfffff7ed4000 <__pthread_keys+15608>
>    0xfffff7eb4b84 <__aarch64_cas4_acq+4>:	ldrb	w16, [x16, #920]
>    0xfffff7eb4b88 <__aarch64_cas4_acq+8>:	cbz	w16, 0xfffff7eb4b94 <__aarch64_cas4_acq+20>
>    0xfffff7eb4b8c <__aarch64_cas4_acq+12>:	casa	w0, w1, [x2]
>    0xfffff7eb4b90 <__aarch64_cas4_acq+16>:	ret
> 
> 
> We could double-check the same thing on the Neoverse-N2 machine where we are seeing this issue, but I think this already shows that we can't expect a full fence. In general, that's how I have always thought about pthread_mutex_lock/unlock: as providing acquire/release semantics and not a full fence [1], although in practice it probably would.
> 
> As for why we are not seeing the crash on the N1, my guess is that the casa instruction might be implemented with stronger guarantees than what is specified on the N1, but not on the N2.
> 
> [1] https://preshing.com/20120913/acquire-and-release-semantics/

Thanks for checking, @pchilano. As @dholmes-ora mentioned, POSIX doesn't really specify how strong a memory ordering you can assume. So relaxing it such that accesses in the critical section may float above the lock acquisition is an "interesting choice". I have seen a whole bunch of code that assumes that doesn't happen.

I suppose we have the choice of adding a fence to lock() and friends on ARM, or discovering, through interesting core files, where else we expected accesses not to float out of critical sections, as if locking in the JVM were some kind of slippery water park.
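
For illustration only, the store-then-load pattern at stake can be sketched in portable C++. This is not the HotSpot handshake code: the names `count_storeload_violations`, `armed`, and `poll` are hypothetical, and the sketch assumes only what the C++ memory model guarantees. Each thread stores to its own flag and then loads the other's. With a sequentially consistent fence between the store and the load, at least one of the two threads must observe the other's store. With only acquire/release ordering, both loads are allowed to return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch, not JVM code. Runs the store-then-load race `rounds`
// times and returns how many rounds ended with BOTH threads reading 0, the
// outcome that a StoreLoad (full) fence rules out.
int count_storeload_violations(int rounds) {
  assert(rounds > 0);
  int violations = 0;
  for (int i = 0; i < rounds; ++i) {
    std::atomic<int> armed{0};  // stands in for "operation installed"
    std::atomic<int> poll{0};   // stands in for "target state published"
    int saw_poll = -1;
    int saw_armed = -1;

    std::thread requester([&] {
      armed.store(1, std::memory_order_relaxed);
      // Full fence: orders the store above before the load below.
      std::atomic_thread_fence(std::memory_order_seq_cst);
      saw_poll = poll.load(std::memory_order_relaxed);
    });
    std::thread target([&] {
      poll.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);
      saw_armed = armed.load(std::memory_order_relaxed);
    });

    // join() synchronizes with thread completion, so the plain ints are
    // safe to read here.
    requester.join();
    target.join();
    if (saw_poll == 0 && saw_armed == 0) {
      ++violations;
    }
  }
  return violations;
}
```

Calling `count_storeload_violations(200)` should return 0 on any conforming implementation. Replacing the two fences with acquire or release fences removes that guarantee, although whether a given machine ever exhibits the both-zero outcome depends on the hardware.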

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21295#issuecomment-2387609074