RFR: 8325937: runtime/handshake/HandshakeDirectTest.java causes "monitor end should be strictly below the frame pointer" assertion failure on AArch64

Thu Oct 3 17:22:38 UTC 2024

On Thu, 3 Oct 2024 14:31:46 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

> > I guess I must confess to my real motivation here.
> > We have a sort-of-knee-jerk reaction to finding concurrency bugs, sometimes caused by a misunderstanding of primitive sematics, of sprinkling fences around just in case rather than fixing the mistakes. I guess the main motivation is fear: we just don't know what is lurking here. But all that does is paper over the cracks, and it makes it very hard to reason about the code. There is no way to know whether some logic relies on a side effect of locking. It's much better _for the reader_ if we make all of this explicit, rather than implied.
> 
> I totally agree with you that we need the semantics to be made explicit and clear so we are not all assuming different things, introducing more of these kind of bugs. In this discussion we have seen a wide range of assumptions about what the semantics of pthread_mutex_lock is, ranging from acquire(), to some sort of SC interpretation to full fence(). And nobody is seemingly right or wrong as the spec is rather vague and leaves a lot to our imagination. So the "misunderstanding" is not surprising, when it hasn't been specified.

I'm not sure I agree with the lack of specification. If used for mutual exclusion, mutexes do what they're supposed to do. This is to say, make sure you only access shared variables while in the critical section between lock and unlock, and there is no data race. The result will be SC. What is not specified is the side effects of a mutex implementation on racy accesses to variables outside the critical section.

> In terms of what the implementation actually does though, it seems like glibc pthread_mutex currently uses acquire/release only just like std::mutex. That seemingly blows up the roach hotel model, SC and fence expectations all together.

That's not strictly true in the case of AArch64, because it has some additional guarantees. But never mind that, on some platforms  pthread_mutex may be as weak as acq/rel.

> So if the path forward is to retroactively accept what the rather relaxed glibc implementation gives us as the "correct" contract of what locking should do, and dismiss anything else is "misunderstanding", then we can say goodbye to expectations a lot of people have had when building HotSpot over the years.

Sure. I've always found these expectations a lot of people have had to be utterly baffling, which is perhaps where our difference originates. So it's not so much a matter of disagreeing about facts.

The problem - from my point of view - is using side effects of the implementation of some operation to provide us with essential synchronization, even in cases where that operation was unrelated. That is an anti-pattern, IMO: programming for the implementation, not the specification.

> I think that seems a bit reckless.

I get that.

> If we do go in that direction, I would hope we could get there in a more incremental fashion after actually looking at the code and reasoning a bit about the implications. So for now, let's just add Coleen's fence. But I think we really ought to think very carefully about what we decide to do about this going forward. I'd personally like to think about it a bit as the implications are far from obvious to me.

Fair enough, we need to think about it. And I'd like to be part of that conversation, If I may. What I'm trying to avoid, as you no doubt realize, is full fences being added to all concurrency primitives, in effect dragging more recent architectures down to legacy x86.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21295#issuecomment-2391934523