RFR: 8320318: ObjectMonitor Responsible thread [v3]

Erik Österlund eosterlund at openjdk.org
Mon Sep 23 10:14:50 UTC 2024


On Wed, 18 Sep 2024 18:29:23 GMT, Fredrik Bredberg <fbredberg at openjdk.org> wrote:

>> Removed the concept of an ObjectMonitor Responsible thread.
>> 
>> The reason to have an ObjectMonitor Responsible thread was to avoid threads getting stranded due to a hole in the successor protocol. This hole was there because adding the necessary memory barrier was considered too expensive some 20 years ago.
>> 
>> The ObjectMonitor Responsible thread code adds complexity, and doing timed parks just to avoid getting stranded is not the way forward. More info about the problems with the ObjectMonitor responsible thread can be found in [JDK-8320318](https://bugs.openjdk.org/browse/JDK-8320318).
>> 
>> After removing the ObjectMonitor Responsible thread we see increased performance on all supported platforms except Windows. [JDK-8339730](https://bugs.openjdk.org/browse/JDK-8339730) has been created to handle this.
>> 
>> Passes tier1-tier7 on supported platforms.
>> x64, AArch64, Riscv64, ppc64le and s390x pass ok on the test/micro/org/openjdk/bench/vm/lang/LockUnlock.java test.
>> Arm32 and Zero don't need any changes as far as I can tell.
>
> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update two, after the review

This looks generally good to me. I found some weirdness, but I think we can fix it after this goes in. Some nits too, but I don't need to see an updated patch for those. Thanks for fixing this!

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 222:

> 220:   // Check if the entry lists are empty.
> 221:   ldr(rscratch1, Address(tmp, ObjectMonitor::EntryList_offset()));
> 222:   ldr(tmpReg, Address(tmp, ObjectMonitor::cxq_offset()));

What strikes me a bit is that the EntryList vs cxq loads are ordered inconsistently between the runtime and the different ports. Notably, the x86 port loads cxq first, while here, for example, we load EntryList first. This adds some cognitive overhead when reasoning about the implications, at least for me. In particular, with the ordering we see here, where EntryList is loaded first, the following race is possible (which is not possible in the x86 intrinsic):
1. Thread 1 enters the monitor.
2. Thread 2 tries to enter the monitor, fails, adds itself to cxq, tries again, and eventually parks.
3. Thread 1 starts exiting the monitor and runs all the instructions up to and including the EntryList load above on line 221, which yields an empty EntryList.
4. Thread 3 enters the monitor since it is no longer owned.
5. Thread 3 exits the monitor, and moves cxq to EntryList in the process while still holding the lock.
6. Thread 1 runs the next instruction loading cxq on line 222, resulting in an empty list.
7. Thread 1 draws the conclusion that there is no thread waiting for the monitor, even though Thread 2 has been waiting for the monitor since before Thread 1 released it.

Even though this choice of ordering between reading EntryList and cxq allows a waiter to go completely unnoticed throughout the monitorexit procedure due to unfortunate interleavings, which is weird, the protocol deals with it just fine. Since Thread 3 managed to enter the monitor, the responsibility of ensuring liveness of the monitor falls to Thread 3, which will make Thread 2 the successor and unpark Thread 2.

I'm not proposing we use the "other", more intuitive ordering of loading cxq first, which removes this race so we don't have to think about it. Enforcing that ordering requires an extra loadload fence, which isn't free. What I would prefer to see is that we at least use a consistent ordering, so we don't have to wonder why some platforms use one ordering while others use another, and which races affect only some platforms. And I'd prefer that this less obvious ordering of reading EntryList before cxq be the championed one, with a comment in the shared code explaining why it's okay that waiters slip undetected between the two loads. Without explicit fencing, that's an order that has to be valid anyway, unless we jam in some loadload fences.

Anyway, I think this could be a follow-up cleanup, but doesn't really need to be fixed in this patch.

src/hotspot/share/runtime/objectMonitor.cpp line 335:

> 333:       // ObjectMonitor::deflate_monitor() will decrement contentions
> 334:       // after it recognizes that the async deflation was cancelled.
> 335:       contention_mark.extend();

This is a bit scary. Previously, the locking_thread would already have one stake in _contentions, and after cancelling deflation it knows that stake will be asynchronously released by the deflation thread. This means that in practice we have zero stakes left in the contention counter after the CAS that swings the owner to the locking_thread succeeds. Yet the ObjectMonitorContentionMark RAII object passed in from the caller looks like it guarantees there is a stake in _contentions throughout its scope. By explicitly adding to contentions, the locking_thread reclaimed its stake in the _contentions counter while holding the lock, guaranteeing that deflation is indeed impossible until the end of the scope.
The new extend() mechanism seems to consider that equivalent to not incrementing here and not decrementing later. +1 - 1 == 0, right? However, that is not equivalent. HotSpot math works in mysterious ways. The old mechanism guaranteed that the linearization point for deflation is blocked until you get out of scope. The new mechanism does not. Instead, it's up to the user to reason about how long deflation is blocked out. It's blocked as long as the monitor is held, naturally, but if the monitor is released while the scope is still active, there is no stake in the _contentions counter and deflation would succeed if it tried again. Things like the _waiters counter might tell the deflation heuristics not to try again, and the absence of a safepoint poll might also prevent deflation from retrying in a timely fashion. But the point is that the linearization point for deflation is no longer blocked, and the abstraction looks safer than it is.

In practice, I don't know of any bug because of this; deflation seems to be blocked out in other ways. But I would really prefer that extend() add to contentions and that the destructor always decrement. That way the contract is stronger and it's easier to convince ourselves that we haven't messed up. The scope would on its own prevent deflation, regardless of how it is used, which cannot be guaranteed with the current extend() implementation.

Again, this might be better suited for a follow-up RFE.

src/hotspot/share/runtime/objectMonitor.hpp line 226:

> 224:   static ByteSize succ_offset()        { return byte_offset_of(ObjectMonitor, _succ); }
> 225:   static ByteSize EntryList_offset()   { return byte_offset_of(ObjectMonitor, _EntryList); }
> 226:   static ByteSize contentions_offset() { return byte_offset_of(ObjectMonitor, _contentions); }

Looks like a leftover from the previous approach that tried to deal with deflation races in assembly code. It should probably be removed.

-------------

Marked as reviewed by eosterlund (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/19454#pullrequestreview-2292382088
PR Review Comment: https://git.openjdk.org/jdk/pull/19454#discussion_r1771021064
PR Review Comment: https://git.openjdk.org/jdk/pull/19454#discussion_r1771096278
PR Review Comment: https://git.openjdk.org/jdk/pull/19454#discussion_r1751894926


More information about the hotspot-dev mailing list