RFR: 8318757 ObjectMonitor::deflate_monitor fails "assert(prev == old_value) failed: unexpected prev owner=0x0000000000000002, expected=0x0000000000000000"

Stefan Karlsson stefank at openjdk.org
Mon Nov 6 09:53:36 UTC 2023


On Mon, 6 Nov 2023 09:46:11 GMT, Stefan Karlsson <stefank at openjdk.org> wrote:

> A safepointed monitor deflation pass can run interleaved with a paused async monitor deflation pass. The code is not written to handle that situation and asserts when it finds a DEFLATER_MARKER in the owner field. @pchilano also found other issues with having two monitor deflation passes interleaved. More info below ...

There are only two paths to run a safepointed monitor deflation pass:
1) VM_ThreadDump when it collects monitors and looks for deadlocks
2) In the "final audit" when we print information about interesting monitors in the system

This PR deals with (1), which was the cause for the asserts in the bug report.

The following describes what is happening and how we hit the assert:

The stack trace of the asserting safepointed monitor deflation code:

 ObjectMonitor::deflate_monitor
 ObjectSynchronizer::deflate_monitor_list
 ObjectSynchronizer::deflate_idle_monitors
 VM_ThreadDump::doit
 VM_Operation::evaluate
 VMThread::evaluate_operation
 VMThread::inner_execute

And the stack trace of the MonitorDeflationThread performing the paused async monitor deflation:

 SafepointSynchronize::block
 SafepointMechanism::process
 SafepointMechanism::process_if_requested
 ThreadBlockInVMPreprocess<void (JavaThread*)>::~ThreadBlockInVMPreprocess
 ThreadBlockInVM::~ThreadBlockInVM
 ObjectSynchronizer::chk_for_block_req
 ObjectSynchronizer::deflate_monitor_list
 ObjectSynchronizer::deflate_idle_monitors
 MonitorDeflationThread::monitor_deflation_thread_entry

The MonitorDeflationThread is inside the loop that deflates object monitors on the `_in_use_list`, but it hasn't yet reached the point where it tries to unlink the deflated monitors. In the deflation loop we have checks to see if it is time to safepoint (`chk_for_block_req`). At some point it *is* time to block (the stack trace above), and the VMThread starts to run the VM_ThreadDump safepoint operation, which also tries to deflate monitors on the `_in_use_list`. This is what then causes the assert to fail: the MonitorDeflationThread sets the `_owner` to `DEFLATER_MARKER` (value 0x2) and releases the `_object` OopHandle. When the VMThread then finds this ObjectMonitor it sees that `_object` is "cleared" and expects `_owner` to be nullptr, not 0x2.

The assert comes from this path:

  if (obj == nullptr) {
    // If the object died, we can recycle the monitor without racing with
    // Java threads. The GC already broke the association with the object.
    set_owner_from(nullptr, DEFLATER_MARKER);
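The failing transition can be modelled with a toy CAS. This is a minimal sketch, not HotSpot code: `ToyMonitor` and `try_set_owner_from` merely mirror the ObjectMonitor names, assuming the two-value encoding described above (no owner vs. `DEFLATER_MARKER` == 0x2):

```cpp
// Toy model of the racing owner transition (illustration only; the names
// mirror ObjectMonitor, but this is not the HotSpot implementation).
#include <atomic>
#include <cstdint>

constexpr intptr_t NO_OWNER = 0;         // stands in for nullptr
constexpr intptr_t DEFLATER_MARKER = 2;  // the 0x2 value from the assert

struct ToyMonitor {
  std::atomic<intptr_t> _owner{NO_OWNER};

  // Mimics the CAS inside set_owner_from(): returns the previous owner,
  // which the caller asserts to equal the expected old value.
  intptr_t try_set_owner_from(intptr_t old_value, intptr_t new_value) {
    intptr_t prev = old_value;
    _owner.compare_exchange_strong(prev, new_value);
    return prev;
  }
};
```

With two interleaved deflation passes, the first pass claims the monitor and the second pass observes `DEFLATER_MARKER` where it expected no owner, which is exactly the failing `prev == old_value` condition.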


My proposed fix for this is to stop deflating monitors from the safepointed VM_ThreadDump operation. This has the following benefits:

1) It solves this bug!
2) It removes work from a pause
3) It decouples the thread-dumping code from the Synchronizer code
4) It renames ObjectMonitorHashtable

FWIW, we are currently experimenting with a patch to put ObjectMonitors in a hashtable in preparation for Lilliput. So, I had already written most of the patch to support (2, 3, 4).

The patch removes the call to monitor deflation from the VM_ThreadDump operation; instead it walks the in-use list of ObjectMonitors and collects only the much smaller set of ObjectMonitors that have an owner.
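As a rough sketch of that collection step (the names below are hypothetical, not the actual patch), it reduces to a filter over the in-use list:

```cpp
// Hypothetical sketch: instead of deflating, the thread dump walks the
// in-use list and keeps only the monitors that currently have an owner.
#include <vector>

struct SimpleMonitor {
  void* owner;  // nullptr when the monitor is idle
};

std::vector<SimpleMonitor*> collect_owned(const std::vector<SimpleMonitor*>& in_use) {
  std::vector<SimpleMonitor*> owned;
  for (SimpleMonitor* m : in_use) {
    if (m->owner != nullptr) {
      owned.push_back(m);  // only owned monitors matter for deadlock checks
    }
  }
  return owned;
}
```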

Some arguments have been made that by deflating the monitors we can make the walk over the ObjectMonitors faster and that it helps other operations that have to visit them. In the included test (provided by Patricio) I can see that visiting the monitor lists takes a couple of tens of milliseconds when the list starts to grow into the millions. To not regress this situation I added a call to poke the MonitorDeflationThread if the thread-dumping code sees that we have more than 100000 ObjectMonitors. This number is arbitrarily chosen, but it's around the point where walking the list started to take more than a couple of milliseconds on my machine. Does this make sense? Do you want a named constant for this somewhere?
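The poke itself amounts to a simple threshold check; the constant and function name below are made up for illustration and are not the names used in the patch:

```cpp
// Hypothetical sketch of the heuristic: if the in-use list has grown past
// an arbitrary bound (~100000, where walking the list started to take more
// than a couple of milliseconds), wake the MonitorDeflationThread early
// rather than deflating inside the safepoint.
#include <cstddef>

constexpr size_t kMonitorPokeThreshold = 100000;

bool should_poke_deflation_thread(size_t in_use_count) {
  return in_use_count > kMonitorPokeThreshold;
}
```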

One thing to note about this patch: the thread-dumping code is put inside the VM_ThreadDump operation, which lives in the vmOperations.hpp/cpp files, and those files grow now that I've moved some of the existing code there. An alternative would be to move VM_ThreadDump to threadServices.hpp/cpp. If we want to make that change, I would prefer to do it as a separate patch that goes in either before or after this fix.

I've tested this with the provided jtreg test and with the test that originally reproduced the assert. I'm going to run this through our CI pipeline, but I'd like to get this PR out before that has been completed.

I've been measuring the performance of this fix with this little patch to extract the times of the thread-dumping code:
https://github.com/stefank/jdk/commit/e9086ac856602dc0957b8157ee568d0891e3402b

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16519#issuecomment-1794438582


More information about the hotspot-dev mailing list