RFR: 8318757 ObjectMonitor::deflate_monitor fails "assert(prev == old_value) failed: unexpected prev owner=0x0000000000000002, expected=0x0000000000000000"
Stefan Karlsson
stefank at openjdk.org
Mon Nov 6 09:53:36 UTC 2023
On Mon, 6 Nov 2023 09:46:11 GMT, Stefan Karlsson <stefank at openjdk.org> wrote:
> A safepointed monitor deflation pass can run interleaved with a paused async monitor deflation pass. The code is not written to handle that situation and asserts when it finds a DEFLATER_MARKER in the owner field. @pchilano also found other issues with having two monitor deflation passes interleaved. More info below ...
There are only two paths to run a safepointed monitor deflation pass:
1) VM_ThreadDump when it collects monitors and looks for deadlocks
2) In the "final audit" when we print information about interesting monitors in the system
This PR deals with (1), which was the cause of the asserts in the bug report.
The following describes what is happening and how we hit the assert:
The stack trace of the asserting safepointed monitor deflation code:
ObjectMonitor::deflate_monitor
ObjectSynchronizer::deflate_monitor_list
ObjectSynchronizer::deflate_idle_monitors
VM_ThreadDump::doit
VM_Operation::evaluate
VMThread::evaluate_operation
VMThread::inner_execute
And the stack trace of the MonitorDeflationThread performing the paused async monitor deflation:
SafepointSynchronize::block
SafepointMechanism::process
SafepointMechanism::process_if_requested
ThreadBlockInVMPreprocess<void (JavaThread*)>::~ThreadBlockInVMPreprocess
ThreadBlockInVM::~ThreadBlockInVM
ObjectSynchronizer::chk_for_block_req
ObjectSynchronizer::deflate_monitor_list
ObjectSynchronizer::deflate_idle_monitors
MonitorDeflationThread::monitor_deflation_thread_entry
The MonitorDeflationThread is inside the loop that deflates ObjectMonitors on the `_in_use_list`, but it hasn't yet reached the point where it unlinks the deflated monitors. The deflation loop periodically checks whether it is time to safepoint (`chk_for_block_req`). At some point it *is* time to block (the stack trace above), and the VMThread starts to run the VM_ThreadDump safepoint operation, which also tries to deflate monitors on the `_in_use_list`. This is what causes the assert to fail: the MonitorDeflationThread has set `_owner` to `DEFLATER_MARKER` (value 0x2) and released the `_object` OopHandle, so when the VMThread finds this ObjectMonitor it sees that `_object` is "cleared" and expects `_owner` to be nullptr, not 0x2.
The assert comes from this path:
    if (obj == nullptr) {
      // If the object died, we can recycle the monitor without racing with
      // Java threads. The GC already broke the association with the object.
      set_owner_from(nullptr, DEFLATER_MARKER);
My proposed fix for this is to stop deflating monitors from the safepointed VM_ThreadDump operation. This has the following benefits:
1) It solves this bug!
2) It removes work from a pause
3) It decouples the thread-dumping code from the Synchronizer code
4) It renames ObjectMonitorHashtable
FWIW, we are currently experimenting with a patch to put ObjectMonitors in a hashtable in preparation for Lilliput. So, I had already written most of the patch to support (2, 3, 4).
The patch removes the monitor deflation call from the VM_ThreadDump operation; instead, it walks the in-use list of ObjectMonitors and collects only the much more limited set of ObjectMonitors that have an owner.
Some arguments have been made that by deflating the monitors we can make the walk over the ObjectMonitors faster and that it helps other operations that have to visit them. In the included test (provided by Patricio) I can see that visiting the monitor lists takes a couple of tens of milliseconds when the list starts to grow into the millions. To not regress this situation I added a call to poke the MonitorDeflationThread if the thread-dumping code sees that we have more than 100000 ObjectMonitors. This number is arbitrarily chosen, but it's around the point where walking the list started to take more than a couple of milliseconds on my machine. Does this make sense? Do you want a named constant for this somewhere?
One thing to note from this patch is that the thread-dumping code is put inside the VM_ThreadDump operation, which lives in the vmOperations.hpp/cpp files, and that code grows now that I have moved some of the existing code there. An alternative would be to move VM_ThreadDump to threadServices.hpp/cpp. If we want to make that change, I would prefer to do it as a separate patch that goes in either before or after this fix.
I've tested this with the provided jtreg test and with the test that originally reproduced the assert. I'm going to run this through our CI pipeline, but I'd like to get this PR out before that has been completed.
I've been measuring the performance of this fix with this little patch to extract the times of the thread-dumping code:
https://github.com/stefank/jdk/commit/e9086ac856602dc0957b8157ee568d0891e3402b
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16519#issuecomment-1794438582
More information about the hotspot-dev mailing list