RFR (s): JDK-8013129 Possible deadlock with Metaspace locks due to mixed usage of safepoint aware and non-safepoint aware locking

Fri Apr 26 10:00:38 UTC 2013

On 25/04/2013 10:03 PM, Mikael Gerdin wrote:
> On 04/25/2013 01:57 PM, Coleen Phillimore wrote:
>>
>> I believe this change is correct.  I think if we elide safepoint locks
>> for a lock sometimes, we always have to elide safepoint checks for that
>> lock.   David will correct me if I'm wrong.
>
> I also believe that the change is correct.
> And I hope that David will chime in on this one.

Well .... I think that is accurate though I haven't quite seen it stated 
that way. The ability to lock without a safepoint check depends, as far 
as I understand it (and I'm rusty on the details, and the details are 
complex) on the threads using the lock (and their state) and whether the 
lock may be needed by the VMThread during a safepoint.

Really to see what is going wrong I need to see the stacks of the 
threads when the deadlock occurs, else it is just speculation as to what 
might have gone wrong. There is a hint (a strong one) in mutex.cpp:

// Lock without safepoint check - a degenerate variant of lock().
// Should ONLY be used by safepoint code and other code
// that is guaranteed not to block while running inside the VM. If this 
is called with
// thread state set to be in VM, the safepoint synchronization code will 
deadlock!

If a mutator tries to lock without a safepoint check and the lock is 
held, the thread will block but remain in the _thread_in_vm state. If a 
safepoint commences during that time the lock owner will not be able to 
release the lock if it hits a safepoint check first. Meanwhile the 
VMThread will be waiting for the blocked thread to reach a safepoint 
safe state - which it never will as it is physically blocked while still 
_thread_in_vm. Hence deadlock.

So the current change to check for safepoints is undoubtedly a safe 
alternative.

Now an interesting thing with these changes is that one only applies to 
the VMThread - so what does presence or absence of a safepoint check 
mean for the VMThread itself? There is only one difference in that case: 
locking with a safepoint check allows "lock sneaking". But I don't think 
I need to expand on that here - see mutex.cpp.

But this does raise a follow on concern: when the thread is not the 
VMThread, can it hit a safepoint or otherwise block in the region whilst 
holding the Metaspace lock? If the answer is yes then we still have a 
potential safepoint deadlock because the VMThread will not be able to 
acquire the lock during the safepoint VM_operation and will block.

Aside: this also makes me wonder about the safety of other 
lock-without-safepoint-check usages.

David
-----

> /Mikael
>
>>
>> Coleen
>>
>> On 4/25/2013 5:10 AM, Mikael Gerdin wrote:
>>> Hi,
>>>
>>> Problem:
>>> We've seen some hangs in the GC nightly testing recently.
>>> When I looked at the minidump files from some of those hangs they
>>> looked like safepoint deadlocks where one thread was parked in
>>> Mutex::lock_without_safepoint_check on one of the "Metaspace
>>> allocation locks".
>>>
>>> Both of the hangs I investigated also had threads trying to lock the
>>> same Metaspace lock but without eliding safepoint checks because they
>>> were originating from Metaspace::deallocate.
>>>
>>> I believe that since the change to allocate MethodCounters on demand
>>> and potentially deallocating them when racing this issue was brought
>>> to the surface because of more frequent calls to Metaspace::deallocate
>>> when not at a safepoint.
>>>
>>> I was able to reproduce the hang after about an hour by running an
>>> instrumented build where MethodCounters are allocated and then
>>> unconditionally deallocated on each entry to
>>> Method::build_method_counters.
>>>
>>> I can't describe the failure mode in detail since I'm not familiar
>>> with the Mutex code but I can imagine that the locking state machine
>>> is broken when we take completely different code paths for the same
>>> Mutex.
>>>
>>> Suggested fix:
>>> My suggested fix is to change Metaspace::deallocate to take the lock
>>> with Mutex::_no_safepoint_check_flag.
>>>
>>> With my fix I ran the test that managed to reproduce the failure
>>> overnight without reproducing the hang.
>>> I also ran the parallel class loading tests and nashorn's
>>> test262parallel for good measure.
>>>
>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8013129/webrev.0/
>>> JBS: https://jbs.oracle.com/bugs/browse/JDK-8013129
>>> bugs.sun.com: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8013129
>>>
>>> /Mikael
>>
>