RFR (s): JDK-8013129 Possible deadlock with Metaspace locks due to mixed usage of safepoint aware and non-safepoint aware locking

Fri Apr 26 05:08:24 PDT 2013

David,

On 2013-04-26 12:00, David Holmes wrote:
> On 25/04/2013 10:03 PM, Mikael Gerdin wrote:
>> On 04/25/2013 01:57 PM, Coleen Phillimore wrote:
>>>
>>> I believe this change is correct.  I think if we elide safepoint locks
>>> for a lock sometimes, we always have to elide safepoint checks for that
>>> lock.   David will correct me if I'm wrong.
>>
>> I also believe that the change is correct.
>> And I hope that David will chime in on this one.
>
> Well .... I think that is accurate though I haven't quite seen it stated
> that way. The ability to lock without a safepoint check depends, as far
> as I understand it (and I'm rusty on the details, and the details are
> complex) on the threads using the lock (and their state) and whether the
> lock may be needed by the VMThread during a safepoint.
>
> Really to see what is going wrong I need to see the stacks of the
> threads when the deadlock occurs, else it is just speculation as to what
> might have gone wrong. There is a hint (a strong one) in mutex.cpp:
>
> // Lock without safepoint check - a degenerate variant of lock().
> // Should ONLY be used by safepoint code and other code
> // that is guaranteed not to block while running inside the VM. If this
> is called with
> // thread state set to be in VM, the safepoint synchronization code will
> deadlock!
>
> If a mutator tries to lock without a safepoint check and the lock is
> held, the thread will block but remain in the _thread_in_vm state. If a
> safepoint commences during that time the lock owner will not be able to
> release the lock if it hits a safepoint check first. Meanwhile the
> VMThread will be waiting for the blocked thread to reach a safepoint
> safe state - which it never will as it is physically blocked while still
> _thread_in_vm. Hence deadlock.

This is what I saw in one of the hangs. The VM was trying to safepoint 
and one thread was waiting on the lock with _no_safepoint_check_flag.

This in itself should not be a problem since the last thread to hold the 
lock should have made sure that it woke up the waiting thread when 
leaving the critical region.

My theory is that mixing Mutex::lock and 
Mutex::lock_without_safepoint_check somehow leads to a thread leaving 
the lock without signaling that properly to the next in line.

I've dumped a bit of information on the hang in a text file at:
http://cr.openjdk.java.net/~mgerdin/8013129/deadlock-info.txt

>
> So the current change to check for safepoints is undoubtedly a safe
> alternative.
>
> Now an interesting thing with these changes is that one only applies to
> the VMThread - so what does presence or absence of a safepoint check
> mean for the VMThread itself? There is only one difference in that case:
> locking with a safepoint check allows "lock sneaking". But I don't think
> I need to expand on that here - see mutex.cpp.

The separate VMThread and non-vmthread code paths in this change are 
unnecessary and I thought Coleen had a change which was going to fix 
that but I may have been mistaken.

>
> But this does raise a follow on concern: when the thread is not the
> VMThread, can it hit a safepoint or otherwise block in the region whilst
> holding the Metaspace lock? If the answer is yes then we still have a
> potential safepoint deadlock because the VMThread will not be able to
> acquire the lock during the safepoint VM_operation and will block.

Currently we don't block in the allocation path and if we need to 
trigger a GC I believe we release the locks before starting a VM_operation.

>
> Aside: this also makes me wonder about the safety of other
> lock-without-safepoint-check usages.

Indeed, if there is a deadlock possiblility when mixing 
lock-without-safepoint-check and "normal" lock then we should consider 
enforcing that a lock is used only for one of the two paths, perhaps 
using the C++ compiler / type system.

/Mikael

>
>
> David
> -----
>
>> /Mikael
>>
>>>
>>> Coleen
>>>
>>> On 4/25/2013 5:10 AM, Mikael Gerdin wrote:
>>>> Hi,
>>>>
>>>> Problem:
>>>> We've seen some hangs in the GC nightly testing recently.
>>>> When I looked at the minidump files from some of those hangs they
>>>> looked like safepoint deadlocks where one thread was parked in
>>>> Mutex::lock_without_safepoint_check on one of the "Metaspace
>>>> allocation locks".
>>>>
>>>> Both of the hangs I investigated also had threads trying to lock the
>>>> same Metaspace lock but without eliding safepoint checks because they
>>>> were originating from Metaspace::deallocate.
>>>>
>>>> I believe that since the change to allocate MethodCounters on demand
>>>> and potentially deallocating them when racing this issue was brought
>>>> to the surface because of more frequent calls to Metaspace::deallocate
>>>> when not at a safepoint.
>>>>
>>>> I was able to reproduce the hang after about an hour by running an
>>>> instrumented build where MethodCounters are allocated and then
>>>> unconditionally deallocated on each entry to
>>>> Method::build_method_counters.
>>>>
>>>> I can't describe the failure mode in detail since I'm not familiar
>>>> with the Mutex code but I can imagine that the locking state machine
>>>> is broken when we take completely different code paths for the same
>>>> Mutex.
>>>>
>>>> Suggested fix:
>>>> My suggested fix is to change Metaspace::deallocate to take the lock
>>>> with Mutex::_no_safepoint_check_flag.
>>>>
>>>> With my fix I ran the test that managed to reproduce the failure
>>>> overnight without reproducing the hang.
>>>> I also ran the parallel class loading tests and nashorn's
>>>> test262parallel for good measure.
>>>>
>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8013129/webrev.0/
>>>> JBS: https://jbs.oracle.com/bugs/browse/JDK-8013129
>>>> bugs.sun.com:
>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8013129
>>>>
>>>> /Mikael
>>>
>>