RFR (s): JDK-8013129 Possible deadlock with Metaspace locks due to mixed usage of safepoint aware and non-safepoint aware locking

David Holmes david.holmes at oracle.com
Fri Apr 26 12:47:58 UTC 2013


On 26/04/2013 10:08 PM, Mikael Gerdin wrote:
> David,
>
> On 2013-04-26 12:00, David Holmes wrote:
>> On 25/04/2013 10:03 PM, Mikael Gerdin wrote:
>>> On 04/25/2013 01:57 PM, Coleen Phillimore wrote:
>>>>
>>>> I believe this change is correct.  I think if we elide safepoint checks
>>>> for a lock sometimes, we always have to elide safepoint checks for that
>>>> lock.  David will correct me if I'm wrong.
>>>
>>> I also believe that the change is correct.
>>> And I hope that David will chime in on this one.
>>
>> Well .... I think that is accurate though I haven't quite seen it stated
>> that way. The ability to lock without a safepoint check depends, as far
>> as I understand it (and I'm rusty on the details, and the details are
>> complex) on the threads using the lock (and their state) and whether the
>> lock may be needed by the VMThread during a safepoint.
>>
>> Really to see what is going wrong I need to see the stacks of the
>> threads when the deadlock occurs, else it is just speculation as to what
>> might have gone wrong. There is a hint (a strong one) in mutex.cpp:
>>
>> // Lock without safepoint check - a degenerate variant of lock().
>> // Should ONLY be used by safepoint code and other code
>> // that is guaranteed not to block while running inside the VM. If this
>> // is called with thread state set to be in VM, the safepoint
>> // synchronization code will deadlock!
>>
>> If a mutator tries to lock without a safepoint check and the lock is
>> held, the thread will block but remain in the _thread_in_vm state. If a
>> safepoint commences during that time the lock owner will not be able to
>> release the lock if it hits a safepoint check first. Meanwhile the
>> VMThread will be waiting for the blocked thread to reach a safepoint
>> safe state - which it never will as it is physically blocked while still
>> _thread_in_vm. Hence deadlock.
>
> This is what I saw in one of the hangs. The VM was trying to safepoint
> and one thread was waiting on the lock with _no_safepoint_check_flag.
>
> This in itself should not be a problem since the last thread to hold the
> lock should have made sure that it woke up the waiting thread when
> leaving the critical region.

The deadlock should be quite simple:

- the VMThread is initiating a safepoint and is waiting on thread B to 
become safepoint-safe
- thread A holds the lock and is blocked at the safepoint
- thread B tries to lock-without-safepoint-check and is blocked on the lock
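
To make that concrete, here is a minimal standalone C++11 analogue of 
the deadlock - a toy model with made-up names and a much simplified 
safepoint protocol, not HotSpot code; the sleeps just make the 
interleaving practically deterministic:

// Toy model of the three-way deadlock described above. Not HotSpot code:
// the names and the safepoint protocol are simplified stand-ins.
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

std::mutex        metaspace_lock;            // stands in for the Metaspace Monitor
std::atomic<bool> safepoint_requested{false};
std::atomic<int>  threads_at_safepoint{0};

// What a safepoint-checked lock() effectively does around blocking:
void safepoint_check() {
  if (safepoint_requested) {
    ++threads_at_safepoint;                  // report "safepoint-safe"
    while (safepoint_requested) {}           // park until the safepoint ends
    --threads_at_safepoint;
  }
}

int main() {
  std::thread a([] {                         // thread A: locks WITH safepoint check
    metaspace_lock.lock();
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    safepoint_check();                       // parks at the safepoint, lock held
    metaspace_lock.unlock();                 // never reached
  });
  std::thread vm([] {                        // VMThread: initiates the safepoint
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    safepoint_requested = true;
    while (threads_at_safepoint < 2) {}      // waits for A *and* B; B never arrives
    safepoint_requested = false;
  });
  std::thread b([] {                         // thread B: locks WITHOUT safepoint check
    std::this_thread::sleep_for(std::chrono::milliseconds(150));
    metaspace_lock.lock();                   // blocks on the held lock, still "in VM"
    metaspace_lock.unlock();
  });
  a.join(); vm.join(); b.join();             // hangs: the three-way deadlock
  return 0;
}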

> My theory is that mixing Mutex::lock and
> Mutex::lock_without_safepoint_check somehow leads to a thread leaving
> the lock without signaling that properly to the next in line.
> I've dumped a bit of information on the hang in a text file at:
> http://cr.openjdk.java.net/~mgerdin/8013129/deadlock-info.txt

So here is what happens. One thread is blocked acquiring the lock 
with a safepoint check. The owning thread releases the lock and wakes 
the first thread. That thread acquires the lock, clears OnDeck and 
returns from ILock, at which point the ThreadBlockInVM destructor is 
called. This results in the thread blocking at the safepoint. But note 
that the owner field is still NULL at this stage. Meanwhile another 
thread has done a lock-without-safepoint-check, which goes straight to 
ILock (which doesn't even look at the owner field) and blocks because 
the low-level lock is indeed held.

Hence deadlock.
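
In code terms the window looks roughly like this - a paraphrase of the 
shape of the two acquire paths implied above, not the literal mutex.cpp 
source, and not compilable on its own:

// Paraphrased shape of the two acquire paths, reconstructed from the
// description above - not the literal mutex.cpp source.
void Monitor::lock(Thread* self) {
  // ... fast path elided ...
  {
    ThreadBlockInVM tbivm((JavaThread*)self);
    ILock(self);                  // low-level acquire, clears OnDeck
  }                               // <- ThreadBlockInVM destructor runs here
                                  //    and can park the thread at a safepoint
                                  //    while the low-level lock is held and
                                  //    _owner is still NULL
  set_owner(self);
}

void Monitor::lock_without_safepoint_check() {
  ILock(Thread::current());       // straight to ILock: never looks at _owner,
  set_owner(Thread::current());   // blocks on the low-level lock held above
}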

>> So the current change to check for safepoints is undoubtedly a safe
>> alternative.
>>
>> Now an interesting thing with these changes is that one only applies to
>> the VMThread - so what does presence or absence of a safepoint check
>> mean for the VMThread itself? There is only one difference in that case:
>> locking with a safepoint check allows "lock sneaking". But I don't think
>> I need to expand on that here - see mutex.cpp.
>
> The separate VMThread and non-vmthread code paths in this change are
> unnecessary and I thought Coleen had a change which was going to fix
> that but I may have been mistaken.
>
>>
>> But this does raise a follow on concern: when the thread is not the
>> VMThread, can it hit a safepoint or otherwise block in the region whilst
>> holding the Metaspace lock? If the answer is yes then we still have a
>> potential safepoint deadlock because the VMThread will not be able to
>> acquire the lock during the safepoint VM_operation and will block.
>
> Currently we don't block in the allocation path and if we need to
> trigger a GC I believe we release the locks before starting a VM_operation.

What about the deallocation path? The thing is that all critical 
sections involving this lock must be non-blocking if the VMThread also 
needs the lock.
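
For reference, the suggested fix amounts to something like the 
following - sketched from the description in this thread, not copied 
from the webrev:

// Sketch of the suggested fix, reconstructed from this thread (not the
// webrev itself). Take the lock the same way on both the allocation and
// deallocation paths so the two acquire variants never mix.
void Metaspace::deallocate(MetaWord* ptr, size_t word_size, bool is_class) {
  MutexLockerEx ml(vsm()->lock(), Mutex::_no_safepoint_check_flag);
  // ... return the block to the free lists; must not block in here,
  // since the VMThread also needs this lock ...
}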

>> Aside: this also makes me wonder about the safety of other
>> lock-without-safepoint-check usages.
>
> Indeed, if there is a deadlock possibility when mixing
> lock-without-safepoint-check and "normal" lock then we should consider
> enforcing that a lock is used only for one of the two paths, perhaps
> using the C++ compiler / type system.

The constraint is only for JavaThreads that must participate in the 
safepoint. Or, to state that another way, as it says in mutex.hpp, 
no thread that is _thread_in_vm should ever lock without a safepoint 
check. I wonder why we don't assert that in lock_without_safepoint_check?
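
Something along these lines, presumably - a sketch of the suggested 
assert, which does not exist in the code today:

// Sketch of the suggested assert (not in the code today): enforce that
// no thread that is _thread_in_vm ever locks without a safepoint check.
void Monitor::lock_without_safepoint_check() {
  Thread* self = Thread::current();
  assert(!self->is_Java_thread() ||
         ((JavaThread*)self)->thread_state() != _thread_in_vm,
         "lock_without_safepoint_check while _thread_in_vm; "
         "safepoint synchronization would deadlock");
  // ... existing ILock-based acquire ...
}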

David
-----

> /Mikael
>
>>
>>
>> David
>> -----
>>
>>> /Mikael
>>>
>>>>
>>>> Coleen
>>>>
>>>> On 4/25/2013 5:10 AM, Mikael Gerdin wrote:
>>>>> Hi,
>>>>>
>>>>> Problem:
>>>>> We've seen some hangs in the GC nightly testing recently.
>>>>> When I looked at the minidump files from some of those hangs they
>>>>> looked like safepoint deadlocks where one thread was parked in
>>>>> Mutex::lock_without_safepoint_check on one of the "Metaspace
>>>>> allocation locks".
>>>>>
>>>>> Both of the hangs I investigated also had threads trying to lock the
>>>>> same Metaspace lock but without eliding safepoint checks because they
>>>>> were originating from Metaspace::deallocate.
>>>>>
>>>>> I believe that the change to allocate MethodCounters on demand,
>>>>> and to potentially deallocate them when threads race, brought this
>>>>> issue to the surface because it causes more frequent calls to
>>>>> Metaspace::deallocate when not at a safepoint.
>>>>>
>>>>> I was able to reproduce the hang after about an hour by running an
>>>>> instrumented build where MethodCounters are allocated and then
>>>>> unconditionally deallocated on each entry to
>>>>> Method::build_method_counters.
>>>>>
>>>>> I can't describe the failure mode in detail since I'm not familiar
>>>>> with the Mutex code, but I can imagine that the locking state machine
>>>>> is broken when we take completely different code paths for the same
>>>>> Mutex.
>>>>>
>>>>> Suggested fix:
>>>>> My suggested fix is to change Metaspace::deallocate to take the lock
>>>>> with Mutex::_no_safepoint_check_flag.
>>>>>
>>>>> With my fix I ran the test that managed to reproduce the failure
>>>>> overnight without reproducing the hang.
>>>>> I also ran the parallel class loading tests and nashorn's
>>>>> test262parallel for good measure.
>>>>>
>>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8013129/webrev.0/
>>>>> JBS: https://jbs.oracle.com/bugs/browse/JDK-8013129
>>>>> bugs.sun.com:
>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8013129
>>>>>
>>>>> /Mikael
>>>>
>>>


