RFR (s): JDK-8013129 Possible deadlock with Metaspace locks due to mixed usage of safepoint aware and non-safepoint aware locking

Mikael Gerdin mikael.gerdin at oracle.com
Fri Apr 26 07:57:08 PDT 2013


David,

On 2013-04-26 14:47, David Holmes wrote:
> On 26/04/2013 10:08 PM, Mikael Gerdin wrote:
>> David,
>>
>> On 2013-04-26 12:00, David Holmes wrote:
>>> On 25/04/2013 10:03 PM, Mikael Gerdin wrote:
>>>> On 04/25/2013 01:57 PM, Coleen Phillimore wrote:
>>>>>
>>>>> I believe this change is correct.  I think if we elide safepoint
>>>>> checks for a lock sometimes, we always have to elide safepoint
>>>>> checks for that lock.  David will correct me if I'm wrong.
>>>>
>>>> I also believe that the change is correct.
>>>> And I hope that David will chime in on this one.
>>>
>>> Well .... I think that is accurate though I haven't quite seen it stated
>>> that way. The ability to lock without a safepoint check depends, as far
>>> as I understand it (and I'm rusty on the details, and the details are
>>> complex) on the threads using the lock (and their state) and whether the
>>> lock may be needed by the VMThread during a safepoint.
>>>
>>> Really to see what is going wrong I need to see the stacks of the
>>> threads when the deadlock occurs, else it is just speculation as to what
>>> might have gone wrong. There is a hint (a strong one) in mutex.cpp:
>>>
>>> // Lock without safepoint check - a degenerate variant of lock().
>>> // Should ONLY be used by safepoint code and other code
>>> // that is guaranteed not to block while running inside the VM. If
>>> // this is called with thread state set to be in VM, the safepoint
>>> // synchronization code will deadlock!
>>>
>>> If a mutator tries to lock without a safepoint check and the lock is
>>> held, the thread will block but remain in the _thread_in_vm state. If a
>>> safepoint commences during that time the lock owner will not be able to
>>> release the lock if it hits a safepoint check first. Meanwhile the
>>> VMThread will be waiting for the blocked thread to reach a safepoint
>>> safe state - which it never will as it is physically blocked while still
>>> _thread_in_vm. Hence deadlock.
>>
>> This is what I saw in one of the hangs. The VM was trying to safepoint
>> and one thread was waiting on the lock with _no_safepoint_check_flag.
>>
>> This in itself should not be a problem since the last thread to hold the
>> lock should have made sure that it woke up the waiting thread when
>> leaving the critical region.
>
> The deadlock should be quite simple:
>
> - vmThread is initiating safepoint and is waiting on thread B to be
> safepoint-safe
> - thread A holds lock and is blocked at safepoint
> - thread B tried to lock-without-safepoint-check and is blocked on the lock

I think that's the same conclusion I came to in the end as well.
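
Just to make the mix concrete (metaspace_lock is a stand-in name for 
the Metaspace allocation lock, not the actual field):

  // Allocation path: takes the lock *without* a safepoint check,
  // while the JavaThread is _thread_in_vm.
  MutexLockerEx ml(metaspace_lock, Mutex::_no_safepoint_check_flag);

  // Metaspace::deallocate, before this fix: takes the *same* lock,
  // but *with* a safepoint check (the default).
  MutexLockerEx ml(metaspace_lock);

It's this mix on the same Mutex that sets up the deadlock you describe.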

>
>> My theory is that mixing Mutex::lock and
>> Mutex::lock_without_safepoint_check somehow leads to a thread leaving
>> the lock without signaling that properly to the next in line.
>> I've dumped a bit of information on the hang in a text file at:
>> http://cr.openjdk.java.net/~mgerdin/8013129/deadlock-info.txt
>
> So here is what happens. One thread is blocked acquiring the lock
> with a safepoint check. The owning thread releases the lock and wakes
> the first thread. That thread acquires the lock, clears OnDeck and
> returns from ILock at which point the ThreadBlockInVM destructor is
> called. This results in the thread blocking at the safepoint. But note
> that owner is still NULL at this stage. Meanwhile another thread has
> done a lock-without-safepoint-check which goes straight to ILock (which
> doesn't even look at owner) and blocks because the low-level lock is
> indeed held.
>
> Hence deadlock.

Thanks for this detailed description David.
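
Restating it as a timeline for the archives (T1 locks with a safepoint 
check, T2 without one):

 1. T1 blocks in ILock() waiting for the lock.
 2. The owner releases the lock and wakes T1.
 3. T1 takes the low-level lock, clears OnDeck and returns from
    ILock(); owner is still NULL at this point.
 4. The ThreadBlockInVM destructor runs and T1 blocks at the pending
    safepoint, still holding the low-level lock.
 5. T2 calls lock_without_safepoint_check(), goes straight to ILock()
    (which never looks at owner) and parks while _thread_in_vm.
 6. The VMThread waits forever for T2 to become safepoint-safe.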

>
>>> So the current change to check for safepoints is undoubtedly a safe
>>> alternative.
>>>
>>> Now an interesting thing with these changes is that one only applies to
>>> the VMThread - so what does presence or absence of a safepoint check
>>> mean for the VMThread itself? There is only one difference in that case:
>>> locking with a safepoint check allows "lock sneaking". But I don't think
>>> I need to expand on that here - see mutex.cpp.
>>
>> The separate VMThread and non-vmthread code paths in this change are
>> unnecessary and I thought Coleen had a change which was going to fix
>> that but I may have been mistaken.
>>
>>>
>>> But this does raise a follow on concern: when the thread is not the
>>> VMThread, can it hit a safepoint or otherwise block in the region whilst
>>> holding the Metaspace lock? If the answer is yes then we still have a
>>> potential safepoint deadlock because the VMThread will not be able to
>>> acquire the lock during the safepoint VM_operation and will block.
>>
>> Currently we don't block in the allocation path and if we need to
>> trigger a GC I believe we release the locks before starting a
>> VM_operation.
>
> What about the deallocation path? The thing is that all critical
> sections involving this lock must be non-blocking if the VMThread also
> needs the lock.

The deallocation path is also non-blocking.
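
With the fix it looks roughly like this (the accessor names here are 
from memory, see the webrev for the actual change):

void Metaspace::deallocate(MetaWord* ptr, size_t word_size, bool is_class) {
  // Same locking discipline as the allocation path: no safepoint
  // check, and nothing blocks while the lock is held.
  MutexLockerEx ml(vsm()->lock(), Mutex::_no_safepoint_check_flag);
  // ... return the block to the free lists ...
}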

>
>>> Aside: this also makes me wonder about the safety of other
>>> lock-without-safepoint-check usages.
>>
>> Indeed, if there is a deadlock possibility when mixing
>> lock-without-safepoint-check and "normal" lock then we should consider
>> enforcing that a lock is used only for one of the two paths, perhaps
>> using the C++ compiler / type system.
>
> The constraint is only for JavaThreads that must participate in the
> safepoint. Or, to put it another way, as it says in mutex.hpp,
> no thread that is _thread_in_vm should ever lock without a safepoint
> check. I wonder why we don't assert that in lock_without_safepoint_check?

Currently every metaspace allocation will take the metaspace allocation 
lock with lock_without_safepoint_check while the thread is 
_thread_in_vm. I can't say if any other locks are used that way, but 
since we don't assert it I'm sure someone else does it as well.
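
Something like this in lock_without_safepoint_check would catch them 
(just a sketch, the exact form of the state check is from memory):

  assert(!Thread::current()->is_Java_thread() ||
         ((JavaThread*)Thread::current())->thread_state() != _thread_in_vm,
         "locking without a safepoint check while _thread_in_vm");

Of course, with the metaspace allocation path as it is today, that 
assert would fire immediately.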

If we could have separate Mutex/MutexLocker classes for locks which 
have safepoint checks and locks which don't, we would at least get rid 
of the problem of mixing them on the same lock.
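
Just to sketch what I mean (the names are made up and the constructor 
plumbing is omitted):

class SafepointCheckedMutex : private Mutex {
 public:
  // Only the safepoint-checking path is reachable through this type.
  void lock()   { Mutex::lock(); }
  void unlock() { Mutex::unlock(); }
};

class NoSafepointCheckMutex : private Mutex {
 public:
  // Only the non-checking path is reachable through this type.
  void lock()   { Mutex::lock_without_safepoint_check(); }
  void unlock() { Mutex::unlock(); }
};

That way a call site could not accidentally take the same lock both 
with and without a safepoint check.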

/Mikael

>
> David
> -----
>
>> /Mikael
>>
>>>
>>>
>>> David
>>> -----
>>>
>>>> /Mikael
>>>>
>>>>>
>>>>> Coleen
>>>>>
>>>>> On 4/25/2013 5:10 AM, Mikael Gerdin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Problem:
>>>>>> We've seen some hangs in the GC nightly testing recently.
>>>>>> When I looked at the minidump files from some of those hangs they
>>>>>> looked like safepoint deadlocks where one thread was parked in
>>>>>> Mutex::lock_without_safepoint_check on one of the "Metaspace
>>>>>> allocation locks".
>>>>>>
>>>>>> Both of the hangs I investigated also had threads trying to lock the
>>>>>> same Metaspace lock but without eliding safepoint checks because they
>>>>>> were originating from Metaspace::deallocate.
>>>>>>
>>>>>> I believe that the change to allocate MethodCounters on demand,
>>>>>> and to potentially deallocate them when racing, brought this
>>>>>> issue to the surface because of more frequent calls to
>>>>>> Metaspace::deallocate when not at a safepoint.
>>>>>>
>>>>>> I was able to reproduce the hang after about an hour by running an
>>>>>> instrumented build where MethodCounters are allocated and then
>>>>>> unconditionally deallocated on each entry to
>>>>>> Method::build_method_counters.
>>>>>>
>>>>>> I can't describe the failure mode in detail since I'm not familiar
>>>>>> with the Mutex code but I can imagine that the locking state machine
>>>>>> is broken when we take completely different code paths for the same
>>>>>> Mutex.
>>>>>>
>>>>>> Suggested fix:
>>>>>> My suggested fix is to change Metaspace::deallocate to take the lock
>>>>>> with Mutex::_no_safepoint_check_flag.
>>>>>>
>>>>>> With my fix I ran the test that managed to reproduce the failure
>>>>>> overnight without reproducing the hang.
>>>>>> I also ran the parallel class loading tests and nashorn's
>>>>>> test262parallel for good measure.
>>>>>>
>>>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8013129/webrev.0/
>>>>>> JBS: https://jbs.oracle.com/bugs/browse/JDK-8013129
>>>>>> bugs.sun.com:
>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8013129
>>>>>>
>>>>>> /Mikael
>>>>>
>>>>

