RFR (s): JDK-8013129 Possible deadlock with Metaspace locks due to mixed usage of safepoint aware and non-safepoint aware locking

Thu Apr 25 15:24:17 UTC 2013

On 4/25/2013 5:03 AM, Mikael Gerdin wrote:
> Coleen,
>
> On 04/25/2013 01:57 PM, Coleen Phillimore wrote:
>>
>> Mikael,
>>
>> I believe this change is correct.  I think if we elide safepoint checks
>> for a lock sometimes, we always have to elide safepoint checks for that
>> lock.  David will correct me if I'm wrong.

I think this is a sufficient strategy and it's what I've followed
in the past.

So I agree with the change.
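
To make the rule concrete, here is a minimal sketch of the "always or
never" discipline. It is a toy model and not HotSpot code: ModeCheckedLock,
SafepointCheck and deallocate_mode_is_consistent are hypothetical names
invented for illustration, and the real Mutex does far more than this. The
lock remembers the mode of its first acquisition and refuses any later
acquisition in a different mode, which is the mixed usage Mikael describes
between the allocation and deallocation paths.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model, not HotSpot code: these names are made up for
// illustration and do not correspond to the real Mutex implementation.
enum class SafepointCheck { kUnset, kAlways, kNever };

class ModeCheckedLock {
 public:
  // Acquire the lock, recording the mode on first use. Returns false and
  // does not acquire if a later caller uses a different mode.
  bool lock(SafepointCheck mode) {
    {
      std::lock_guard<std::mutex> guard(mode_guard_);
      if (mode_ == SafepointCheck::kUnset) mode_ = mode;
      if (mode_ != mode) return false;  // mixed usage detected
    }
    mutex_.lock();
    return true;
  }

  void unlock() { mutex_.unlock(); }

 private:
  std::mutex mutex_;       // the lock being modelled
  std::mutex mode_guard_;  // protects mode_
  SafepointCheck mode_ = SafepointCheck::kUnset;
};

// Models an allocation path that elides the safepoint check, followed by a
// deallocation path taken in the given mode. Returns whether the second
// acquisition was accepted.
bool deallocate_mode_is_consistent(SafepointCheck deallocate_mode) {
  ModeCheckedLock metaspace_lock;
  bool ok = metaspace_lock.lock(SafepointCheck::kNever);  // allocation path
  assert(ok);
  metaspace_lock.unlock();
  ok = metaspace_lock.lock(deallocate_mode);              // deallocation path
  if (ok) metaspace_lock.unlock();
  return ok;
}
```

In this model, deallocating with kAlways after allocating with kNever is
rejected, while using kNever on both paths is accepted. That mirrors the
shape of the proposed fix, under the assumptions stated above.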

Jon

>
> I also believe that the change is correct.
> And I hope that David will chime in on this one.
>
> /Mikael
>
>>
>> Coleen
>>
>> On 4/25/2013 5:10 AM, Mikael Gerdin wrote:
>>> Hi,
>>>
>>> Problem:
>>> We've seen some hangs in the GC nightly testing recently.
>>> When I looked at the minidump files from some of those hangs they
>>> looked like safepoint deadlocks where one thread was parked in
>>> Mutex::lock_without_safepoint_check on one of the "Metaspace
>>> allocation locks".
>>>
>>> Both of the hangs I investigated also had threads trying to lock the
>>> same Metaspace lock but without eliding safepoint checks because they
>>> were originating from Metaspace::deallocate.
>>>
>>> I believe this issue was brought to the surface by the change to
>>> allocate MethodCounters on demand and potentially deallocate them
>>> when racing, because it leads to more frequent calls to
>>> Metaspace::deallocate when not at a safepoint.
>>>
>>> I was able to reproduce the hang after about an hour by running an
>>> instrumented build where MethodCounters are allocated and then
>>> unconditionally deallocated on each entry to
>>> Method::build_method_counters.
>>>
>>> I can't describe the failure mode in detail since I'm not familiar
>>> with the Mutex code but I can imagine that the locking state machine
>>> is broken when we take completely different code paths for the same
>>> Mutex.
>>>
>>> Suggested fix:
>>> My suggested fix is to change Metaspace::deallocate to take the lock
>>> with Mutex::_no_safepoint_check_flag.
>>>
>>> With my fix applied, I ran the test that had reproduced the failure
>>> overnight, and the hang did not recur.
>>> I also ran the parallel class loading tests and nashorn's
>>> test262parallel for good measure.
>>>
>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8013129/webrev.0/
>>> JBS: https://jbs.oracle.com/bugs/browse/JDK-8013129
>>> bugs.sun.com: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8013129
>>>
>>> /Mikael
>>
>