RFR (S/M) JDK-8010196 NPG: Internal Error: Metaspace allocation lock -- possible deadlock

Mikael Gerdin mikael.gerdin at oracle.com
Tue Apr 9 15:19:46 UTC 2013



On 2013-04-09 16:48, Coleen Phillimore wrote:
>
> On 04/09/2013 04:54 AM, Mikael Gerdin wrote:
>>
>>
>> On 2013-04-09 06:13, David Holmes wrote:
>>> On 9/04/2013 1:58 PM, Coleen Phillimore wrote:
>>>>
>>>> Minor correction below.
>>>>
>>>> On 4/8/2013 11:20 PM, Coleen Phillimore wrote:
>>>>> Mikael,
>>>>> This looks good including the refactoring.
>>>>>
>>>>> On 4/8/2013 8:17 PM, David Holmes wrote:
>>>>>> Hi Mikael,
>>>>>>
>>>>>> This approach sounds (and looks) reasonable to me, but I'll leav eit
>>>>>> to others to officially Review.
>>>>>>
>>>>>> One concern I did have was whether the initialization of the array
>>>>>> could fail and what that would mean for how this particular cld was
>>>>>> used. That led me to the following code which looks suspect to me:
>>>>>>
>>>>>> // ClassLoaderData into the java/lang/ClassLoader object as a hidden
>>>>>> field
>>>>>> ClassLoaderData* ClassLoaderDataGraph::add(ClassLoaderData**
>>>>>> cld_addr, Handle loader, TRAPS) {
>>>>>>   ...
>>>>>>   ClassLoaderData* cld = new ClassLoaderData(loader, is_anonymous);
>>>>>>
>>>>>>   if (cld_addr != NULL) {
>>>>>>     // First, Atomically set it
>>>>>>     ClassLoaderData* old = (ClassLoaderData*)
>>>>>> Atomic::cmpxchg_ptr(cld, cld_addr, NULL);
>>>>>>     if (old != NULL) {
>>>>>>       delete cld;
>>>>>>       // Returns the data.
>>>>>>       return old;
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   // We won the race, and therefore the task of adding the data to
>>>>>> the list of
>>>>>>   // class loader data
>>>>>>   do {
>>>>>>     cld->set_next(next);
>>>>>>     ClassLoaderData* exchanged =
>>>>>> (ClassLoaderData*)Atomic::cmpxchg_ptr(cld, list_head, next);
>>>>>>     if (exchanged == next) {
>>>>>>       ...
>>>>>>       // Create dependencies after the CLD is added to the list.
>>>>>> Otherwise,
>>>>>>       // the GC GC will not find the CLD and the _class_loader field
>>>>>> will
>>>>>>       // not be updated.
>>>>>>       cld->init_dependencies(CHECK_NULL);
>>>>>>       return cld;
>>>>>>     }
>>>>>>     next = exchanged;
>>>>>>   } while (true);
>>>>>> }
>>>>>>
>>>>>> So it explicitly states that we create dependencies after adding to
>>>>>> the list, but if creating the dependencies fails we return null, but
>>>>>> leave the new cld in the list. Is that a valid thing to do ??
>>>>>>
>>>>>
>>>>> Not really.  If we hit OOM here there isn't a good recovery. We
>>>>> shouldn't return NULL in any case.   I will file a bug.  In the
>>>>> meantime, Mikael can you change it to CATCH_NULL and fix the comment
>>>>> above (remove "the GC" before "GC")?
>>>>
>>>> David is right, it can return null with the pending exception but
>>>> should
>>>> be removed from the list and from the class_loader hidden field in this
>>>> case.   This function is dangerous because it has unhandled oops (the
>>>> address passed in).  For now, I think you can leave it Mikael.  It'll
>>>> throw the OOM anyway.
>>>
>>> So getting back to the original fix my concern is whether after
>>> init_dependencies fails, can we ever call add_dependency for that cld
>>> and so attempt to lock on a NULL object?
>>
>> If _dependencies is NULL when we call record_dependency I have a hard
>> time seeing how the current code would avoid crashing.
>>
>> In fact, it looks pretty likely that we'll crash if we ever fail
>> init_dependencies.
>
> If we fail init_dependencies, we return a NULL CLD to the caller with
> the exception pending.  That'll do a CHECK_NULL() and the OOM will pop
> back out to some caller in Java.   But the CLD will still be attached to
> the class_loader() oop.
>
> It should be removed from the CLD graph and unattached to the class
> loader oop.  Unfortunately the former needs more locking than this CAS
> that we have.
>
> Do you want to fix this for your bug fix or file a new bug?

It would be good if we fix this with a new bug.
I'm already doing both a small refactor and a bug fix in the same one :)

/Mikael

>
> Coleen
>
>>
>>
>> If we survive all the way to record_dependency the current version of
>> the code looks like it will assert twice in debug builds and then SEGV
>> at the call to last->obj_at{_put} in locked_add_dependency.
>>
>> /Mikael
>>
>>>
>>> David
>>>
>>>> Coleen
>>>>
>>>>>
>>>>> Coleen
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>> On 8/04/2013 11:49 PM, Mikael Gerdin wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> The problem is that when running the G1 garbage collector a call to
>>>>>>> objArrayOopDesc::obj_at_put can in rare cases cause the G1 dirty
>>>>>>> card
>>>>>>> queue buffer to fill up and this will cause G1 to take
>>>>>>> DirtyCardQ_CBL_mon/16 to return the full buffer and get a new one.
>>>>>>>
>>>>>>> Adding dependencies to a ClassLoaderData is currently protected
>>>>>>> by the
>>>>>>> per-metaspace "Metaspace allocation lock"/5 (which protects more
>>>>>>> than
>>>>>>> just allocations).
>>>>>>>
>>>>>>> Because I want to avoid messing around with the lock ordering I
>>>>>>> suggest
>>>>>>> that we use an ObjectLocker on the _dependencies oop in
>>>>>>> ClassLoaderData.
>>>>>>> _dependencies is a 2-element objArrayOop, in effect it's an ad-hoc
>>>>>>> linked list of ClassLoaders/Classes which must be kept alive by this
>>>>>>> CLD.
>>>>>>>
>>>>>>> Using an ObjectLocker on the head element of the linked list
>>>>>>> should be
>>>>>>> at least as good as using the metaspace_lock(). There should not be
>>>>>>> any
>>>>>>> new deadlock issues with VM mutexes since any application thread
>>>>>>> could
>>>>>>> use a similar synchronized linked list construct.
>>>>>>>
>>>>>>> As a bonus I suggest that we factor out the dependency handling from
>>>>>>> ClassLoaderData into a inner class: ClassLoaderData::Dependencies
>>>>>>> and
>>>>>>> let Dependencies manage the head of the linked list by itself. This
>>>>>>> should make it more clear that the dependency list is not guarded by
>>>>>>> the
>>>>>>> metaspace lock anymore.
>>>>>>>
>>>>>>> Buglinks:
>>>>>>> https://jbs.oracle.com/bugs/browse/JDK-8010196
>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8010196
>>>>>>>
>>>>>>> Testing:
>>>>>>> default jprt run
>>>>>>> jdk/test/java/lang/invoke/BigArityTest.java (which could
>>>>>>> reproduce the
>>>>>>> issue with -XX:+UseG1GC -XX:G1UpdateBufferSize=1)
>>>>>>>
>>>>>>> Webrevs:
>>>>>>> Incremental:
>>>>>>> http://cr.openjdk.java.net/~mgerdin/8010196/webrev.0-fix/webrev
>>>>>>> http://cr.openjdk.java.net/~mgerdin/8010196/webrev.0-factor-out/webrev
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~mgerdin/8010196/webrev.0-rename/webrev
>>>>>>> Everything:
>>>>>>> http://cr.openjdk.java.net/~mgerdin/8010196/webrev.0/webrev
>>>>>>>
>>>>>>> Thanks
>>>>>>> /Mikael
>>>>>
>>>>
>



More information about the hotspot-gc-dev mailing list