RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking
David Holmes
david.holmes at oracle.com
Thu Feb 16 02:03:15 UTC 2017
On 16/02/2017 1:07 AM, Erik Helin wrote:
> On 02/15/2017 02:48 AM, David Holmes wrote:
>> Hi Erik,
>
> Hi David,
>
> thanks for having a look! Please see new patches at:
> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
Updates look fine to me.
>> oops_do can miss an oop that is in the process of being added - is that
>> a problem?
>
> Nope, that is not an issue. Even if `add_handle` and `oops_do` where
> synchronized with each other (for example by using lock), you can still
> have the scenario where a GC thread first runs (and finishes) `oops_do`
> and then `add_handle` is called (a linearized execution).
>
> This scenario is something the concurrent marking algorithms already
> should be prepared to handle.
I'm unclear about the exact hand-over process for the oop. Before the
add the oop will be found through some existing path - presumably the
Handle - and after the add it will be found through oops_do. But at some
point that "existing path" will no longer have the oop, and it will be
assumed to be found via the oops_do. The success of that seems to depend
on what order the concurrent marking checks things. Or it may be that
once concurrent marking starts the Handle can't lose the oop without
concurrent marking being informed?
The scenario I'm concerned about is this:
- add() is executing concurrently with oops_do
- oops_do reads _head just before new Chunk is added
- oops_do thread is preempted
- add() completes and call chain unwinds and at some point the Handle is
released
- oops_do continues but has missed the new oop
- Handle traversal continues but the oop is no longer there
Thanks,
David
-----
>> 140 if (c->data[i] != NULL) {
>> 141 f->do_oop(&c->data[i]);
>> 142 }
>>
>> Given a slot can be nulled out concurrently at any time, is it worth
>> does this NULL check? The OopClosure will have to do its own NULL check
>> anyway.
>
> Sure, this was mostly to be a bit consistent with JNIHandleBlock (it
> does the same NULL check). Think of it as an optimization, there is no
> reason to call `f->do_oop` if `c->data[i] == NULL`. Do you think it
> reads better if the NULL check is removed? I don't have a strong opinion
> on this one.
>
>> 144 c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
>> intptr_t*)&c->next);
>>
>> This doesn't need to be a load-acquire. You already loaded 'c' via
>> load-acquire of _head (or chained therefrom) and its next field is set
>> prior to the setting of the _head that you read.
>
> Agree. I just discussed this with Erik Ö as well, and we all agree on
> this. I also removed the `volatile` specifier for `next`.
>
>> 624 jobject ClassLoaderData::add_handle(Handle h) {
>> 625 MutexLockerEx ml(metaspace_lock(),
>> Mutex::_no_safepoint_check_flag);
>> 626 return (jobject) _handles.add(h());
>> 627 }
>> 628
>> 629 void ClassLoaderData::remove_handle_unsafe(jobject h) {
>> 630 assert(_handles.contains((oop*) h), "Got unexpected handle "
>> PTR_FORMAT, p2i((oop*) h));
>> 631 *((oop*) h) = NULL;
>> 632 }
>>
>> I'm a bit unclear on the typing here. Previously we use a JNI Handle
>> which was a jobject. But now we've simply stored an oop into a slot in
>> an array. We pass the address of that slot back as a jobject, but is it
>> really?
>
> So, this is a bit confusing, and I discussed this with Coleen a few days
> ago. jobject was used originally used because that is what
> `JNIHandleBlock::allocate_handle` returns. Using something other than
> jobject means updating `ModuleEntry` and in particular users
> `ModuleEntry::_pd`. This is not something we are keen on doing right now
> for JDK 9, and Coleen is planning to clean up how the runtime handles
> oops for 10 (or at least improve the situation).
>
> I don't know what jobject is meant to represent, in the code it is just
> an empty class (an opaque type).
>
> ClassLoaderData::add_handle could just as well return an oop*, but AFAIU
> we try to avoid oops and oop* in the runtime code (because that might
> mean that some code forgot to tell the GC about the oop). Maybe I'm
> wrong here?
>
>> I would also expect contains to take an oop not an oop* - it
>> seems to be asking "is this an address of a slot in our
>> ChunkedHandleList" rather than asking "is this oop in our
>> ChunkedHandleList". ??
>
> I actually wanted the assert to check whether the passed jobject (which
> really is an oop*) actually originated from a call to
> ClassLoaderData::add_handle. I don't want any code to pass a jobject to
> ClassLoaderData::remove_handle_unsafe that doesn't come from a call to
> ClassLoaderData::add_handle.
>
>> A few minor comments:
>>
>> Copyrights need updating.
>
> Aaah, it is that time of the year again. Fixed.
>
>> src/share/vm/classfile/classLoaderData.hpp
>>
>> 177 // Only on thread ...
>>
>> Typo: on -> one
>
> Thanks, fixed.
>
>> src/share/vm/classfile/moduleEntry.cpp
>>
>> 90 // Create a JNI handle for the shared ProtectionDomain and save it
>> atomically.
>> 91 // If someone beats us setting the _pd cache, the created JNI
>> handle is destroyed.
>>
>> These are no longer JNI Handles.
>
> Thanks, I updated the comment.
>
> Thanks,
> Erik
>
>> Thanks,
>> David
>> -----
>>
>>> Testing:
>>> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>>>
>>> I would appreciate quite a few reviews on this patch, it contains a nice
>>> mixture of:
>>> - me venturing into the runtime code :)
>>> - lock free code
>>> - unreproducible bug (even though I'm sure of the root cause)
>>>
>>> Thanks for everyone participating in the discussions around this bug and
>>> potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and
>>> Jiangli.
>>>
>>> Thanks!
>>> Erik
>>>
>>> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914
More information about the hotspot-dev
mailing list