RFR: 8213231: ThreadSnapshot::_threadObj can become stale
Erik Helin
erik.helin at oracle.com
Thu Jan 24 08:05:27 UTC 2019
On 1/24/19 1:24 AM, David Holmes wrote:
> Thanks for the updates Erik - they generally look fine. One typo in the
> header file:
>
> // ThreadSnasphot instances should only be created via
>
> Snasphot -> Snapshot
Thanks for re-reviewing, I will fix the typo before I push.
I will push this to jdk/jdk while waiting for the jdk12-fix-request to
hopefully get approved. I have to talked Jesper and he is fine with me
pushing this patch to jdk/jdk and then manually backporting to jdk/jdk12
if the fix request gets approved (this could, but most likely won't,
result in some merge work for Jesper).
Thanks,
Erik
> Cheers,
> David
>
> On 24/01/2019 12:17 am, Erik Helin wrote:
>> On 1/22/19 11:03 PM, David Holmes wrote:
>>> Hi Erik,
>>
>> Hey David, thanks for reviewing!
>>
>>> Good find! This bug made an interesting read. Makes me wonder how we
>>> may have caught this kind of error sooner? Some kind of unhandled oop
>>> check? NoSafepointVerifier?
>>
>> Using -XX:+CheckUnhandledOops (and compiling with
>> -DCHECK_UNHANDLED_OOPS) would have caught the issue, but not exactly
>> the way we want it. If you look in UnhandledOops::clear_unhandled_oops
>> you see that it does not assert if an oop is held over a safepoint, it
>> will overwrite the oopDesc* is with BAD_OOP_ADDR (0xfffffff1).
>> UnhandledOops does this because it is okay to hold on to an oop over a
>> safepoint *if* you remember to update the oop again after the
>> safepoint (this pattern is used at some places in the VM).
>>
>> In the case of this particular bug, we would have crashed in the
>> ThreadInfo constructor because the java.lang.Thread pointer would have
>> had the value BAD_OOP_ADDR. This means that we still would have had to
>> figure out how that java.lang.Thread pointer got the value
>> BAD_OOP_ADDR. This is better than to track down why the
>> ThreadInfo.threadName field was stale, but still usually require quite
>> a bit of archeology in the core file.
>>
>> Adding a NoSafepointVerifier at the top of the constructor for
>> ThreadSnapshot would have helped. But if we would have realized that
>> we needed a NoSafepointVerifier, then it would probably have been
>> better to just use a Handle instead of an oop for the
>> ThreadSnapshot::_threadObj field. In this particular case we could
>> have placed a HandleMark at the top of jmm_GetThreadInfo and
>> jmm_DumpThreads and then used a Handle for ThreadSnapshot::_threadObj.
>>
>> The problem with the Handle approach is that using a Handle for a
>> field in a class requires the instances of that class to always be
>> guarded by a HandleMark. This works fine for ThreadSnapshot as it is
>> used right now, but might not be an applicable technique for all
>> classes (for example a class that does not have a "lexical lifetime"
>> and thus can't be guarded by a HandleMark).
>>
>> I have discussed this a bit with Erik Ö and StefanK and we think a
>> more general solution would be to build something on top of OopStorage
>> (and potentially combine that with oopHandle). This way you get a safe
>> and flexible handle for fields in classes. This would probably work
>> for all but the most performance sensitive scenarios (we can probably
>> create some special handles for those cases).
>>> On 23/01/2019 12:59 am, Erik Helin wrote:
>>>> Hi all,
>>>>
>>>> this patch fixes a problem when the oop in
>>>> ThreadSnapshot::_threadObj can become stale. The issue is that the
>>>> ThreadSnapshot::oops_do method only gets called when a
>>>> ThreadSnapshot instance has been registered in a ThreadDumpResult
>>>> (via the ThreadDumpResult::add_thread_snapshot method). But, in
>>>> order to register a ThreadSnapshot instance, you must first create
>>>> it. The problem is that the ThreadSnapshot constructor first sets
>>>> _threadObj to thread->threadObj() and then further down might call
>>>> ObjectSynchronizer:: get_lock_owner. The call to
>>>> ObjectSynchronizer:: get_lock_owner can result in a VM_RevokeBias VM
>>>> operation being executed. If a GC VM operation already is enqueued,
>>>> then that GC VM operation will run when the VM_RevokeBias VM
>>>> operation is executed. That GC VM operation will not update the oop
>>>> in ThreadSafepoint::_threadObj, because that ThreadSnapshot instance
>>>> has not yet been registered in any ThreadDumpResult (recall that the
>>>> ThreadSafepoint is being constructed), so the GC has no way to find
>>>> it. The oop in ThreadSafepoint::_threadObj will then become dangling
>>>> which most likely will cause the JVM to get a SIGSEGV some time later.
>>>
>>> _blocker_object could suffer the same fate, and possibly
>>> _blocker_object_owner if there could be other paths leading to a
>>> safepoint.
>>
>> Yes, that is correct.
>>>> The issue was found when debugging why an instance of
>>>> java/lang/management/ThreadInfo on the Java heap had a stale pointer
>>>> in its threadName field. Turns out that the java.lang.Thread
>>>> instance passed to the ThreadInfo was stale most likely for the
>>>> reason outlined in the paragraph above.
>>>>
>>>> This patch fixes the issue by ensuring that a ThreadSnapshot is
>>>> always registered in a ThreadDumpResult before the initialization of
>>>> the ThreadSnapshot is done. This ensures that the GC will always be
>>>> able to find the oop ThreadSnapshot::_threadObj via
>>>> ThreadDumpResult::oops_do.
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~ehelin/8213231/00/
>>>
>>> Explanation sounded simple enough - actual change was a little harder
>>> to follow :)
>>>
>>> src/hotspot/share/services/threadService.cpp
>>>
>>> Initially I was concerned about ensuring all oop fields were NULL but
>>> I see the ThreadSnapshot constructor handles that. That means some of
>>> the NULL setting in initialize is redundant:
>>>
>>> 854 _stack_trace = NULL;
>>> 855 _concurrent_locks = NULL;
>>> 856 _next = NULL;
>>> 866 _blocker_object = NULL;
>>> 867 _blocker_object_owner = NULL;
>>
>> Nice catch, I missed removing those assignments, fixed that in version
>> 01.
>>
>>> src/hotspot/share/services/threadService.hpp
>>>
>>> The new initialize function doesn't need to be public.
>>
>> It does, because it is used from the class ThreadDumpResult, see
>> ThreadDumpResult::add_thread_snapshot. If we want the constructor for
>> ThreadSnapshot and ThreadSnapshot::initialize to be private, then
>> ThreadSnapshot must friend ThreadDumpResult. I did this change for 01,
>> the added benefit is that the only way to create a ThreadSnapshot* is
>> now through ThreadDumpResult::add_thread_snapshot (the ThreadSnapshot
>> constructor is private). This means we have ensured that exactly this
>> bug can't happen again :)
>>
>> > src/hotspot/share/services/threadService.hpp
>>> + void link_thread_snapshot(ThreadSnapshot* ts);
>>>
>>> Please follow the existing layout style in that code and shift the
>>> method name across to align with others.
>>
>> I tried to follow the existing style as much as possible, but the code
>> in threadService.hpp contradicts itself in a few places :) See if you
>> prefer the way I indented the function names in 01 (and don't ask me
>> for my opinion on vertically aligning fields and methods in class
>> declarations ;).
>>
>> Please see new webrevs at:
>> - full: http://cr.openjdk.java.net/~ehelin/8213231/01/
>> - inc: http://cr.openjdk.java.net/~ehelin/8213231/00-01
>>
>> Thanks,
>> Erik
>>
>>> Thanks,
>>> David
>>>
>>>
>>>>
>>>> Issue:
>>>> https://bugs.openjdk.java.net/browse/JDK-8213231
>>>>
>>>> Testing:
>>>> - Tier 1, 2 and 3 on Windows, Mac, Linux (all x86-64)
>>>> - RunThese30M (multiple runs) and RunThese24h on Linux x86-64
>>>> (please note that I never managed to reproduce the issue, all
>>>> analysis was done based on a core file)
>>>>
>>>> Thanks,
>>>> Erik
More information about the hotspot-runtime-dev
mailing list