RFR: 8213231: ThreadSnapshot::_threadObj can become stale

David Holmes david.holmes at oracle.com
Thu Jan 24 00:24:21 UTC 2019


Thanks for the updates Erik - they generally look fine. One typo in the 
header file:

// ThreadSnasphot instances should only be created via

Snasphot -> Snapshot

Cheers,
David

On 24/01/2019 12:17 am, Erik Helin wrote:
> On 1/22/19 11:03 PM, David Holmes wrote:
>> Hi Erik,
> 
> Hey David, thanks for reviewing!
> 
>> Good find! This bug made an interesting read. Makes me wonder how we 
>> may have caught this kind of error sooner? Some kind of unhandled oop 
>> check? NoSafepointVerifier?
> 
> Using -XX:+CheckUnhandledOops (and compiling with 
> -DCHECK_UNHANDLED_OOPS) would have caught the issue, but not exactly the 
> way we want it. If you look in UnhandledOops::clear_unhandled_oops you 
> see that it does not assert if an oop is held over a safepoint, it will 
> overwrite the oopDesc* is with BAD_OOP_ADDR (0xfffffff1). UnhandledOops 
> does this because it is okay to hold on to an oop over a safepoint *if* 
> you remember to update the oop again after the safepoint (this pattern 
> is used at some places in the VM).
> 
> In the case of this particular bug, we would have crashed in the 
> ThreadInfo constructor because the java.lang.Thread pointer would have 
> had the value BAD_OOP_ADDR. This means that we still would have had to 
> figure out how that java.lang.Thread pointer got the value BAD_OOP_ADDR. 
> This is better than to track down why the ThreadInfo.threadName field 
> was stale, but still usually require quite a bit of archeology in the 
> core file.
> 
> Adding a NoSafepointVerifier at the top of the constructor for 
> ThreadSnapshot would have helped. But if we would have realized that we 
> needed a NoSafepointVerifier, then it would probably have been better to 
> just use a Handle instead of an oop for the ThreadSnapshot::_threadObj 
> field. In this particular case we could have placed a HandleMark at the 
> top of jmm_GetThreadInfo and jmm_DumpThreads and then used a Handle for 
> ThreadSnapshot::_threadObj.
> 
> The problem with the Handle approach is that using a Handle for a field 
> in a class requires the instances of that class to always be guarded by 
> a HandleMark. This works fine for ThreadSnapshot as it is used right 
> now, but might not be an applicable technique for all classes (for 
> example a class that does not have a "lexical lifetime" and thus can't 
> be guarded by a HandleMark).
> 
> I have discussed this a bit with Erik Ö and StefanK and we think a more 
> general solution would be to build something on top of OopStorage (and 
> potentially combine that with oopHandle). This way you get a safe and 
> flexible handle for fields in classes. This would probably work for all 
> but the most performance sensitive scenarios (we can probably create 
> some special handles for those cases).
>> On 23/01/2019 12:59 am, Erik Helin wrote:
>>> Hi all,
>>>
>>> this patch fixes a problem when the oop in ThreadSnapshot::_threadObj 
>>> can become stale. The issue is that the ThreadSnapshot::oops_do 
>>> method only gets called when a ThreadSnapshot instance has been 
>>> registered in a ThreadDumpResult (via the 
>>> ThreadDumpResult::add_thread_snapshot method). But, in order to 
>>> register a ThreadSnapshot instance, you must first create it. The 
>>> problem is that the ThreadSnapshot constructor first sets _threadObj 
>>> to thread->threadObj() and then further down might call 
>>> ObjectSynchronizer:: get_lock_owner. The call to ObjectSynchronizer:: 
>>> get_lock_owner can result in a VM_RevokeBias VM operation being 
>>> executed. If a GC VM operation already is enqueued, then that GC VM 
>>> operation will run when the VM_RevokeBias VM operation is executed. 
>>> That GC VM operation will not update the oop in 
>>> ThreadSafepoint::_threadObj, because that ThreadSnapshot instance has 
>>> not yet been registered in any ThreadDumpResult (recall that the 
>>> ThreadSafepoint is being constructed), so the GC has no way to find 
>>> it. The oop in ThreadSafepoint::_threadObj will then become dangling 
>>> which most likely will cause the JVM to get a SIGSEGV some time later.
>>
>> _blocker_object could suffer the same fate, and possibly 
>> _blocker_object_owner if there could be other paths leading to a 
>> safepoint.
> 
> Yes, that is correct.
>>> The issue was found when debugging why an instance of 
>>> java/lang/management/ThreadInfo on the Java heap had a stale pointer 
>>> in its threadName field. Turns out that the java.lang.Thread instance 
>>> passed to the ThreadInfo was stale most likely for the reason 
>>> outlined in the paragraph above.
>>>
>>> This patch fixes the issue by ensuring that a ThreadSnapshot is 
>>> always registered in a ThreadDumpResult before the initialization of 
>>> the ThreadSnapshot is done. This ensures that the GC will always be 
>>> able to find the oop ThreadSnapshot::_threadObj via 
>>> ThreadDumpResult::oops_do.
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~ehelin/8213231/00/
>>
>> Explanation sounded simple enough - actual change was a little harder 
>> to follow :)
>>
>> src/hotspot/share/services/threadService.cpp
>>
>> Initially I was concerned about ensuring all oop fields were NULL but 
>> I see the ThreadSnapshot constructor handles that. That means some of 
>> the NULL setting in initialize is redundant:
>>
>>   854   _stack_trace = NULL;
>>   855   _concurrent_locks = NULL;
>>   856   _next = NULL;
>>   866   _blocker_object = NULL;
>>   867   _blocker_object_owner = NULL;
> 
> Nice catch, I missed removing those assignments, fixed that in version 01.
> 
>> src/hotspot/share/services/threadService.hpp
>>
>> The new initialize function doesn't need to be public.
> 
> It does, because it is used from the class ThreadDumpResult, see 
> ThreadDumpResult::add_thread_snapshot. If we want the constructor for 
> ThreadSnapshot and ThreadSnapshot::initialize to be private, then 
> ThreadSnapshot must friend ThreadDumpResult. I did this change for 01, 
> the added benefit is that the only way to create a ThreadSnapshot* is 
> now through ThreadDumpResult::add_thread_snapshot (the ThreadSnapshot 
> constructor is private). This means we have ensured that exactly this 
> bug can't happen again :)
> 
>  > src/hotspot/share/services/threadService.hpp
>> +   void link_thread_snapshot(ThreadSnapshot* ts);
>>
>> Please follow the existing layout style in that code and shift the 
>> method name across to align with others.
> 
> I tried to follow the existing style as much as possible, but the code 
> in threadService.hpp contradicts itself in a few places :) See if you 
> prefer the way I indented the function names in 01 (and don't ask me for 
> my opinion on vertically aligning fields and methods in class 
> declarations ;).
> 
> Please see new webrevs at:
> - full: http://cr.openjdk.java.net/~ehelin/8213231/01/
> - inc: http://cr.openjdk.java.net/~ehelin/8213231/00-01
> 
> Thanks,
> Erik
> 
>> Thanks,
>> David
>>
>>
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8213231
>>>
>>> Testing:
>>> - Tier 1, 2 and 3 on Windows, Mac, Linux (all x86-64)
>>> - RunThese30M (multiple runs) and RunThese24h on Linux x86-64
>>>    (please note that I never managed to reproduce the issue, all 
>>> analysis was done based on a core file)
>>>
>>> Thanks,
>>> Erik


More information about the hotspot-runtime-dev mailing list