RFR: 8213231: ThreadSnapshot::_threadObj can become stale

Erik Helin erik.helin at oracle.com
Wed Jan 23 14:17:28 UTC 2019


On 1/22/19 11:03 PM, David Holmes wrote:
> Hi Erik,

Hey David, thanks for reviewing!

> Good find! This bug made an interesting read. Makes me wonder how we may 
> have caught this kind of error sooner? Some kind of unhandled oop check? 
> NoSafepointVerifier?

Using -XX:+CheckUnhandledOops (and compiling with 
-DCHECK_UNHANDLED_OOPS) would have caught the issue, but not exactly the 
way we want it. If you look in UnhandledOops::clear_unhandled_oops you 
see that it does not assert if an oop is held over a safepoint, it will 
overwrite the oopDesc* is with BAD_OOP_ADDR (0xfffffff1). UnhandledOops 
does this because it is okay to hold on to an oop over a safepoint *if* 
you remember to update the oop again after the safepoint (this pattern 
is used at some places in the VM).

In the case of this particular bug, we would have crashed in the 
ThreadInfo constructor because the java.lang.Thread pointer would have 
had the value BAD_OOP_ADDR. This means that we still would have had to 
figure out how that java.lang.Thread pointer got the value BAD_OOP_ADDR. 
This is better than to track down why the ThreadInfo.threadName field 
was stale, but still usually require quite a bit of archeology in the 
core file.

Adding a NoSafepointVerifier at the top of the constructor for 
ThreadSnapshot would have helped. But if we would have realized that we 
needed a NoSafepointVerifier, then it would probably have been better to 
just use a Handle instead of an oop for the ThreadSnapshot::_threadObj 
field. In this particular case we could have placed a HandleMark at the 
top of jmm_GetThreadInfo and jmm_DumpThreads and then used a Handle for 
ThreadSnapshot::_threadObj.

The problem with the Handle approach is that using a Handle for a field 
in a class requires the instances of that class to always be guarded by 
a HandleMark. This works fine for ThreadSnapshot as it is used right 
now, but might not be an applicable technique for all classes (for 
example a class that does not have a "lexical lifetime" and thus can't 
be guarded by a HandleMark).

I have discussed this a bit with Erik Ö and StefanK and we think a more 
general solution would be to build something on top of OopStorage (and 
potentially combine that with oopHandle). This way you get a safe and 
flexible handle for fields in classes. This would probably work for all 
but the most performance sensitive scenarios (we can probably create 
some special handles for those cases).
> On 23/01/2019 12:59 am, Erik Helin wrote:
>> Hi all,
>>
>> this patch fixes a problem when the oop in ThreadSnapshot::_threadObj 
>> can become stale. The issue is that the ThreadSnapshot::oops_do method 
>> only gets called when a ThreadSnapshot instance has been registered in 
>> a ThreadDumpResult (via the ThreadDumpResult::add_thread_snapshot 
>> method). But, in order to register a ThreadSnapshot instance, you must 
>> first create it. The problem is that the ThreadSnapshot constructor 
>> first sets _threadObj to thread->threadObj() and then further down 
>> might call ObjectSynchronizer:: get_lock_owner. The call to 
>> ObjectSynchronizer:: get_lock_owner can result in a VM_RevokeBias VM 
>> operation being executed. If a GC VM operation already is enqueued, 
>> then that GC VM operation will run when the VM_RevokeBias VM operation 
>> is executed. That GC VM operation will not update the oop in 
>> ThreadSafepoint::_threadObj, because that ThreadSnapshot instance has 
>> not yet been registered in any ThreadDumpResult (recall that the 
>> ThreadSafepoint is being constructed), so the GC has no way to find 
>> it. The oop in ThreadSafepoint::_threadObj will then become dangling 
>> which most likely will cause the JVM to get a SIGSEGV some time later.
> 
> _blocker_object could suffer the same fate, and possibly 
> _blocker_object_owner if there could be other paths leading to a safepoint.

Yes, that is correct.
>> The issue was found when debugging why an instance of 
>> java/lang/management/ThreadInfo on the Java heap had a stale pointer 
>> in its threadName field. Turns out that the java.lang.Thread instance 
>> passed to the ThreadInfo was stale most likely for the reason outlined 
>> in the paragraph above.
>>
>> This patch fixes the issue by ensuring that a ThreadSnapshot is always 
>> registered in a ThreadDumpResult before the initialization of the 
>> ThreadSnapshot is done. This ensures that the GC will always be able 
>> to find the oop ThreadSnapshot::_threadObj via ThreadDumpResult::oops_do.
>>
>> Webrev:
>> http://cr.openjdk.java.net/~ehelin/8213231/00/
> 
> Explanation sounded simple enough - actual change was a little harder to 
> follow :)
> 
> src/hotspot/share/services/threadService.cpp
> 
> Initially I was concerned about ensuring all oop fields were NULL but I 
> see the ThreadSnapshot constructor handles that. That means some of the 
> NULL setting in initialize is redundant:
> 
>   854   _stack_trace = NULL;
>   855   _concurrent_locks = NULL;
>   856   _next = NULL;
>   866   _blocker_object = NULL;
>   867   _blocker_object_owner = NULL;

Nice catch, I missed removing those assignments, fixed that in version 01.

> src/hotspot/share/services/threadService.hpp
> 
> The new initialize function doesn't need to be public.

It does, because it is used from the class ThreadDumpResult, see 
ThreadDumpResult::add_thread_snapshot. If we want the constructor for 
ThreadSnapshot and ThreadSnapshot::initialize to be private, then 
ThreadSnapshot must friend ThreadDumpResult. I did this change for 01, 
the added benefit is that the only way to create a ThreadSnapshot* is 
now through ThreadDumpResult::add_thread_snapshot (the ThreadSnapshot 
constructor is private). This means we have ensured that exactly this 
bug can't happen again :)

 > src/hotspot/share/services/threadService.hpp
> +   void link_thread_snapshot(ThreadSnapshot* ts);
> 
> Please follow the existing layout style in that code and shift the 
> method name across to align with others.

I tried to follow the existing style as much as possible, but the code 
in threadService.hpp contradicts itself in a few places :) See if you 
prefer the way I indented the function names in 01 (and don't ask me for 
my opinion on vertically aligning fields and methods in class 
declarations ;).

Please see new webrevs at:
- full: http://cr.openjdk.java.net/~ehelin/8213231/01/
- inc: http://cr.openjdk.java.net/~ehelin/8213231/00-01

Thanks,
Erik

> Thanks,
> David
> 
> 
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8213231
>>
>> Testing:
>> - Tier 1, 2 and 3 on Windows, Mac, Linux (all x86-64)
>> - RunThese30M (multiple runs) and RunThese24h on Linux x86-64
>>    (please note that I never managed to reproduce the issue, all 
>> analysis was done based on a core file)
>>
>> Thanks,
>> Erik


More information about the hotspot-runtime-dev mailing list