RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Wed Nov 28 07:35:17 UTC 2018

#$%$#@! I missed the fact that the thread id is set _after_ we do 
record_stack_base_and-size(). I thought that was in call_run() not 
native_thread_entry().

<sigh> New bug being filed.

Thanks,
David

On 28/11/2018 5:26 pm, David Holmes wrote:
> Hi Thomas,
> 
> On 28/11/2018 4:30 pm, Thomas Stüfe wrote:
>> Hi David,
>>
>> I admit I still do not really get it..
>>
>> E.g. for GC threads:
>>
>> We have
>>
>> static ConcurrentMarkSweepThread*
>> ConcurrentMarkSweepThread::start(CMSCollector* collector);
>>
>> which creates a ConcurrentMarkSweepThread and then calls
>> os::create_thread() to hook it up with an OSThread and start it.
>>
>> ConcurrentMarkSweepThread derives from NonJavaThread, which in its
>> constructor adds the thread object to the list.
>>
>> So there is a time gap where the NJT is part of the list, but
>> Thread::_osthread is still NULL.
>>
>> In ThreadTimesClosure::do_thread(), we call
>> os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(), 
>>
>> &clockid);
>>
>> Should we then not crash when dereferencing the NULL
>> osthread()->pthread_id()? Why do we crash inside
>> pthread_getcpuclockid?
>>
>> If I look further into os::create_thread(), I see that there is
>> another, smaller time gap where we create OSThread and anchor it into
>> the Thread object:
>>
>>    thread->set_osthread(osthread);
>>
>> and then later, after pthread_create is thru, set its thread id:
>>
>>      // Store pthread info into the OSThread
>>      osthread->set_pthread_id(tid);
>>
>> When OSThread is created, we call OSThread::pd_initialize() and set
>> its _threadid to 0. We do this in the constructor, before anchoring
>> the OSThread in its Thread.
>>
>> So for my understanding, we should have two situations:
>>
>> (1)- NJT is in list, but its _osthread member is NULL. In that case I
>> would expect a different crash.
>> (2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
> 
> Yes both situations may be possible. In the related bug report only the 
> 0 thread_id was observed.
> 
>> (Modulo any concurrency issues, e.g. could the
>> "thread->set_osthread(osthread);" be visible before OSThread is
>> initialized?
>>
>> Depending on what is happening, a fix for (1) would probably be a
>> querying for osthread==NULL, a fix for (2) would be to guard os::---
>> functions - well, at least os::thread_cpu_time() - to disregard
>> Threads with pthread_t == 0. Both fixes seem better to me than
>> querying the stacksize() when walking the list - that seems a bit
>> awkward.
> 
> I find it awkward that the list is populated with partially constructed 
> threads in the first place. This particular code is only interested in 
> live threads that can safely be queried. The stack_size() check is a 
> surrogate for "this is a live thread that can be queried".
> 
>> -- 
>> P.s.
>>
>> ConcurrentGCThread::ConcurrentGCThread() :
>>    _should_terminate(false), _has_terminated(false) {
>> };
>>
>> I was surprised to see no invocation to the base class ctor in the
>> initializer list. I was not aware that this was even possible. For
>> code clearness, I would prefer the call to the base class ctor to be
>> explicit.)
> 
> I assume it is implicit. But yes it should be explicit.
> 
> Cheers,
> David
> 
>> -- 
>>
>> Cheers, Thomas
>>
>>
>> On Wed, Nov 28, 2018 at 3:58 AM David Holmes <david.holmes at oracle.com> 
>> wrote:
>>>
>>> Sorry for the delayed response.
>>>
>>> On 21/11/2018 3:01 pm, Kim Barrett wrote:
>>>>> On Nov 20, 2018, at 3:50 AM, David Holmes <david.holmes at oracle.com> 
>>>>> wrote:
>>>>>
>>>>> After discussions with Kim I've decided to split out the NJT list 
>>>>> update into a separate RFE:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8214097
>>>>>
>>>>> So only the change in management.cpp needs reviewing and testing.
>>>>>
>>>>> Updated webrev:
>>>>>
>>>>> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
>>>>
>>>> Looks good.
>>>
>>> Thanks Kim.
>>>
>>> I've decided to stick with this simple fix for NJTs only.
>>>
>>> David
>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 20/11/2018 10:01 am, David Holmes wrote:
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
>>>>>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
>>>>>> There is an internal management API that reports CPU times for 
>>>>>> NonJavaThreads (NJTs). That functionality requires a valid/live 
>>>>>> target thread so that we can use its pthread_t identity to obtain 
>>>>>> its CPU clock via pthread_getcpuclockid().
>>>>>> There is an iteration mechanism for NJTs in which the NJT is 
>>>>>> registered during its constructor and de-registered during its 
>>>>>> destructor. A thread that has only been constructed has not yet 
>>>>>> executed and so is not a valid target for this management API. 
>>>>>> This seems to be the cause of failures reported in this bug (and 
>>>>>> JDK-8213434). Registering a NJT only when it starts executing is 
>>>>>> an appealing fix for this, but that impacts all current users of 
>>>>>> the NJT list and straight-away causes a problem with the 
>>>>>> BarrierSet initialization logic. So I don't attempt that.
>>>>>> Instead the first part of the fix is for 
>>>>>> ThreadTimesClosure::do_thread to skip threads that have not yet 
>>>>>> executed - which we can recognize by seeing an uninitialized (i.e. 
>>>>>> zero) stackbase.
>>>>>> A second part of the fix, which can be deferred to a separate RFE 
>>>>>> for NJT lifecycle management if desired, tackles the problem of 
>>>>>> encountering a terminated thread during iteration - which can also 
>>>>>> lead to SEGVs. This can arise because NJT's are not actually 
>>>>>> "destructed", even if they terminate, and so they never get 
>>>>>> removed from the NJT list. Calling destructors is problematic 
>>>>>> because the code using these NJTs assume they are always valid. So 
>>>>>> the fix in this case is to move the de-registering from the NJT 
>>>>>> list out of the destructor and into the Thread::call_run() method 
>>>>>> so it is done before a thread actually terminates. This can be 
>>>>>> considered a first step in cleaning up the NJT lifecycle, where 
>>>>>> the remaining steps touch on a lot of areas and so need to be 
>>>>>> handled separately e.g. see JDK-8087340 for shutting down WorkGang 
>>>>>> GC worker threads.
>>>>>> Testing: tiers 1 -3
>>>>>> I should point out that I've been unable to reproduce this failure 
>>>>>> locally, even after thousands of runs. I'm hoping Zhengyu can test 
>>>>>> this in the conditions reported in JDK-8213434.
>>>>>> Thanks,
>>>>>> David
>>>>
>>>>