RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Tue Nov 20 10:38:06 UTC 2018

Hi Thomas,

Thanks for taking a look at this.

On 20/11/2018 8:20 pm, Thomas Stüfe wrote:
> Hi David,
> 
> out of interest, would pthread_getcpuclockid also crash when fed an
> pthread_t of an existing thread which has been suspended and not yet
> started (on Solaris, AIX)? Not that it would have much to report :)

I don't know for certain but I would certainly hope not! Once the 
pthread_t has been returned to the caller it should be valid. Note 
however on Solaris we don't normally use pthreads.

> --
> 
> 1643   // exclude externally visible JavaThreads
> 1644   if (thread->is_Java_thread() &&
> !thread->is_hidden_from_external_view()) {
> 1645     return;
> 1646   }
> 1647
> 1648   // NonJavaThread instances may not be fully initialized yet, so
> we need to
> 1649   // skip any that aren't - check for zero stack_size()
> 1650   if (!thread->is_Java_thread() && thread->stack_size() == 0) {
> 1651     return;
> 1652   }
> 
> So do we really want to skip the check for externally hidden java
> threads? If not, would not this b better:
> 
> 1648   // Thread instances may not be fully initialized yet, so we need to
> 1649   // skip any that aren't - check for zero stack_size()
> 1650   if (thread->stack_size() == 0) {
> 1651     return;
> 1652   }

JavaThreads can't be "not fully initialized" as they don't appear in the 
ThreadsList until they are fully initialized. So I prefer to be clear 
this only affects NJTs.

> --
> 
> I agree an "is_initialized" flag would be better. We do have the
> ThreadState in the associated OsThread, could that be used?

I suppose it could use:

thread->osThread()->get_state() >= RUNNABLE

but OSThread ThreadState is a bit of an anachronism - as it states in 
the header:

// Note: the ThreadState is legacy code and is not correctly implemented.
// Uses of ThreadState need to be replaced by the state in the JavaThread.

so I'd rather not use that. Checking the stack_size has been set ensures 
the thread has reached as "reasonable" point in its execution. It's 
later than strictly necessary but we're racing regardless.

Thanks,
David
-----

> Thanks, Thomas
> On Tue, Nov 20, 2018 at 9:53 AM David Holmes <david.holmes at oracle.com> wrote:
>>
>> After discussions with Kim I've decided to split out the NJT list update
>> into a separate RFE:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8214097
>>
>> So only the change in management.cpp needs reviewing and testing.
>>
>> Updated webrev:
>>
>> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
>>
>> Thanks,
>> David
>>
>> On 20/11/2018 10:01 am, David Holmes wrote:
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
>>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
>>>
>>> There is an internal management API that reports CPU times for
>>> NonJavaThreads (NJTs). That functionality requires a valid/live target
>>> thread so that we can use its pthread_t identity to obtain its CPU clock
>>> via pthread_getcpuclockid().
>>>
>>> There is an iteration mechanism for NJTs in which the NJT is registered
>>> during its constructor and de-registered during its destructor. A thread
>>> that has only been constructed has not yet executed and so is not a
>>> valid target for this management API. This seems to be the cause of
>>> failures reported in this bug (and JDK-8213434). Registering a NJT only
>>> when it starts executing is an appealing fix for this, but that impacts
>>> all current users of the NJT list and straight-away causes a problem
>>> with the BarrierSet initialization logic. So I don't attempt that.
>>>
>>> Instead the first part of the fix is for ThreadTimesClosure::do_thread
>>> to skip threads that have not yet executed - which we can recognize by
>>> seeing an uninitialized (i.e. zero) stackbase.
>>>
>>> A second part of the fix, which can be deferred to a separate RFE for
>>> NJT lifecycle management if desired, tackles the problem of encountering
>>> a terminated thread during iteration - which can also lead to SEGVs.
>>> This can arise because NJT's are not actually "destructed", even if they
>>> terminate, and so they never get removed from the NJT list. Calling
>>> destructors is problematic because the code using these NJTs assume they
>>> are always valid. So the fix in this case is to move the de-registering
>>> from the NJT list out of the destructor and into the Thread::call_run()
>>> method so it is done before a thread actually terminates. This can be
>>> considered a first step in cleaning up the NJT lifecycle, where the
>>> remaining steps touch on a lot of areas and so need to be handled
>>> separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
>>> threads.
>>>
>>> Testing: tiers 1 -3
>>>
>>> I should point out that I've been unable to reproduce this failure
>>> locally, even after thousands of runs. I'm hoping Zhengyu can test this
>>> in the conditions reported in JDK-8213434.
>>>
>>> Thanks,
>>> David