RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

David Holmes david.holmes at oracle.com
Tue Nov 20 12:49:59 UTC 2018


Hi Thomas,

On 20/11/2018 8:50 pm, Thomas Stüfe wrote:
> On Tue, Nov 20, 2018 at 11:38 AM David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Thomas,
>>
>> Thanks for taking a look at this.
>>
>> On 20/11/2018 8:20 pm, Thomas Stüfe wrote:
>>> Hi David,
>>>
>>> out of interest, would pthread_getcpuclockid also crash when fed an
>>> pthread_t of an existing thread which has been suspended and not yet
>>> started (on Solaris, AIX)? Not that it would have much to report :)
>>
>> I don't know for certain but I would certainly hope not! Once the
>> pthread_t has been returned to the caller it should be valid. Note
>> however on Solaris we don't normally use pthreads.
>>
>>> --
>>>
>>> 1643   // exclude externally visible JavaThreads
>>> 1644   if (thread->is_Java_thread() &&
>>> !thread->is_hidden_from_external_view()) {
>>> 1645     return;
>>> 1646   }
>>> 1647
>>> 1648   // NonJavaThread instances may not be fully initialized yet, so
>>> we need to
>>> 1649   // skip any that aren't - check for zero stack_size()
>>> 1650   if (!thread->is_Java_thread() && thread->stack_size() == 0) {
>>> 1651     return;
>>> 1652   }
>>>
>>> So do we really want to skip the check for externally hidden java
>>> threads? If not, would not this b better:
>>>
>>> 1648   // Thread instances may not be fully initialized yet, so we need to
>>> 1649   // skip any that aren't - check for zero stack_size()
>>> 1650   if (thread->stack_size() == 0) {
>>> 1651     return;
>>> 1652   }
>>
>> JavaThreads can't be "not fully initialized" as they don't appear in the
>> ThreadsList until they are fully initialized.
> 
> Are you sure?
> 
> See e.g. CompileBroker::make_thread() :
> 
>        Threads::add(thread);
>        Thread::start(thread);
> 
> since stack_size is still 0 before Thread::start() on platforms where
> the threads are created suspended, you could encounter threads with
> stack_size==0. I do not know whether this translates to an invalid
> pthread_t handle though (hence my question about suspended threads).

Ah yes you are right - sorry. I didn't mean to imply that JavaThreads 
were guaranteed to have their stack set - that's certainly not true. For 
non-suspended-at-start platforms they are guaranteed to have executed 
some code. But as you note if started suspended then they may not be 
actually started before being seen in the ThreadsList.

For this particular case Solaris always uses the LWP id to read the 
lwpusage from /proc. I'm not aware of any reports that this has ever 
failed and I would not expect it to fail, but ...

AIX similarly uses a kernel-thread ID and a kernel-thread specific API.

Platform idiosyncrasies aside you may well be right that its better to 
ensure all the examined threads have at least executed to the point 
where they have set their own stack information.

Opinions from others?

>> So I prefer to be clear
>> this only affects NJTs.
>>
>>> --
>>>
>>> I agree an "is_initialized" flag would be better. We do have the
>>> ThreadState in the associated OsThread, could that be used?
>>
>> I suppose it could use:
>>
>> thread->osThread()->get_state() >= RUNNABLE
>>
>> but OSThread ThreadState is a bit of an anachronism - as it states in
>> the header:
>>
>> // Note: the ThreadState is legacy code and is not correctly implemented.
>> // Uses of ThreadState need to be replaced by the state in the JavaThread.
>>
>> so I'd rather not use that. Checking the stack_size has been set ensures
>> the thread has reached as "reasonable" point in its execution. It's
>> later than strictly necessary but we're racing regardless.
>>
> 
> Querying for thread_stack==0 is certainly fine at this point.

Okay - thanks.

David
-----

> Thanks, Thomas
> 
>> Thanks,
>> David
>> -----
>>
>>> Thanks, Thomas
>>> On Tue, Nov 20, 2018 at 9:53 AM David Holmes <david.holmes at oracle.com> wrote:
>>>>
>>>> After discussions with Kim I've decided to split out the NJT list update
>>>> into a separate RFE:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8214097
>>>>
>>>> So only the change in management.cpp needs reviewing and testing.
>>>>
>>>> Updated webrev:
>>>>
>>>> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 20/11/2018 10:01 am, David Holmes wrote:
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
>>>>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
>>>>>
>>>>> There is an internal management API that reports CPU times for
>>>>> NonJavaThreads (NJTs). That functionality requires a valid/live target
>>>>> thread so that we can use its pthread_t identity to obtain its CPU clock
>>>>> via pthread_getcpuclockid().
>>>>>
>>>>> There is an iteration mechanism for NJTs in which the NJT is registered
>>>>> during its constructor and de-registered during its destructor. A thread
>>>>> that has only been constructed has not yet executed and so is not a
>>>>> valid target for this management API. This seems to be the cause of
>>>>> failures reported in this bug (and JDK-8213434). Registering a NJT only
>>>>> when it starts executing is an appealing fix for this, but that impacts
>>>>> all current users of the NJT list and straight-away causes a problem
>>>>> with the BarrierSet initialization logic. So I don't attempt that.
>>>>>
>>>>> Instead the first part of the fix is for ThreadTimesClosure::do_thread
>>>>> to skip threads that have not yet executed - which we can recognize by
>>>>> seeing an uninitialized (i.e. zero) stackbase.
>>>>>
>>>>> A second part of the fix, which can be deferred to a separate RFE for
>>>>> NJT lifecycle management if desired, tackles the problem of encountering
>>>>> a terminated thread during iteration - which can also lead to SEGVs.
>>>>> This can arise because NJT's are not actually "destructed", even if they
>>>>> terminate, and so they never get removed from the NJT list. Calling
>>>>> destructors is problematic because the code using these NJTs assume they
>>>>> are always valid. So the fix in this case is to move the de-registering
>>>>> from the NJT list out of the destructor and into the Thread::call_run()
>>>>> method so it is done before a thread actually terminates. This can be
>>>>> considered a first step in cleaning up the NJT lifecycle, where the
>>>>> remaining steps touch on a lot of areas and so need to be handled
>>>>> separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
>>>>> threads.
>>>>>
>>>>> Testing: tiers 1 -3
>>>>>
>>>>> I should point out that I've been unable to reproduce this failure
>>>>> locally, even after thousands of runs. I'm hoping Zhengyu can test this
>>>>> in the conditions reported in JDK-8213434.
>>>>>
>>>>> Thanks,
>>>>> David


More information about the serviceability-dev mailing list