RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Tue Nov 20 10:50:30 UTC 2018

On Tue, Nov 20, 2018 at 11:38 AM David Holmes <david.holmes at oracle.com> wrote:
>
> Hi Thomas,
>
> Thanks for taking a look at this.
>
> On 20/11/2018 8:20 pm, Thomas Stüfe wrote:
> > Hi David,
> >
> > out of interest, would pthread_getcpuclockid also crash when fed an
> > pthread_t of an existing thread which has been suspended and not yet
> > started (on Solaris, AIX)? Not that it would have much to report :)
>
> I don't know for certain but I would certainly hope not! Once the
> pthread_t has been returned to the caller it should be valid. Note
> however on Solaris we don't normally use pthreads.
>
> > --
> >
> > 1643   // exclude externally visible JavaThreads
> > 1644   if (thread->is_Java_thread() &&
> > !thread->is_hidden_from_external_view()) {
> > 1645     return;
> > 1646   }
> > 1647
> > 1648   // NonJavaThread instances may not be fully initialized yet, so
> > we need to
> > 1649   // skip any that aren't - check for zero stack_size()
> > 1650   if (!thread->is_Java_thread() && thread->stack_size() == 0) {
> > 1651     return;
> > 1652   }
> >
> > So do we really want to skip the check for externally hidden java
> > threads? If not, would not this b better:
> >
> > 1648   // Thread instances may not be fully initialized yet, so we need to
> > 1649   // skip any that aren't - check for zero stack_size()
> > 1650   if (thread->stack_size() == 0) {
> > 1651     return;
> > 1652   }
>
> JavaThreads can't be "not fully initialized" as they don't appear in the
> ThreadsList until they are fully initialized.

Are you sure?

See e.g. CompileBroker::make_thread() :

      Threads::add(thread);
      Thread::start(thread);

since stack_size is still 0 before Thread::start() on platforms where
the threads are created suspended, you could encounter threads with
stack_size==0. I do not know whether this translates to an invalid
pthread_t handle though (hence my question about suspended threads).

> So I prefer to be clear
> this only affects NJTs.
>
> > --
> >
> > I agree an "is_initialized" flag would be better. We do have the
> > ThreadState in the associated OsThread, could that be used?
>
> I suppose it could use:
>
> thread->osThread()->get_state() >= RUNNABLE
>
> but OSThread ThreadState is a bit of an anachronism - as it states in
> the header:
>
> // Note: the ThreadState is legacy code and is not correctly implemented.
> // Uses of ThreadState need to be replaced by the state in the JavaThread.
>
> so I'd rather not use that. Checking the stack_size has been set ensures
> the thread has reached as "reasonable" point in its execution. It's
> later than strictly necessary but we're racing regardless.
>

Querying for thread_stack==0 is certainly fine at this point.

Thanks, Thomas

> Thanks,
> David
> -----
>
> > Thanks, Thomas
> > On Tue, Nov 20, 2018 at 9:53 AM David Holmes <david.holmes at oracle.com> wrote:
> >>
> >> After discussions with Kim I've decided to split out the NJT list update
> >> into a separate RFE:
> >>
> >> https://bugs.openjdk.java.net/browse/JDK-8214097
> >>
> >> So only the change in management.cpp needs reviewing and testing.
> >>
> >> Updated webrev:
> >>
> >> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
> >>
> >> Thanks,
> >> David
> >>
> >> On 20/11/2018 10:01 am, David Holmes wrote:
> >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
> >>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
> >>>
> >>> There is an internal management API that reports CPU times for
> >>> NonJavaThreads (NJTs). That functionality requires a valid/live target
> >>> thread so that we can use its pthread_t identity to obtain its CPU clock
> >>> via pthread_getcpuclockid().
> >>>
> >>> There is an iteration mechanism for NJTs in which the NJT is registered
> >>> during its constructor and de-registered during its destructor. A thread
> >>> that has only been constructed has not yet executed and so is not a
> >>> valid target for this management API. This seems to be the cause of
> >>> failures reported in this bug (and JDK-8213434). Registering a NJT only
> >>> when it starts executing is an appealing fix for this, but that impacts
> >>> all current users of the NJT list and straight-away causes a problem
> >>> with the BarrierSet initialization logic. So I don't attempt that.
> >>>
> >>> Instead the first part of the fix is for ThreadTimesClosure::do_thread
> >>> to skip threads that have not yet executed - which we can recognize by
> >>> seeing an uninitialized (i.e. zero) stackbase.
> >>>
> >>> A second part of the fix, which can be deferred to a separate RFE for
> >>> NJT lifecycle management if desired, tackles the problem of encountering
> >>> a terminated thread during iteration - which can also lead to SEGVs.
> >>> This can arise because NJT's are not actually "destructed", even if they
> >>> terminate, and so they never get removed from the NJT list. Calling
> >>> destructors is problematic because the code using these NJTs assume they
> >>> are always valid. So the fix in this case is to move the de-registering
> >>> from the NJT list out of the destructor and into the Thread::call_run()
> >>> method so it is done before a thread actually terminates. This can be
> >>> considered a first step in cleaning up the NJT lifecycle, where the
> >>> remaining steps touch on a lot of areas and so need to be handled
> >>> separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
> >>> threads.
> >>>
> >>> Testing: tiers 1 -3
> >>>
> >>> I should point out that I've been unable to reproduce this failure
> >>> locally, even after thousands of runs. I'm hoping Zhengyu can test this
> >>> in the conditions reported in JDK-8213434.
> >>>
> >>> Thanks,
> >>> David