RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Tue Nov 20 08:50:46 UTC 2018

After discussions with Kim I've decided to split out the NJT list update 
into a separate RFE:

https://bugs.openjdk.java.net/browse/JDK-8214097

So only the change in management.cpp needs reviewing and testing.

Updated webrev:

http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/

Thanks,
David

On 20/11/2018 10:01 am, David Holmes wrote:
> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
> 
> There is an internal management API that reports CPU times for 
> NonJavaThreads (NJTs). That functionality requires a valid/live target 
> thread so that we can use its pthread_t identity to obtain its CPU clock 
> via pthread_getcpuclockid().
> 
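To illustrate the dependency on a live target thread, here is a minimal sketch of the POSIX calls involved, assuming Linux/glibc. The helper names (thread_cpu_time_ns, spin, demo_live_thread_cpu_time) are hypothetical and are not taken from the HotSpot sources; the sketch only shows that pthread_getcpuclockid() is given a pthread_t that refers to a thread that has been created and not yet joined.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <time.h>

// Hypothetical helper: CPU time consumed by `target` in nanoseconds, or -1 on
// failure. `target` must refer to a thread that exists; passing a stale or
// never-initialized pthread_t is undefined behaviour and can crash.
long long thread_cpu_time_ns(pthread_t target) {
  clockid_t cid;
  if (pthread_getcpuclockid(target, &cid) != 0) return -1;
  struct timespec ts;
  if (clock_gettime(cid, &ts) != 0) return -1;
  return static_cast<long long>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

static std::atomic<bool> g_stop{false};

// Worker body: spin until asked to stop, so the thread stays alive while queried.
static void* spin(void*) {
  while (!g_stop.load()) { /* burn a little CPU */ }
  return nullptr;
}

// Creates a worker, queries its CPU clock while it is still alive, then joins it.
long long demo_live_thread_cpu_time() {
  pthread_t tid;
  g_stop.store(false);
  if (pthread_create(&tid, nullptr, spin, nullptr) != 0) return -1;
  long long ns = thread_cpu_time_ns(tid);  // thread is live here
  g_stop.store(true);
  pthread_join(tid, nullptr);              // tid must not be queried after this
  return ns;
}
```

The query is made between pthread_create() and pthread_join(), which is the only window in which the pthread_t is guaranteed to be valid.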
> There is an iteration mechanism for NJTs in which the NJT is registered 
> during its constructor and de-registered during its destructor. A thread 
> that has only been constructed has not yet executed and so is not a 
> valid target for this management API. This seems to be the cause of 
> failures reported in this bug (and JDK-8213434). Registering an NJT only 
> when it starts executing is an appealing fix for this, but that impacts 
> all current users of the NJT list and immediately causes a problem 
> with the BarrierSet initialization logic. So I don't attempt that.
> 
> Instead the first part of the fix is for ThreadTimesClosure::do_thread 
> to skip threads that have not yet executed - which we can recognize by 
> seeing an uninitialized (i.e. zero) stackbase.
> 
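The idea of skipping not-yet-started threads can be sketched as follows. This is a standalone model, not the code in the webrev: ModelThread, ModelTimesClosure and report_started are hypothetical names, and the only assumption carried over from the description is that a thread which has never run still has a zero (null) stack base.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a registered non-Java thread; not the HotSpot
// Thread class.
struct ModelThread {
  std::string name;
  void* stack_base;  // stays nullptr until the thread starts executing
};

// Hypothetical closure that records the names of threads it is safe to query.
struct ModelTimesClosure {
  std::vector<std::string> reported;

  void do_thread(const ModelThread& t) {
    // A constructed-but-never-run thread has no stack recorded yet, so it has
    // no usable native identity: skip it rather than query its CPU clock.
    if (t.stack_base == nullptr) return;
    reported.push_back(t.name);
  }
};

// Applies the closure to every registered thread and returns what was reported.
std::vector<std::string> report_started(const std::vector<ModelThread>& list) {
  ModelTimesClosure closure;
  for (const ModelThread& t : list) closure.do_thread(t);
  return closure.reported;
}
```

In this model the check is a plain null test; a thread that is on the list but has not started is simply left out of the report.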
> A second part of the fix, which can be deferred to a separate RFE for 
> NJT lifecycle management if desired, tackles the problem of encountering 
> a terminated thread during iteration - which can also lead to SEGVs. 
> This can arise because NJTs are not actually "destructed", even if they 
> terminate, and so they never get removed from the NJT list. Calling 
> destructors is problematic because the code using these NJTs assumes they 
> are always valid. So the fix in this case is to move the de-registering 
> from the NJT list out of the destructor and into the Thread::call_run() 
> method so it is done before a thread actually terminates. This can be 
> considered a first step in cleaning up the NJT lifecycle, where the 
> remaining steps touch on a lot of areas and so need to be handled 
> separately e.g. see JDK-8087340 for shutting down WorkGang GC worker 
> threads.
> 
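The second part described above, de-registering at the end of the run wrapper rather than in the destructor, can be modelled roughly like this. ModelRegistry, ModelNJT and demo_registry_size_after_run are hypothetical names for illustration only; the real NJT list and Thread::call_run() are more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class ModelNJT;

// Hypothetical registry standing in for the NJT list.
class ModelRegistry {
  std::mutex lock_;
  std::vector<ModelNJT*> threads_;
 public:
  void add(ModelNJT* t) {
    std::lock_guard<std::mutex> guard(lock_);
    threads_.push_back(t);
  }
  void remove(ModelNJT* t) {
    std::lock_guard<std::mutex> guard(lock_);
    threads_.erase(std::remove(threads_.begin(), threads_.end(), t),
                   threads_.end());
  }
  std::size_t size() {
    std::lock_guard<std::mutex> guard(lock_);
    return threads_.size();
  }
};

// Hypothetical thread object: registered in the constructor, de-registered at
// the end of call_run() instead of in the destructor.
class ModelNJT {
  ModelRegistry& registry_;
 public:
  explicit ModelNJT(ModelRegistry& r) : registry_(r) { registry_.add(this); }
  ~ModelNJT() {}  // deliberately does not touch the registry
  void run() {}   // the thread's actual work would go here
  void call_run() {
    run();
    registry_.remove(this);  // leave the list before the native thread exits
  }
};

// Runs one model thread to completion and reports how many entries remain.
std::size_t demo_registry_size_after_run() {
  ModelRegistry registry;
  ModelNJT njt(registry);
  std::thread native([&njt] { njt.call_run(); });
  native.join();
  return registry.size();
}
```

Because removal happens inside call_run(), an iterator over the registry cannot encounter the object after its native thread has terminated, even though the object itself is never destructed.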
> Testing: tiers 1-3
> 
> I should point out that I've been unable to reproduce this failure 
> locally, even after thousands of runs. I'm hoping Zhengyu can test this 
> in the conditions reported in JDK-8213434.
> 
> Thanks,
> David