RFC: more robust handling of terminated but still attached threads

Wed Jul 4 15:15:38 UTC 2018

On Tue, Jul 3, 2018 at 11:21 AM, David Holmes <david.holmes at oracle.com> wrote:
> Bug: https://bugs.openjdk.java.net/browse/JDK-8205878
>
> We hit asserts or trigger SEGVs when we try to operate on a native thread ID
> for a JNI-attached thread that has actually terminated but which did not
> detach first. It still appears in the threadsList and we try to process it
> during DumpOnExit (but there are probably other operations that could run
> into this in the general case).
>
> Fixing the tests is easy. But the more general question is how to make the
> VM code more robust in the face of this situation.
>
> At the lowest level we can watch for ESRCH from pthread_* functions and try
> to program in alternate logic that gives some "result" for that thread.
>
> At a higher level we may be able to heuristically guess that the native
> thread has terminated and so skip it in ALL_JAVA_THREADS and similar
> constructs.
> For example pthread_kill(t,0) can heuristically check if 't' is not alive as
> it may return ESRCH. But of course if t terminated then it is entirely
> possible that the pthread_t value for it has been reused. And if t is not
> going to detach we could be racing with its termination anyway - so the
> heuristic may pass and we still hit a low-level assert or SEGV.
>
> What do people think? Do we try to deal with this at the bottom, or at the
> top, or all the way through? (There's obviously a diminishing return on
> effort versus benefit here.)

I think handling ESRCH in a couple of pthread APIs at the bottom and
having a mechanism to quietly shoo away Thread objects for dead
threads would cover most cases. I would also do this only for threads
attached from the outside.
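
As a rough sketch of what "handling ESRCH at the bottom" could look
like (this is not HotSpot code; query_status, classify_rc and
query_priority are made-up names, and pthread_getschedparam just stands
in for whichever pthread call is hitting the assert):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: lets callers tell "thread is gone" apart
 * from a real error instead of asserting on any non-zero return. */
typedef enum { QUERY_OK, QUERY_THREAD_GONE, QUERY_ERROR } query_status;

/* Map a pthread_* return code to a status. ESRCH is tolerated only for
 * threads that attached from the outside; for VM-created threads it is
 * still treated as an error. */
query_status classify_rc(int rc, bool attached_externally)
{
    if (rc == 0) {
        return QUERY_OK;
    }
    if (rc == ESRCH && attached_externally) {
        return QUERY_THREAD_GONE;
    }
    return QUERY_ERROR;
}

/* Example low-level query. On QUERY_OK the priority is stored in
 * *priority_out; otherwise *priority_out is left untouched. */
query_status query_priority(pthread_t t, bool attached_externally,
                            int *priority_out)
{
    struct sched_param param;
    int policy;

    assert(priority_out != NULL);

    query_status st =
        classify_rc(pthread_getschedparam(t, &policy, &param),
                    attached_externally);
    if (st == QUERY_OK) {
        *priority_out = param.sched_priority;
    }
    return st;
}
```

A caller that gets QUERY_THREAD_GONE would then skip the thread (and
could flag its Thread object for removal) rather than crash.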

I can see that reused pthread IDs of dead JNI threads could be a
problem, but I do not see a cheap solution for that. And this is a JNI
coding error, right?

If we really are worried about this kind of problem, we can have a
watcher task periodically probe the threads in the threads list to
check whether their pthread_t values result in ESRCH. The shorter the
probing period, the smaller the window in which a dead thread's ID can
be reused before we notice. The period could even be adjustable, for
analysis purposes (if you suspect this kind of error, decrease the
probing period).
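
One pass of such a watcher might look roughly like the following. This
is only a sketch under assumptions: thread_entry, probe_fn and
scan_once are invented names, the flat array stands in for the real
threads list, and the periodic loop with its adjustable sleep is left
out. Note also that POSIX leaves the use of a thread ID after the end
of its lifetime undefined, so pthread_kill(t, 0) returning ESRCH is a
best-effort hint and not a guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry in the VM's threads list. */
typedef struct {
    pthread_t tid;
    bool attached_externally; /* attached via JNI, not created by the VM */
    bool presumed_dead;       /* set once a probe has reported ESRCH */
} thread_entry;

/* The probe is passed in so it can be swapped out for testing. */
typedef int (*probe_fn)(pthread_t);

/* Real probe: signal 0 performs error checking only, nothing is sent. */
int probe_with_pthread_kill(pthread_t t)
{
    return pthread_kill(t, 0);
}

/* Test stand-in that pretends every thread has terminated. */
int probe_always_gone(pthread_t t)
{
    (void)t;
    return ESRCH;
}

/* One watcher pass: mark externally attached threads whose probe
 * returns ESRCH. Returns how many entries were newly marked. */
size_t scan_once(thread_entry *list, size_t n, probe_fn probe)
{
    assert(list != NULL || n == 0);
    assert(probe != NULL);

    size_t newly_dead = 0;
    for (size_t i = 0; i < n; i++) {
        thread_entry *e = &list[i];
        if (!e->attached_externally || e->presumed_dead) {
            continue; /* only probe live-looking outside threads */
        }
        if (probe(e->tid) == ESRCH) {
            e->presumed_dead = true;
            newly_dead++;
        }
    }
    return newly_dead;
}
```

Entries marked presumed_dead are the ones the VM would then quietly
drop or skip in later operations.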

But I would only do this if really necessary.

Just my 5 cents.

..Thomas

>
> Thanks,
> David