(S) RFR: 8159461: bigapps/Kitchensink/stressExitCode hits assert: Must be VMThread or JavaThread

Mon Aug 8 23:57:35 UTC 2016

Hi Dan,

Thanks for the review.

On 9/08/2016 2:07 AM, Daniel D. Daugherty wrote:
> On 8/4/16 8:28 PM, David Holmes wrote:
>> Hi Volker,
>>
>> Thanks for looking at this.
>>
>> On 5/08/2016 1:48 AM, Volker Simonis wrote:
>>> Hi David,
>>>
>>> thanks for doing this change on all platforms.
>>> The fix looks good. Maybe you can just extend the following comment with
>>> something like:
>>>
>>>  //  Note that the SR_lock plays no role in this suspend/resume
>>> protocol.
>>>  //  It is only used in SR_handler as a thread termination indicator if
>>> NULL.
>>
>> Darn this code is confusing - too many "SR"'s :( I have added
>>
>> //  Note that the SR_lock plays no role in this suspend/resume protocol,
>> //  but is checked for NULL in SR_handler as a thread termination
>> indicator.
>>
>> Updated webrev:
>>
>> http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/
>
> src/share/vm/runtime/thread.cpp
>     L380:   _SR_lock = NULL;
>         I was expecting the _SR_lock to be freed and NULL'ed earlier
>         based on the discussion in the bug report. Since the crashing
>         assert() happens in a race between the JavaThread destructor
>         the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
>         field to be dealt with as early as possible in the Thread
>         destructor (or even earlier; see my last comment).

I will respond after that comment.

>  src/os/linux/vm/os_linux.cpp
>     L4010:   // mask is changed as part of thread termination. Check the
> current thread
>         grammar?: "Check the current" -> "Check that the current"

Will change.

>     L4015:   if (thread->SR_lock() == NULL)
>     L4016:     return;
>         style nit: multi-line if-statements require '{' and '}'
>         Please add the braces or make this a single line if-statement.
>         I would prefer the braces. :-)

Will fix.

>         Isn't there still a window between the completion of the
>         JavaThread destructor and where the Thread destructor sets
>         _SR_lock = NULL?

See below.

>     L4020:   OSThread* osthread = thread->osthread();
>         Not your bug. This code assumes that osthread != NULL.
>         Maybe it needs to be more robust.

Depends what kind of impossibilities we want to guard against. :) There 
should be no possible way a signal can be sent to a thread that doesn't 
even have a osThread as it means we never successfully started/attached 
the thread.

> src/os/aix/vm/os_aix.cpp
>     L2731:   if (thread->SR_lock() == NULL)
>     L2732:     return;
>         Same style nit.
>
>         Same race.
>
>     L2736:   OSThread* osthread = thread->osthread();
>         Same robustness comment.
>
> src/os/bsd/vm/os_bsd.cpp
>     L2759:   if (thread->SR_lock() == NULL)
>     L2760:     return;
>         Same style nit.
>
>         Same race.
>
>     L2764:   OSThread* osthread = thread->osthread();
>         Same robustness comment.
>
> It has been a very long time since I've dealt with races in the
> suspend/resume code so I'm probably very rusty with this code.
> If the _SR_lock is only used by the JavaThread suspend/resume
> protocol, then we could consider free'ing and NULL'ing the field
> in the JavaThread destructor (as the last piece of work).
>
> That should eliminate the race that was being observed by the
> SR_handler() in this bug. It will open a very small race where
> is_Java_thread() can return true, the _SR_lock field is !NULL,
> but the _SR_lock has been deleted.

Given that it should have been impossible to get into the SR_handler in 
the first place from this code I was trying to minimize the disruption 
to the existing logic. Moving the delete/NULLing to just before the call 
to os::free_thread() fixes the crashes that had been observed. I was not 
trying to make the entire destruction sequence safe wrt. the SR_handler.

My major concern with deleting the SR_lock much earlier is the potential 
race condition that I have previously outlined in:

https://bugs.openjdk.java.net/browse/JDK-8152849

where there is no protection against a target thread terminating. The 
sooner it terminates and deletes the SR_lock the more likely we may 
attempt to lock a deleted lock!

Thanks,
David

> Dan
>
>
>>
>> This also reminded me to follow up on why the Solaris SR_handler is
>> different and I found it is not actually installed as a direct signal
>> handler, but is called from the real signal handler if dealing with a
>> JavaThread or the VMThread. Consequently the Solaris version of the
>> SR_handler can not encounter this specific bug and so I have reverted
>> the changes to os_solaris.cpp
>>
>> Thanks,
>> David
>>
>>
>>> Regards,
>>> Volker
>>>
>>> On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>>     webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
>>>     <http://cr.openjdk.java.net/~dholmes/8159461/webrev/>
>>>
>>>     bug: https://bugs.openjdk.java.net/browse/JDK-8159461
>>>     <https://bugs.openjdk.java.net/browse/JDK-8159461>
>>>
>>>     The suspend/resume signal (SR_signum) is never sent to a thread once
>>>     it has started to terminate. On one platform (SuSE 12) we have seen
>>>     what appears to be a "stuck" signal, which is only delivered when
>>>     the terminating thread restores its original signal mask (as if
>>>     pthread_sigmask makes the system realize there is a pending signal -
>>>     we already check the signal was not blocked). At this point in the
>>>     thread termination we have freed the osthread, so the the SR_handler
>>>     would access deallocated memory. In debug builds we first hit an
>>>     assertion that the current thread is a JavaThread or the VMThread -
>>>     that assertion fails, even though it is a JavaThread, because we
>>>     have already executed the ~JavaThread destructor and inside the
>>>     ~Thread destructor we are a plain Thread not a JavaThread.
>>>
>>>     The fix was to make a small adjustment to the thread termination
>>>     process so that we delete the SR_lock before calling
>>>     os::free_thread(). In the SR_handler() we can then use a NULL check
>>>     of SR_lock() to indicate the thread has terminated and we return.
>>>
>>>     While only seen on Linux I took the opportunity to apply the fix on
>>>     all platforms and also cleaned up the code where we were using
>>>     Thread::current() unsafely in a signal-handling context.
>>>
>>>     Testing: regular tier 1 (JPRT)
>>>              Kitchensink (in progress)
>>>
>>>     As we can't readily reproduce the problem I tested this by having a
>>>     terminating thread raise SR_signum directly from within the ~Thread
>>>     destructor.
>>>
>>>     Thanks,
>>>     David
>>>
>>>
>