(S) RFR: 8159461: bigapps/Kitchensink/stressExitCode hits assert: Must be VMThread or JavaThread

Fri Aug 12 01:25:39 UTC 2016

David,

Sorry I forgot to respond before I left for Santa Fe, NM...
More below...

On 8/8/16 5:57 PM, David Holmes wrote:
> Hi Dan,
>
> Thanks for the review.
>
> On 9/08/2016 2:07 AM, Daniel D. Daugherty wrote:
>> On 8/4/16 8:28 PM, David Holmes wrote:
>>> Hi Volker,
>>>
>>> Thanks for looking at this.
>>>
>>> On 5/08/2016 1:48 AM, Volker Simonis wrote:
>>>> Hi David,
>>>>
>>>> thanks for doing this change on all platforms.
>>>> The fix looks good. Maybe you can just extend the following comment 
>>>> with
>>>> something like:
>>>>
>>>>  //  Note that the SR_lock plays no role in this suspend/resume
>>>> protocol.
>>>>  //  It is only used in SR_handler as a thread termination 
>>>> indicator if
>>>> NULL.
>>>
>>> Darn this code is confusing - too many "SR"'s :( I have added
>>>
>>> //  Note that the SR_lock plays no role in this suspend/resume 
>>> protocol,
>>> //  but is checked for NULL in SR_handler as a thread termination
>>> indicator.
>>>
>>> Updated webrev:
>>>
>>> http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/
>>
>> src/share/vm/runtime/thread.cpp
>>     L380:   _SR_lock = NULL;
>>         I was expecting the _SR_lock to be freed and NULL'ed earlier
>>         based on the discussion in the bug report. Since the crashing
>>         assert() happens in a race between the JavaThread destructor
>>         the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
>>         field to be dealt with as early as possible in the Thread
>>         destructor (or even earlier; see my last comment).
>
> I will respond after that comment.
>
>>  src/os/linux/vm/os_linux.cpp
>>     L4010:   // mask is changed as part of thread termination. Check the
>> current thread
>>         grammar?: "Check the current" -> "Check that the current"
>
> Will change.
>
>>     L4015:   if (thread->SR_lock() == NULL)
>>     L4016:     return;
>>         style nit: multi-line if-statements require '{' and '}'
>>         Please add the braces or make this a single line if-statement.
>>         I would prefer the braces. :-)
>
> Will fix.
>
>>         Isn't there still a window between the completion of the
>>         JavaThread destructor and where the Thread destructor sets
>>         _SR_lock = NULL?
>
> See below.
>
>>     L4020:   OSThread* osthread = thread->osthread();
>>         Not your bug. This code assumes that osthread != NULL.
>>         Maybe it needs to be more robust.
>
> Depends what kind of impossibilities we want to guard against. :) 
> There should be no possible way a signal can be sent to a thread that 
> doesn't even have a osThread as it means we never successfully 
> started/attached the thread.

That's a really good point. I'm good with what's there
for osthread.

>
>> src/os/aix/vm/os_aix.cpp
>>     L2731:   if (thread->SR_lock() == NULL)
>>     L2732:     return;
>>         Same style nit.
>>
>>         Same race.
>>
>>     L2736:   OSThread* osthread = thread->osthread();
>>         Same robustness comment.
>>
>> src/os/bsd/vm/os_bsd.cpp
>>     L2759:   if (thread->SR_lock() == NULL)
>>     L2760:     return;
>>         Same style nit.
>>
>>         Same race.
>>
>>     L2764:   OSThread* osthread = thread->osthread();
>>         Same robustness comment.
>>
>> It has been a very long time since I've dealt with races in the
>> suspend/resume code so I'm probably very rusty with this code.
>> If the _SR_lock is only used by the JavaThread suspend/resume
>> protocol, then we could consider free'ing and NULL'ing the field
>> in the JavaThread destructor (as the last piece of work).
>>
>> That should eliminate the race that was being observed by the
>> SR_handler() in this bug. It will open a very small race where
>> is_Java_thread() can return true, the _SR_lock field is !NULL,
>> but the _SR_lock has been deleted.
>
> Given that it should have been impossible to get into the SR_handler 
> in the first place from this code I was trying to minimize the 
> disruption to the existing logic. Moving the delete/NULLing to just 
> before the call to os::free_thread() fixes the crashes that had been 
> observed. I was not trying to make the entire destruction sequence 
> safe wrt. the SR_handler.

I suspect it is the combination of 1) NULLing the _SR_lock as a sentinel and
2) doing that before the more expensive os::free_thread() call that results
in the change in behavior.

> My major concern with deleting the SR_lock much earlier is the 
> potential race condition that I have previously outlined in:
>
> https://bugs.openjdk.java.net/browse/JDK-8152849
>
> where there is no protection against a target thread terminating. The 
> sooner it terminates and deletes the SR_lock the more likely we may 
> attempt to lock a deleted lock!

Ah yes... thanks for the reminder. We have seen a few of those in the
past where we're racing to grab the _SR_lock and Elvis is trying to
leave the building...

I'm good with just the minor changes you agreed to make above. I don't
think I need to see a new webrev for the above edits.

Thumbs up!

Dan

>
> Thanks,
> David
>
>> Dan
>>
>>
>>>
>>> This also reminded me to follow up on why the Solaris SR_handler is
>>> different and I found it is not actually installed as a direct signal
>>> handler, but is called from the real signal handler if dealing with a
>>> JavaThread or the VMThread. Consequently the Solaris version of the
>>> SR_handler can not encounter this specific bug and so I have reverted
>>> the changes to os_solaris.cpp
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>> Regards,
>>>> Volker
>>>>
>>>> On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.holmes at oracle.com
>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>
>>>>     webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
>>>> <http://cr.openjdk.java.net/~dholmes/8159461/webrev/>
>>>>
>>>>     bug: https://bugs.openjdk.java.net/browse/JDK-8159461
>>>>     <https://bugs.openjdk.java.net/browse/JDK-8159461>
>>>>
>>>>     The suspend/resume signal (SR_signum) is never sent to a thread 
>>>> once
>>>>     it has started to terminate. On one platform (SuSE 12) we have 
>>>> seen
>>>>     what appears to be a "stuck" signal, which is only delivered when
>>>>     the terminating thread restores its original signal mask (as if
>>>>     pthread_sigmask makes the system realize there is a pending 
>>>> signal -
>>>>     we already check the signal was not blocked). At this point in the
>>>>     thread termination we have freed the osthread, so the the 
>>>> SR_handler
>>>>     would access deallocated memory. In debug builds we first hit an
>>>>     assertion that the current thread is a JavaThread or the 
>>>> VMThread -
>>>>     that assertion fails, even though it is a JavaThread, because we
>>>>     have already executed the ~JavaThread destructor and inside the
>>>>     ~Thread destructor we are a plain Thread not a JavaThread.
>>>>
>>>>     The fix was to make a small adjustment to the thread termination
>>>>     process so that we delete the SR_lock before calling
>>>>     os::free_thread(). In the SR_handler() we can then use a NULL 
>>>> check
>>>>     of SR_lock() to indicate the thread has terminated and we return.
>>>>
>>>>     While only seen on Linux I took the opportunity to apply the 
>>>> fix on
>>>>     all platforms and also cleaned up the code where we were using
>>>>     Thread::current() unsafely in a signal-handling context.
>>>>
>>>>     Testing: regular tier 1 (JPRT)
>>>>              Kitchensink (in progress)
>>>>
>>>>     As we can't readily reproduce the problem I tested this by 
>>>> having a
>>>>     terminating thread raise SR_signum directly from within the 
>>>> ~Thread
>>>>     destructor.
>>>>
>>>>     Thanks,
>>>>     David
>>>>
>>>>
>>