(S) RFR: 8159461: bigapps/Kitchensink/stressExitCode hits assert: Must be VMThread or JavaThread

Fri Aug 12 04:06:18 UTC 2016

On 12/08/2016 11:25 AM, Daniel D. Daugherty wrote:
> David,
>
> Sorry I forgot to respond before I left for Santa Fe, NM...
> More below...

No problem.

>
> On 8/8/16 5:57 PM, David Holmes wrote:
>> Hi Dan,
>>
>> Thanks for the review.
>>
>> On 9/08/2016 2:07 AM, Daniel D. Daugherty wrote:
>>> On 8/4/16 8:28 PM, David Holmes wrote:
>>>> Hi Volker,
>>>>
>>>> Thanks for looking at this.
>>>>
>>>> On 5/08/2016 1:48 AM, Volker Simonis wrote:
>>>>> Hi David,
>>>>>
>>>>> thanks for doing this change on all platforms.
>>>>> The fix looks good. Maybe you can just extend the following comment
>>>>> with
>>>>> something like:
>>>>>
>>>>>  //  Note that the SR_lock plays no role in this suspend/resume
>>>>> protocol.
>>>>>  //  It is only used in SR_handler as a thread termination
>>>>> indicator if
>>>>> NULL.
>>>>
>>>> Darn this code is confusing - too many "SR"'s :( I have added
>>>>
>>>> //  Note that the SR_lock plays no role in this suspend/resume
>>>> protocol,
>>>> //  but is checked for NULL in SR_handler as a thread termination
>>>> indicator.
>>>>
>>>> Updated webrev:
>>>>
>>>> http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/
>>>
>>> src/share/vm/runtime/thread.cpp
>>>     L380:   _SR_lock = NULL;
>>>         I was expecting the _SR_lock to be freed and NULL'ed earlier
>>>         based on the discussion in the bug report. Since the crashing
>>>         assert() happens in a race between the JavaThread destructor
>>>         the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
>>>         field to be dealt with as early as possible in the Thread
>>>         destructor (or even earlier; see my last comment).
>>
>> I will respond after that comment.
>>
>>>  src/os/linux/vm/os_linux.cpp
>>>     L4010:   // mask is changed as part of thread termination. Check the
>>> current thread
>>>         grammar?: "Check the current" -> "Check that the current"
>>
>> Will change.
>>
>>>     L4015:   if (thread->SR_lock() == NULL)
>>>     L4016:     return;
>>>         style nit: multi-line if-statements require '{' and '}'
>>>         Please add the braces or make this a single line if-statement.
>>>         I would prefer the braces. :-)
>>
>> Will fix.
>>
>>>         Isn't there still a window between the completion of the
>>>         JavaThread destructor and where the Thread destructor sets
>>>         _SR_lock = NULL?
>>
>> See below.
>>
>>>     L4020:   OSThread* osthread = thread->osthread();
>>>         Not your bug. This code assumes that osthread != NULL.
>>>         Maybe it needs to be more robust.
>>
>> Depends what kind of impossibilities we want to guard against. :)
>> There should be no possible way a signal can be sent to a thread that
>> doesn't even have a osThread as it means we never successfully
>> started/attached the thread.
>
> That's a really good point. I'm good with what's there
> for osthread.
>
>
>>
>>> src/os/aix/vm/os_aix.cpp
>>>     L2731:   if (thread->SR_lock() == NULL)
>>>     L2732:     return;
>>>         Same style nit.
>>>
>>>         Same race.
>>>
>>>     L2736:   OSThread* osthread = thread->osthread();
>>>         Same robustness comment.
>>>
>>> src/os/bsd/vm/os_bsd.cpp
>>>     L2759:   if (thread->SR_lock() == NULL)
>>>     L2760:     return;
>>>         Same style nit.
>>>
>>>         Same race.
>>>
>>>     L2764:   OSThread* osthread = thread->osthread();
>>>         Same robustness comment.
>>>
>>> It has been a very long time since I've dealt with races in the
>>> suspend/resume code so I'm probably very rusty with this code.
>>> If the _SR_lock is only used by the JavaThread suspend/resume
>>> protocol, then we could consider free'ing and NULL'ing the field
>>> in the JavaThread destructor (as the last piece of work).
>>>
>>> That should eliminate the race that was being observed by the
>>> SR_handler() in this bug. It will open a very small race where
>>> is_Java_thread() can return true, the _SR_lock field is !NULL,
>>> but the _SR_lock has been deleted.
>>
>> Given that it should have been impossible to get into the SR_handler
>> in the first place from this code I was trying to minimize the
>> disruption to the existing logic. Moving the delete/NULLing to just
>> before the call to os::free_thread() fixes the crashes that had been
>> observed. I was not trying to make the entire destruction sequence
>> safe wrt. the SR_handler.
>
> I suspect it is the combination of 1) NULLing the _SR_lock as a sentinel
> and
> 2) doing that before the more expensive os::free_thread() call that results
> in the change in behavior.

Right. The call to pthread_sigmask in os::free_thread is what appeared 
to un-jam the pending signal; so if we bail out before os::free_thread 
we avoid that.

>
>> My major concern with deleting the SR_lock much earlier is the
>> potential race condition that I have previously outlined in:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8152849
>>
>> where there is no protection against a target thread terminating. The
>> sooner it terminates and deletes the SR_lock the more likely we may
>> attempt to lock a deleted lock!
>
> Ah yes... thanks for the reminder. We have seen a few of those in the
> past where we're racing to grab the _SR_lock and Elvis is trying to
> leave the building...
>
> I'm good with just the minor changes you agreed to make above. I don't
> think I need to see a new webrev for the above edits.
>
> Thumbs up!

Thanks!

David
-----

> Dan
>
>
>
>>
>> Thanks,
>> David
>>
>>> Dan
>>>
>>>
>>>>
>>>> This also reminded me to follow up on why the Solaris SR_handler is
>>>> different and I found it is not actually installed as a direct signal
>>>> handler, but is called from the real signal handler if dealing with a
>>>> JavaThread or the VMThread. Consequently the Solaris version of the
>>>> SR_handler can not encounter this specific bug and so I have reverted
>>>> the changes to os_solaris.cpp
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>>> Regards,
>>>>> Volker
>>>>>
>>>>> On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.holmes at oracle.com
>>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>>
>>>>>     webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
>>>>> <http://cr.openjdk.java.net/~dholmes/8159461/webrev/>
>>>>>
>>>>>     bug: https://bugs.openjdk.java.net/browse/JDK-8159461
>>>>>     <https://bugs.openjdk.java.net/browse/JDK-8159461>
>>>>>
>>>>>     The suspend/resume signal (SR_signum) is never sent to a thread
>>>>> once
>>>>>     it has started to terminate. On one platform (SuSE 12) we have
>>>>> seen
>>>>>     what appears to be a "stuck" signal, which is only delivered when
>>>>>     the terminating thread restores its original signal mask (as if
>>>>>     pthread_sigmask makes the system realize there is a pending
>>>>> signal -
>>>>>     we already check the signal was not blocked). At this point in the
>>>>>     thread termination we have freed the osthread, so the the
>>>>> SR_handler
>>>>>     would access deallocated memory. In debug builds we first hit an
>>>>>     assertion that the current thread is a JavaThread or the
>>>>> VMThread -
>>>>>     that assertion fails, even though it is a JavaThread, because we
>>>>>     have already executed the ~JavaThread destructor and inside the
>>>>>     ~Thread destructor we are a plain Thread not a JavaThread.
>>>>>
>>>>>     The fix was to make a small adjustment to the thread termination
>>>>>     process so that we delete the SR_lock before calling
>>>>>     os::free_thread(). In the SR_handler() we can then use a NULL
>>>>> check
>>>>>     of SR_lock() to indicate the thread has terminated and we return.
>>>>>
>>>>>     While only seen on Linux I took the opportunity to apply the
>>>>> fix on
>>>>>     all platforms and also cleaned up the code where we were using
>>>>>     Thread::current() unsafely in a signal-handling context.
>>>>>
>>>>>     Testing: regular tier 1 (JPRT)
>>>>>              Kitchensink (in progress)
>>>>>
>>>>>     As we can't readily reproduce the problem I tested this by
>>>>> having a
>>>>>     terminating thread raise SR_signum directly from within the
>>>>> ~Thread
>>>>>     destructor.
>>>>>
>>>>>     Thanks,
>>>>>     David
>>>>>
>>>>>
>>>
>