(S) RFR: 8159461: bigapps/Kitchensink/stressExitCode hits assert: Must be VMThread or JavaThread

Mon Aug 8 16:07:46 UTC 2016

On 8/4/16 8:28 PM, David Holmes wrote:
> Hi Volker,
>
> Thanks for looking at this.
>
> On 5/08/2016 1:48 AM, Volker Simonis wrote:
>> Hi David,
>>
>> thanks for doing this change on all platforms.
>> The fix looks good. Maybe you can just extend the following comment with
>> something like:
>>
>>  //  Note that the SR_lock plays no role in this suspend/resume 
>> protocol.
>>  //  It is only used in SR_handler as a thread termination indicator if
>> NULL.
>
> Darn this code is confusing - too many "SR"'s :( I have added
>
> //  Note that the SR_lock plays no role in this suspend/resume protocol,
> //  but is checked for NULL in SR_handler as a thread termination 
> indicator.
>
> Updated webrev:
>
> http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/

src/share/vm/runtime/thread.cpp
     L380:   _SR_lock = NULL;
         I was expecting the _SR_lock to be freed and NULL'ed earlier
         based on the discussion in the bug report. Since the crashing
         assert() happens in a race between the JavaThread destructor
         the NULL'ing of the _SR_lock field, I was expecting the _SR_lock
         field to be dealt with as early as possible in the Thread
         destructor (or even earlier; see my last comment).

  src/os/linux/vm/os_linux.cpp
     L4010:   // mask is changed as part of thread termination. Check 
the current thread
         grammar?: "Check the current" -> "Check that the current"

     L4015:   if (thread->SR_lock() == NULL)
     L4016:     return;
         style nit: multi-line if-statements require '{' and '}'
         Please add the braces or make this a single line if-statement.
         I would prefer the braces. :-)

         Isn't there still a window between the completion of the
         JavaThread destructor and where the Thread destructor sets
         _SR_lock = NULL?

     L4020:   OSThread* osthread = thread->osthread();
         Not your bug. This code assumes that osthread != NULL.
         Maybe it needs to be more robust.

src/os/aix/vm/os_aix.cpp
     L2731:   if (thread->SR_lock() == NULL)
     L2732:     return;
         Same style nit.

         Same race.

     L2736:   OSThread* osthread = thread->osthread();
         Same robustness comment.

src/os/bsd/vm/os_bsd.cpp
     L2759:   if (thread->SR_lock() == NULL)
     L2760:     return;
         Same style nit.

         Same race.

     L2764:   OSThread* osthread = thread->osthread();
         Same robustness comment.

It has been a very long time since I've dealt with races in the
suspend/resume code so I'm probably very rusty with this code.
If the _SR_lock is only used by the JavaThread suspend/resume
protocol, then we could consider free'ing and NULL'ing the field
in the JavaThread destructor (as the last piece of work).

That should eliminate the race that was being observed by the
SR_handler() in this bug. It will open a very small race where
is_Java_thread() can return true, the _SR_lock field is !NULL,
but the _SR_lock has been deleted.

Dan

>
> This also reminded me to follow up on why the Solaris SR_handler is 
> different and I found it is not actually installed as a direct signal 
> handler, but is called from the real signal handler if dealing with a 
> JavaThread or the VMThread. Consequently the Solaris version of the 
> SR_handler can not encounter this specific bug and so I have reverted 
> the changes to os_solaris.cpp
>
> Thanks,
> David
>
>
>> Regards,
>> Volker
>>
>> On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>>     webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
>>     <http://cr.openjdk.java.net/~dholmes/8159461/webrev/>
>>
>>     bug: https://bugs.openjdk.java.net/browse/JDK-8159461
>>     <https://bugs.openjdk.java.net/browse/JDK-8159461>
>>
>>     The suspend/resume signal (SR_signum) is never sent to a thread once
>>     it has started to terminate. On one platform (SuSE 12) we have seen
>>     what appears to be a "stuck" signal, which is only delivered when
>>     the terminating thread restores its original signal mask (as if
>>     pthread_sigmask makes the system realize there is a pending signal -
>>     we already check the signal was not blocked). At this point in the
>>     thread termination we have freed the osthread, so the the SR_handler
>>     would access deallocated memory. In debug builds we first hit an
>>     assertion that the current thread is a JavaThread or the VMThread -
>>     that assertion fails, even though it is a JavaThread, because we
>>     have already executed the ~JavaThread destructor and inside the
>>     ~Thread destructor we are a plain Thread not a JavaThread.
>>
>>     The fix was to make a small adjustment to the thread termination
>>     process so that we delete the SR_lock before calling
>>     os::free_thread(). In the SR_handler() we can then use a NULL check
>>     of SR_lock() to indicate the thread has terminated and we return.
>>
>>     While only seen on Linux I took the opportunity to apply the fix on
>>     all platforms and also cleaned up the code where we were using
>>     Thread::current() unsafely in a signal-handling context.
>>
>>     Testing: regular tier 1 (JPRT)
>>              Kitchensink (in progress)
>>
>>     As we can't readily reproduce the problem I tested this by having a
>>     terminating thread raise SR_signum directly from within the ~Thread
>>     destructor.
>>
>>     Thanks,
>>     David
>>
>>