(S) RFR: 8159461: bigapps/Kitchensink/stressExitCode hits assert: Must be VMThread or JavaThread
David Holmes
david.holmes at oracle.com
Tue Aug 9 00:33:08 UTC 2016
Thanks Volker!
David
On 8/08/2016 11:35 PM, Volker Simonis wrote:
> Hi David,
>
> looks good now.
>
> Thanks,
> Volker
>
>
> On Fri, Aug 5, 2016 at 4:28 AM, David Holmes <david.holmes at oracle.com> wrote:
>> Hi Volker,
>>
>> Thanks for looking at this.
>>
>> On 5/08/2016 1:48 AM, Volker Simonis wrote:
>>>
>>> Hi David,
>>>
>>> thanks for doing this change on all platforms.
>>> The fix looks good. Maybe you can just extend the following comment with
>>> something like:
>>>
>>> // Note that the SR_lock plays no role in this suspend/resume protocol.
>>> // It is only used in SR_handler as a thread termination indicator if
>>> NULL.
>>
>>
>> Darn this code is confusing - too many "SR"'s :( I have added
>>
>> // Note that the SR_lock plays no role in this suspend/resume protocol,
>> // but is checked for NULL in SR_handler as a thread termination indicator.
>>
>> Updated webrev:
>>
>> http://cr.openjdk.java.net/~dholmes/8159461/webrev.v2/
>>
>> This also reminded me to follow up on why the Solaris SR_handler is
>> different and I found it is not actually installed as a direct signal
>> handler, but is called from the real signal handler if dealing with a
>> JavaThread or the VMThread. Consequently the Solaris version of the
>> SR_handler can not encounter this specific bug and so I have reverted the
>> changes to os_solaris.cpp
>>
>> Thanks,
>> David
>>
>>
>>> Regards,
>>> Volker
>>>
>>> On Wed, Aug 3, 2016 at 3:13 AM, David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>> webrev: http://cr.openjdk.java.net/~dholmes/8159461/webrev/
>>> <http://cr.openjdk.java.net/~dholmes/8159461/webrev/>
>>>
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8159461
>>> <https://bugs.openjdk.java.net/browse/JDK-8159461>
>>>
>>> The suspend/resume signal (SR_signum) is never sent to a thread once
>>> it has started to terminate. On one platform (SuSE 12) we have seen
>>> what appears to be a "stuck" signal, which is only delivered when
>>> the terminating thread restores its original signal mask (as if
>>> pthread_sigmask makes the system realize there is a pending signal -
>>> we already check the signal was not blocked). At this point in the
>>> thread termination we have freed the osthread, so the the SR_handler
>>> would access deallocated memory. In debug builds we first hit an
>>> assertion that the current thread is a JavaThread or the VMThread -
>>> that assertion fails, even though it is a JavaThread, because we
>>> have already executed the ~JavaThread destructor and inside the
>>> ~Thread destructor we are a plain Thread not a JavaThread.
>>>
>>> The fix was to make a small adjustment to the thread termination
>>> process so that we delete the SR_lock before calling
>>> os::free_thread(). In the SR_handler() we can then use a NULL check
>>> of SR_lock() to indicate the thread has terminated and we return.
>>>
>>> While only seen on Linux I took the opportunity to apply the fix on
>>> all platforms and also cleaned up the code where we were using
>>> Thread::current() unsafely in a signal-handling context.
>>>
>>> Testing: regular tier 1 (JPRT)
>>> Kitchensink (in progress)
>>>
>>> As we can't readily reproduce the problem I tested this by having a
>>> terminating thread raise SR_signum directly from within the ~Thread
>>> destructor.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>
More information about the ppc-aix-port-dev
mailing list