RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Thomas Stüfe thomas.stuefe at gmail.com
Thu Feb 5 07:50:47 UTC 2015


Thank you!

I did not know that, but will keep it in mind for future jtreg tests.

..Thomas

On Thu, Feb 5, 2015 at 3:12 AM, David Holmes <david.holmes at oracle.com>
wrote:

> Thomas, Gerard,
>
> I pushed the changes but didn't realize that hg import wouldn't hg add the
> new test file and so it got left behind. But I then realized that the test
> should be named/located differently as we don't structure the tests on bug
> number any more. So I will file a new bug and rename and relocate the test,
> and then push that:
>
> test/runtime/ErrorHandling/SecondaryErrorTest.java
>
> David
>
>
> On 4/02/2015 8:18 PM, David Holmes wrote:
>
>> On 4/02/2015 8:15 PM, Thomas Stüfe wrote:
>>
>>> Great, thank you!
>>>
>>> @David: Do you still need something from me to sponsor the change?
>>>
>>
>> No I think I have everything I need. Will push it in the morning.
>>
>> You'll be listed as Contributor as you don't have Author status for jdk9
>> project.
>>
>> Thanks,
>> David
>>
>>  Kind Regards, Thaoms
>>>
>>> On Tue, Feb 3, 2015 at 10:15 PM, Gerard Ziemski
>>> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>>>
>>>     Looks good! Thank you for a great fix!
>>>
>>>     cheers
>>>
>>>
>>>     On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
>>>
>>>>     Hi Gerard,
>>>>
>>>>     thank you for your feedback! See here my new change:
>>>>     http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.
>>>> 07/webrev/
>>>>
>>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8065895/webrev.07/webrev/
>>>> >
>>>>
>>>>
>>>>     On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski
>>>>     <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>>
>>>> wrote:
>>>>
>>>>         hi Thomas,
>>>>
>>>>         On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>>>>
>>>>>         On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski
>>>>>         <gerard.ziemski at oracle.com
>>>>>         <mailto:gerard.ziemski at oracle.com>> wrote:
>>>>>
>>>>>             hi Thomas,
>>>>>
>>>>>             #1 The following comment in vmError_bsd.cpp,
>>>>>             vmError_linux.cpp and vmError_solaris.cpp is now outdated
>>>>>             and should be updated from:
>>>>>
>>>>>             // handle all synchronous program error signals which may
>>>>>             happen during error
>>>>>             // reporting. They must be unblocked, caught, handled.
>>>>>             //
>>>>>             // Note that in hotspot signal handlers are installed
>>>>>             with a fully filled
>>>>>             // signal mask, i.e. upon delivery all signals - also
>>>>>             synchronous error signals
>>>>>             // - are blocked. For synchronous error signals this is
>>>>>             deadly - program will
>>>>>             // hang or be terminated immediately if secondary errors
>>>>>             happen during error
>>>>>             // reporting.
>>>>>
>>>>>             to the comment from vmError_aix.cpp:
>>>>>
>>>>>             // Handle all synchronous signals which may happen during
>>>>>             signal handling,
>>>>>             // not just SIGSEGV and SIGBUS.
>>>>>
>>>>>
>>>>>         I am not sure I understand: the problem is that signal
>>>>>         handlers are installed in the hotspot with sa_mask completely
>>>>>         set, so all signals are blocked. See e.g. os_linux.cpp, line
>>>>>         4304++ (os::Linux::set_signal_handler()). So the comment
>>>>>         appears to be correct.
>>>>>
>>>>         We continue to block the signals initially, but the comment,
>>>>         or at least its last sentence, does not apply anymore because
>>>>         of the crux of your fix - ie. the program will not hang
>>>>         because we immediately unblock the synchronous traps in the
>>>>         crash handler. So maybe we should make that clear in that
>>>> comment?
>>>>
>>>>
>>>>     Okay, I see that now. I modified the comments as you suggested
>>>>     earlier.
>>>>
>>>>          POSIX itself mentioned only the four, see documentation on
>>>>>         sigprocmask:
>>>>>
>>>>>         "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are
>>>>>         generated while they are blocked, the result is undefined,
>>>>>         unless the signal was generated by the /kill/()
>>>>>
>>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
>>>>> function,
>>>>>         the /sigqueue/()
>>>>>
>>>>> <http://pubs.opengroup.org/onlinepubs/009695399/
>>>>> functions/sigqueue.html>
>>>>> function,
>>>>>         or the /raise/()
>>>>>
>>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
>>>>> function."
>>>>>
>>>>>
>>>>>         I added SIGTRAP on AIX because it is synchronously delivered
>>>>>         too, but do not know any other signals. I am happy to add any
>>>>>         other signals you know of.
>>>>>
>>>>>          Nope, I do not know any more synchronous signals - just wanted
>>>>         to make sure we take a second look and collectively agree that
>>>>         this list is exhaustive.
>>>>
>>>>          I would like to test it at least twice, to see that signal
>>>>>         handling continues to work after secondary handler ran once.
>>>>>         It needs to correctly unset the signal mask for the
>>>>>         synchronous signals.
>>>>>
>>>>>         Executing it four times may be a bit excessive; if you
>>>>>         prefer, I can remove it for the out_done case.
>>>>>
>>>>>         Note that there is one benefit of repeatedly crashing in the
>>>>>         error handler, it is to stress the stack usage during error
>>>>>         reporting. But I admit that this test could be done much
>>>>> better.
>>>>>
>>>>>          It's a bit confusing at first when someone debugs this code
>>>>         and sees four of those secondary signals coming through, so
>>>>         maybe just a small comment here would suffice?
>>>>
>>>>
>>>>     I changed the coding to only fire for the log case. That should be
>>>>     less confusing.
>>>>
>>>>          vmerrors.sh is very minimal. The bug symptom was incomplete
>>>>>         or empty hs-err files, and you would not catch this with the
>>>>>         vmerrors.sh test, would you not? You also can reproduce this
>>>>>         bug only by using both -XX:ErrorHandlerTest and
>>>>>         -XX:TestCrashInErrorHandler together, so it does not quite
>>>>>         fit into what vmerrors.sh does.
>>>>>
>>>>>         At SAP we have a nice test suite which crashes the VM in a
>>>>>         number of ways and then checks whether the hs-err files are
>>>>>         complete (enough). I was planning to add a Jtreg test which
>>>>>         does something similar, but did not add it to this change
>>>>>         because I did not want to make the change larger than
>>>>> necessary.
>>>>>
>>>>         I would think that the added complexity of the test case,
>>>>         would in fact make it easier to review the test with the fix
>>>>         at the same time. Also, since the test in this case is
>>>>         feasible, isn't it recommended to have the fix committed along
>>>>         with its test?
>>>>
>>>>
>>>>     I added a small jtreg test. Note that the test cannot really
>>>>     demonstrate the bug without my fix, because it needs the new
>>>>     -XX:TestCrashInErrorHandler options which is part of the fix. But
>>>>     it will show regressions.
>>>>
>>>>         cheers
>>>>
>>>>
>>>>     Best Regards, Thomas
>>>>
>>>>
>>>
>>>


More information about the hotspot-runtime-dev mailing list