RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Feb 5 07:50:47 UTC 2015
Thank you!
I did not know that, but will keep it in mind for future jtreg tests.
..Thomas
On Thu, Feb 5, 2015 at 3:12 AM, David Holmes <david.holmes at oracle.com>
wrote:
> Thomas, Gerard,
>
> I pushed the changes but didn't realize that hg import wouldn't hg add the
> new test file and so it got left behind. But I then realized that the test
> should be named/located differently as we don't structure the tests on bug
> number any more. So I will file a new bug and rename and relocate the test,
> and then push that:
>
> test/runtime/ErrorHandling/SecondaryErrorTest.java
>
> David
>
>
> On 4/02/2015 8:18 PM, David Holmes wrote:
>
>> On 4/02/2015 8:15 PM, Thomas Stüfe wrote:
>>
>>> Great, thank you!
>>>
>>> @David: Do you still need something from me to sponsor the change?
>>>
>>
>> No I think I have everything I need. Will push it in the morning.
>>
>> You'll be listed as Contributor as you don't have Author status for jdk9
>> project.
>>
>> Thanks,
>> David
>>
>> Kind Regards, Thaoms
>>>
>>> On Tue, Feb 3, 2015 at 10:15 PM, Gerard Ziemski
>>> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>>>
>>> Looks good! Thank you for a great fix!
>>>
>>> cheers
>>>
>>>
>>> On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
>>>
>>>> Hi Gerard,
>>>>
>>>> thank you for your feedback! See here my new change:
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.
>>>> 07/webrev/
>>>>
>>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8065895/webrev.07/webrev/
>>>> >
>>>>
>>>>
>>>> On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski
>>>> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>>
>>>> wrote:
>>>>
>>>> hi Thomas,
>>>>
>>>> On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>>>>
>>>>> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski
>>>>> <gerard.ziemski at oracle.com
>>>>> <mailto:gerard.ziemski at oracle.com>> wrote:
>>>>>
>>>>> hi Thomas,
>>>>>
>>>>> #1 The following comment in vmError_bsd.cpp,
>>>>> vmError_linux.cpp and vmError_solaris.cpp is now outdated
>>>>> and should be updated from:
>>>>>
>>>>> // handle all synchronous program error signals which may
>>>>> happen during error
>>>>> // reporting. They must be unblocked, caught, handled.
>>>>> //
>>>>> // Note that in hotspot signal handlers are installed
>>>>> with a fully filled
>>>>> // signal mask, i.e. upon delivery all signals - also
>>>>> synchronous error signals
>>>>> // - are blocked. For synchronous error signals this is
>>>>> deadly - program will
>>>>> // hang or be terminated immediately if secondary errors
>>>>> happen during error
>>>>> // reporting.
>>>>>
>>>>> to the comment from vmError_aix.cpp:
>>>>>
>>>>> // Handle all synchronous signals which may happen during
>>>>> signal handling,
>>>>> // not just SIGSEGV and SIGBUS.
>>>>>
>>>>>
>>>>> I am not sure I understand: the problem is that signal
>>>>> handlers are installed in the hotspot with sa_mask completely
>>>>> set, so all signals are blocked. See e.g. os_linux.cpp, line
>>>>> 4304++ (os::Linux::set_signal_handler()). So the comment
>>>>> appears to be correct.
>>>>>
>>>> We continue to block the signals initially, but the comment,
>>>> or at least its last sentence, does not apply anymore because
>>>> of the crux of your fix - ie. the program will not hang
>>>> because we immediately unblock the synchronous traps in the
>>>> crash handler. So maybe we should make that clear in that
>>>> comment?
>>>>
>>>>
>>>> Okay, I see that now. I modified the comments as you suggested
>>>> earlier.
>>>>
>>>> POSIX itself mentioned only the four, see documentation on
>>>>> sigprocmask:
>>>>>
>>>>> "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are
>>>>> generated while they are blocked, the result is undefined,
>>>>> unless the signal was generated by the /kill/()
>>>>>
>>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
>>>>> function,
>>>>> the /sigqueue/()
>>>>>
>>>>> <http://pubs.opengroup.org/onlinepubs/009695399/
>>>>> functions/sigqueue.html>
>>>>> function,
>>>>> or the /raise/()
>>>>>
>>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
>>>>> function."
>>>>>
>>>>>
>>>>> I added SIGTRAP on AIX because it is synchronously delivered
>>>>> too, but do not know any other signals. I am happy to add any
>>>>> other signals you know of.
>>>>>
>>>>> Nope, I do not know any more synchronous signals - just wanted
>>>> to make sure we take a second look and collectively agree that
>>>> this list is exhaustive.
>>>>
>>>> I would like to test it at least twice, to see that signal
>>>>> handling continues to work after secondary handler ran once.
>>>>> It needs to correctly unset the signal mask for the
>>>>> synchronous signals.
>>>>>
>>>>> Executing it four times may be a bit excessive; if you
>>>>> prefer, I can remove it for the out_done case.
>>>>>
>>>>> Note that there is one benefit of repeatedly crashing in the
>>>>> error handler, it is to stress the stack usage during error
>>>>> reporting. But I admit that this test could be done much
>>>>> better.
>>>>>
>>>>> It's a bit confusing at first when someone debugs this code
>>>> and sees four of those secondary signals coming through, so
>>>> maybe just a small comment here would suffice?
>>>>
>>>>
>>>> I changed the coding to only fire for the log case. That should be
>>>> less confusing.
>>>>
>>>> vmerrors.sh is very minimal. The bug symptom was incomplete
>>>>> or empty hs-err files, and you would not catch this with the
>>>>> vmerrors.sh test, would you not? You also can reproduce this
>>>>> bug only by using both -XX:ErrorHandlerTest and
>>>>> -XX:TestCrashInErrorHandler together, so it does not quite
>>>>> fit into what vmerrors.sh does.
>>>>>
>>>>> At SAP we have a nice test suite which crashes the VM in a
>>>>> number of ways and then checks whether the hs-err files are
>>>>> complete (enough). I was planning to add a Jtreg test which
>>>>> does something similar, but did not add it to this change
>>>>> because I did not want to make the change larger than
>>>>> necessary.
>>>>>
>>>> I would think that the added complexity of the test case,
>>>> would in fact make it easier to review the test with the fix
>>>> at the same time. Also, since the test in this case is
>>>> feasible, isn't it recommended to have the fix committed along
>>>> with its test?
>>>>
>>>>
>>>> I added a small jtreg test. Note that the test cannot really
>>>> demonstrate the bug without my fix, because it needs the new
>>>> -XX:TestCrashInErrorHandler options which is part of the fix. But
>>>> it will show regressions.
>>>>
>>>> cheers
>>>>
>>>>
>>>> Best Regards, Thomas
>>>>
>>>>
>>>
>>>
More information about the hotspot-runtime-dev
mailing list