RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
David Holmes
david.holmes at oracle.com
Thu Feb 5 02:12:32 UTC 2015
Thomas, Gerard,
I pushed the changes but didn't realize that hg import wouldn't hg add
the new test file and so it got left behind. But I then realized that
the test should be named/located differently as we don't structure the
tests on bug number any more. So I will file a new bug and rename and
relocate the test, and then push that:
test/runtime/ErrorHandling/SecondaryErrorTest.java
David
On 4/02/2015 8:18 PM, David Holmes wrote:
> On 4/02/2015 8:15 PM, Thomas Stüfe wrote:
>> Great, thank you!
>>
>> @David: Do you still need something from me to sponsor the change?
>
> No I think I have everything I need. Will push it in the morning.
>
> You'll be listed as Contributor as you don't have Author status for jdk9
> project.
>
> Thanks,
> David
>
>> Kind Regards, Thaoms
>>
>> On Tue, Feb 3, 2015 at 10:15 PM, Gerard Ziemski
>> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>>
>> Looks good! Thank you for a great fix!
>>
>> cheers
>>
>>
>> On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
>>> Hi Gerard,
>>>
>>> thank you for your feedback! See here my new change:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.07/webrev/
>>>
>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8065895/webrev.07/webrev/>
>>>
>>>
>>> On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski
>>> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>>
>>> wrote:
>>>
>>> hi Thomas,
>>>
>>> On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>>>> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski
>>>> <gerard.ziemski at oracle.com
>>>> <mailto:gerard.ziemski at oracle.com>> wrote:
>>>>
>>>> hi Thomas,
>>>>
>>>> #1 The following comment in vmError_bsd.cpp,
>>>> vmError_linux.cpp and vmError_solaris.cpp is now outdated
>>>> and should be updated from:
>>>>
>>>> // handle all synchronous program error signals which may
>>>> happen during error
>>>> // reporting. They must be unblocked, caught, handled.
>>>> //
>>>> // Note that in hotspot signal handlers are installed
>>>> with a fully filled
>>>> // signal mask, i.e. upon delivery all signals - also
>>>> synchronous error signals
>>>> // - are blocked. For synchronous error signals this is
>>>> deadly - program will
>>>> // hang or be terminated immediately if secondary errors
>>>> happen during error
>>>> // reporting.
>>>>
>>>> to the comment from vmError_aix.cpp:
>>>>
>>>> // Handle all synchronous signals which may happen during
>>>> signal handling,
>>>> // not just SIGSEGV and SIGBUS.
>>>>
>>>>
>>>> I am not sure I understand: the problem is that signal
>>>> handlers are installed in the hotspot with sa_mask completely
>>>> set, so all signals are blocked. See e.g. os_linux.cpp, line
>>>> 4304++ (os::Linux::set_signal_handler()). So the comment
>>>> appears to be correct.
>>> We continue to block the signals initially, but the comment,
>>> or at least its last sentence, does not apply anymore because
>>> of the crux of your fix - ie. the program will not hang
>>> because we immediately unblock the synchronous traps in the
>>> crash handler. So maybe we should make that clear in that
>>> comment?
>>>
>>>
>>> Okay, I see that now. I modified the comments as you suggested
>>> earlier.
>>>
>>>> POSIX itself mentioned only the four, see documentation on
>>>> sigprocmask:
>>>>
>>>> "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are
>>>> generated while they are blocked, the result is undefined,
>>>> unless the signal was generated by the /kill/()
>>>>
>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
>>>> function,
>>>> the /sigqueue/()
>>>>
>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>
>>>> function,
>>>> or the /raise/()
>>>>
>>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html> function."
>>>>
>>>>
>>>> I added SIGTRAP on AIX because it is synchronously delivered
>>>> too, but do not know any other signals. I am happy to add any
>>>> other signals you know of.
>>>>
>>> Nope, I do not know any more synchronous signals - just wanted
>>> to make sure we take a second look and collectively agree that
>>> this list is exhaustive.
>>>
>>>> I would like to test it at least twice, to see that signal
>>>> handling continues to work after secondary handler ran once.
>>>> It needs to correctly unset the signal mask for the
>>>> synchronous signals.
>>>>
>>>> Executing it four times may be a bit excessive; if you
>>>> prefer, I can remove it for the out_done case.
>>>>
>>>> Note that there is one benefit of repeatedly crashing in the
>>>> error handler, it is to stress the stack usage during error
>>>> reporting. But I admit that this test could be done much
>>>> better.
>>>>
>>> It's a bit confusing at first when someone debugs this code
>>> and sees four of those secondary signals coming through, so
>>> maybe just a small comment here would suffice?
>>>
>>>
>>> I changed the coding to only fire for the log case. That should be
>>> less confusing.
>>>
>>>> vmerrors.sh is very minimal. The bug symptom was incomplete
>>>> or empty hs-err files, and you would not catch this with the
>>>> vmerrors.sh test, would you not? You also can reproduce this
>>>> bug only by using both -XX:ErrorHandlerTest and
>>>> -XX:TestCrashInErrorHandler together, so it does not quite
>>>> fit into what vmerrors.sh does.
>>>>
>>>> At SAP we have a nice test suite which crashes the VM in a
>>>> number of ways and then checks whether the hs-err files are
>>>> complete (enough). I was planning to add a Jtreg test which
>>>> does something similar, but did not add it to this change
>>>> because I did not want to make the change larger than
>>>> necessary.
>>> I would think that the added complexity of the test case,
>>> would in fact make it easier to review the test with the fix
>>> at the same time. Also, since the test in this case is
>>> feasible, isn't it recommended to have the fix committed along
>>> with its test?
>>>
>>>
>>> I added a small jtreg test. Note that the test cannot really
>>> demonstrate the bug without my fix, because it needs the new
>>> -XX:TestCrashInErrorHandler options which is part of the fix. But
>>> it will show regressions.
>>>
>>> cheers
>>>
>>>
>>> Best Regards, Thomas
>>>
>>
>>
More information about the hotspot-runtime-dev
mailing list