RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Wed Feb 4 10:15:53 UTC 2015

Great, thank you!

@David: Do you still need something from me to sponsor the change?

Kind Regards, Thaoms

On Tue, Feb 3, 2015 at 10:15 PM, Gerard Ziemski <gerard.ziemski at oracle.com>
wrote:

>  Looks good! Thank you for a great fix!
>
> cheers
>
>
> On 2/3/2015 10:40 AM, Thomas Stüfe wrote:
>
> Hi Gerard,
>
>  thank you for your feedback! See here my new change:
> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.07/webrev/
>
>  On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski <gerard.ziemski at oracle.com
> > wrote:
>
>>  hi Thomas,
>>
>> On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>>
>> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski <
>> gerard.ziemski at oracle.com> wrote:
>>
>>> hi Thomas,
>>>
>>> #1 The following comment in vmError_bsd.cpp, vmError_linux.cpp and
>>> vmError_solaris.cpp is now outdated and should be updated from:
>>>
>>> // handle all synchronous program error signals which may happen during
>>> error
>>> // reporting. They must be unblocked, caught, handled.
>>> //
>>> // Note that in hotspot signal handlers are installed with a fully filled
>>> // signal mask, i.e. upon delivery all signals - also synchronous error
>>> signals
>>> // - are blocked. For synchronous error signals this is deadly - program
>>> will
>>> // hang or be terminated immediately if secondary errors happen during
>>> error
>>> // reporting.
>>>
>>> to the comment from vmError_aix.cpp:
>>>
>>> // Handle all synchronous signals which may happen during signal
>>> handling,
>>> // not just SIGSEGV and SIGBUS.
>>>
>>>
>>  I am not sure I understand: the problem is that signal handlers are
>> installed in the hotspot with sa_mask completely set, so all signals are
>> blocked. See e.g. os_linux.cpp, line 4304++
>> (os::Linux::set_signal_handler()). So the comment appears to be correct.
>>
>>  We continue to block the signals initially, but the comment, or at least
>> its last sentence, does not apply anymore because of the crux of your fix -
>> ie. the program will not hang because we immediately unblock the
>> synchronous traps in the crash handler. So maybe we should make that clear
>> in that comment?
>>
>>
>  Okay, I see that now. I modified the comments as you suggested earlier.
>
>
>>   POSIX itself mentioned only the four, see documentation on sigprocmask:
>>
>>  "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated
>> while they are blocked, the result is undefined, unless the signal was
>> generated by the *kill*()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function,
>> the *sigqueue*()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html> function,
>> or the *raise*()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
>>  function."
>>
>>  I added SIGTRAP on AIX because it is synchronously delivered too, but
>> do not know any other signals. I am happy to add any other signals you know
>> of.
>>
>>   Nope, I do not know any more synchronous signals - just wanted to make
>> sure we take a second look and collectively agree that this list is
>> exhaustive.
>>
>>  I would like to test it at least twice, to see that signal handling
>> continues to work after secondary handler ran once. It needs to correctly
>> unset the signal mask for the synchronous signals.
>>
>>  Executing it four times may be a bit excessive; if you prefer, I can
>> remove it for the out_done case.
>>
>>  Note that there is one benefit of repeatedly crashing in the error
>> handler, it is to stress the stack usage during error reporting. But I
>> admit that this test could be done much better.
>>
>>   It's a bit confusing at first when someone debugs this code and sees
>> four of those secondary signals coming through, so maybe just a small
>> comment here would suffice?
>>
>>
>  I changed the coding to only fire for the log case. That should be less
> confusing.
>
>
>>  vmerrors.sh is very minimal. The bug symptom was incomplete or empty
>> hs-err files, and you would not catch this with the vmerrors.sh test, would
>> you not? You also can reproduce this bug only by using both
>> -XX:ErrorHandlerTest and -XX:TestCrashInErrorHandler together, so it does
>> not quite fit into what vmerrors.sh does.
>>
>>  At SAP we have a nice test suite which crashes the VM in a number of
>> ways and then checks whether the hs-err files are complete (enough). I was
>> planning to add a Jtreg test which does something similar, but did not add
>> it to this change because I did not want to make the change larger than
>> necessary.
>>
>>  I would think that the added complexity of the test case, would in fact
>> make it easier to review the test with the fix at the same time. Also,
>> since the test in this case is feasible, isn't it recommended to have the
>> fix committed along with its test?
>>
>>
>  I added a small jtreg test. Note that the test cannot really demonstrate
> the bug without my fix, because it needs the new
> -XX:TestCrashInErrorHandler options which is part of the fix. But it will
> show regressions.
>
>
>>  cheers
>>
>>
>  Best Regards, Thomas
>
>
>
>