RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Tue Feb 3 16:40:31 UTC 2015

Hi Gerard,

thank you for your feedback! See here my new change:
http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.07/webrev/

On Mon, Feb 2, 2015 at 4:19 PM, Gerard Ziemski <gerard.ziemski at oracle.com>
wrote:

>  hi Thomas,
>
> On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
>
> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski <
> gerard.ziemski at oracle.com> wrote:
>
>> hi Thomas,
>>
>> #1 The following comment in vmError_bsd.cpp, vmError_linux.cpp and
>> vmError_solaris.cpp is now outdated and should be updated from:
>>
>> // handle all synchronous program error signals which may happen during
>> error
>> // reporting. They must be unblocked, caught, handled.
>> //
>> // Note that in hotspot signal handlers are installed with a fully filled
>> // signal mask, i.e. upon delivery all signals - also synchronous error
>> signals
>> // - are blocked. For synchronous error signals this is deadly - program
>> will
>> // hang or be terminated immediately if secondary errors happen during
>> error
>> // reporting.
>>
>> to the comment from vmError_aix.cpp:
>>
>> // Handle all synchronous signals which may happen during signal handling,
>> // not just SIGSEGV and SIGBUS.
>>
>>
>  I am not sure I understand: the problem is that signal handlers are
> installed in the hotspot with sa_mask completely set, so all signals are
> blocked. See e.g. os_linux.cpp, line 4304++
> (os::Linux::set_signal_handler()). So the comment appears to be correct.
>
> We continue to block the signals initially, but the comment, or at least
> its last sentence, does not apply anymore because of the crux of your fix -
> ie. the program will not hang because we immediately unblock the
> synchronous traps in the crash handler. So maybe we should make that clear
> in that comment?
>
>
Okay, I see that now. I modified the comments as you suggested earlier.

>  POSIX itself mentioned only the four, see documentation on sigprocmask:
>
>  "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated
> while they are blocked, the result is undefined, unless the signal was
> generated by the *kill*()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function,
> the *sigqueue*()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html> function,
> or the *raise*()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
>  function."
>
>  I added SIGTRAP on AIX because it is synchronously delivered too, but do
> not know any other signals. I am happy to add any other signals you know of.
>
>  Nope, I do not know any more synchronous signals - just wanted to make
> sure we take a second look and collectively agree that this list is
> exhaustive.
>
>  I would like to test it at least twice, to see that signal handling
> continues to work after secondary handler ran once. It needs to correctly
> unset the signal mask for the synchronous signals.
>
>  Executing it four times may be a bit excessive; if you prefer, I can
> remove it for the out_done case.
>
>  Note that there is one benefit of repeatedly crashing in the error
> handler, it is to stress the stack usage during error reporting. But I
> admit that this test could be done much better.
>
>  It's a bit confusing at first when someone debugs this code and sees
> four of those secondary signals coming through, so maybe just a small
> comment here would suffice?
>
>
I changed the coding to only fire for the log case. That should be less
confusing.

>  vmerrors.sh is very minimal. The bug symptom was incomplete or empty
> hs-err files, and you would not catch this with the vmerrors.sh test, would
> you not? You also can reproduce this bug only by using both
> -XX:ErrorHandlerTest and -XX:TestCrashInErrorHandler together, so it does
> not quite fit into what vmerrors.sh does.
>
>  At SAP we have a nice test suite which crashes the VM in a number of
> ways and then checks whether the hs-err files are complete (enough). I was
> planning to add a Jtreg test which does something similar, but did not add
> it to this change because I did not want to make the change larger than
> necessary.
>
> I would think that the added complexity of the test case, would in fact
> make it easier to review the test with the fix at the same time. Also,
> since the test in this case is feasible, isn't it recommended to have the
> fix committed along with its test?
>
>
I added a small jtreg test. Note that the test cannot really demonstrate
the bug without my fix, because it needs the new
-XX:TestCrashInErrorHandler options which is part of the fix. But it will
show regressions.

> cheers
>
>
Best Regards, Thomas