RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Mon Feb 2 15:19:09 UTC 2015

hi Thomas,

On 2/1/2015 9:56 AM, Thomas Stüfe wrote:
> On Fri, Jan 30, 2015 at 11:23 PM, Gerard Ziemski 
> <gerard.ziemski at oracle.com <mailto:gerard.ziemski at oracle.com>> wrote:
>
>     hi Thomas,
>
>     #1 The following comment in vmError_bsd.cpp, vmError_linux.cpp and
>     vmError_solaris.cpp is now outdated and should be updated from:
>
>     // handle all synchronous program error signals which may happen
>     during error
>     // reporting. They must be unblocked, caught, handled.
>     //
>     // Note that in hotspot signal handlers are installed with a fully
>     filled
>     // signal mask, i.e. upon delivery all signals - also synchronous
>     error signals
>     // - are blocked. For synchronous error signals this is deadly -
>     program will
>     // hang or be terminated immediately if secondary errors happen
>     during error
>     // reporting.
>
>     to the comment from vmError_aix.cpp:
>
>     // Handle all synchronous signals which may happen during signal
>     handling,
>     // not just SIGSEGV and SIGBUS.
>
>
> I am not sure I understand: the problem is that signal handlers are 
> installed in the hotspot with sa_mask completely set, so all signals 
> are blocked. See e.g. os_linux.cpp, line 4304++ 
> (os::Linux::set_signal_handler()). So the comment appears to be correct.
We continue to block the signals initially, but the comment, or at least 
its last sentence, does not apply anymore because of the crux of your 
fix - ie. the program will not hang because we immediately unblock the 
synchronous traps in the crash handler. So maybe we should make that 
clear in that comment?

> POSIX itself mentioned only the four, see documentation on sigprocmask:
>
> "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are 
> generated while they are blocked, the result is undefined, unless the 
> signal was generated by the /kill/() 
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html> function, 
> the /sigqueue/() 
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html> function, 
> or the /raise/() 
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html> function."
>
> I added SIGTRAP on AIX because it is synchronously delivered too, but 
> do not know any other signals. I am happy to add any other signals you 
> know of.
>
Nope, I do not know any more synchronous signals - just wanted to make 
sure we take a second look and collectively agree that this list is 
exhaustive.

> I would like to test it at least twice, to see that signal handling 
> continues to work after secondary handler ran once. It needs to 
> correctly unset the signal mask for the synchronous signals.
>
> Executing it four times may be a bit excessive; if you prefer, I can 
> remove it for the out_done case.
>
> Note that there is one benefit of repeatedly crashing in the error 
> handler, it is to stress the stack usage during error reporting. But I 
> admit that this test could be done much better.
>
It's a bit confusing at first when someone debugs this code and sees 
four of those secondary signals coming through, so maybe just a small 
comment here would suffice?

> vmerrors.sh is very minimal. The bug symptom was incomplete or empty 
> hs-err files, and you would not catch this with the vmerrors.sh test, 
> would you not? You also can reproduce this bug only by using both 
> -XX:ErrorHandlerTest and -XX:TestCrashInErrorHandler together, so it 
> does not quite fit into what vmerrors.sh does.
>
> At SAP we have a nice test suite which crashes the VM in a number of 
> ways and then checks whether the hs-err files are complete (enough). I 
> was planning to add a Jtreg test which does something similar, but did 
> not add it to this change because I did not want to make the change 
> larger than necessary.
I would think that the added complexity of the test case, would in fact 
make it easier to review the test with the fix at the same time. Also, 
since the test in this case is feasible, isn't it recommended to have 
the fix committed along with its test?

cheers