RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Tue Nov 25 14:12:19 UTC 2014

Hi all,

I'd like to contribute a fix to error handling to improve stability of
error reporting.

Bug Report:
https://bugs.openjdk.java.net/browse/JDK-8065895

Webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.00/

Problem:

When a synchronous error signal happens during error reporting, and the
signal is different from the original signal which triggered error
reporting, the VM may die or hang (depending on the platform). This causes
empty or almost-empty hs-err files.

Example: we first crash with a SIGILL (e.g. in compiled code), then a
SIGSEGV happens when printing the stack trace.

Secondary error handling should catch the SIGSEGV and continue error
reporting with the next step. But that does not work in this case.

Causes:
  - HotSpot blocks all signals when installing signal handlers. Within the
secondary signal handler, only the original signal gets unblocked; the rest
remain blocked. If a different synchronous error signal then happens, it is
still blocked, and the OS terminates the process right away because there
is no way to defer synchronous error signals.
  - when installing signal handlers for secondary error handling, only
signal handlers for SIGBUS and SIGSEGV were added, but more signals may
happen during error handling (we saw SIGTRAP, SIGILL, etc.).
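
To make the first cause more concrete, here is a minimal sketch (not the
HotSpot code) that reproduces the described mask state in a plain POSIX
process: everything blocked, then only the original signal unblocked. The
helper names are made up for illustration, and it uses sigprocmask() for a
single-threaded demo where a multi-threaded VM would use pthread_sigmask().

```cpp
#include <cassert>
#include <signal.h>

// Returns true if `sig` is currently blocked for the calling thread.
static bool is_blocked(int sig) {
  sigset_t current;
  sigemptyset(&current);
  // With a null `set` argument, sigprocmask only reports the current mask.
  int rc = sigprocmask(SIG_BLOCK, nullptr, &current);
  assert(rc == 0);
  (void)rc;
  return sigismember(&current, sig) == 1;
}

// Hypothetical helper: block every signal, then unblock only the signal
// that originally triggered error reporting. The previous mask is stored
// in `saved` so the caller can restore it.
static void enter_problem_mask(int original_sig, sigset_t* saved) {
  sigset_t all;
  sigfillset(&all);
  sigprocmask(SIG_BLOCK, &all, saved);

  sigset_t only_original;
  sigemptyset(&only_original);
  sigaddset(&only_original, original_sig);
  sigprocmask(SIG_UNBLOCK, &only_original, nullptr);
}

// Restores the mask saved by enter_problem_mask().
static void restore_mask(const sigset_t* saved) {
  sigprocmask(SIG_SETMASK, saved, nullptr);
}
```

With SIGILL as the original signal, is_blocked(SIGSEGV) reports true in
this state, so a real SIGSEGV raised by a faulting instruction at that
point could not be delivered to any handler.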

Fix:
A secondary signal handler is now installed for all synchronous error
signals (these are now kept in a list which is easily expandable in
vmError_<os>.cpp). All of those signals are unblocked.
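
The idea can be sketched roughly as follows. This is a simplified
illustration, not the code from the webrev: the signal list, the function
names, the empty sa_mask and the empty handler body are all assumptions
made for the example, and sigprocmask() stands in for whatever the VM
uses per thread.

```cpp
#include <cassert>
#include <signal.h>
#include <stddef.h>

// Hypothetical list of synchronous error signals; easy to extend.
static const int SYNC_SIGNALS[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGTRAP };
static const size_t NUM_SYNC_SIGNALS =
    sizeof(SYNC_SIGNALS) / sizeof(SYNC_SIGNALS[0]);

// Placeholder handler. A real secondary handler would continue error
// reporting with the next step instead of simply returning.
static void secondary_handler(int sig, siginfo_t* info, void* ucontext) {
  (void)sig; (void)info; (void)ucontext;
}

// Installs the secondary handler for every signal in the list and then
// unblocks all of them. Returns false if any system call fails.
static bool install_secondary_handlers() {
  sigset_t to_unblock;
  sigemptyset(&to_unblock);

  for (size_t i = 0; i < NUM_SYNC_SIGNALS; i++) {
    struct sigaction sa = {};
    sa.sa_sigaction = secondary_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);  // assumption: block nothing extra in the handler
    if (sigaction(SYNC_SIGNALS[i], &sa, nullptr) != 0) {
      return false;
    }
    sigaddset(&to_unblock, SYNC_SIGNALS[i]);
  }

  // Unblock every signal in the list, not just the original one.
  return sigprocmask(SIG_UNBLOCK, &to_unblock, nullptr) == 0;
}
```

Keeping the signals in one array means that supporting a further signal
is a one-line change, which is the point of the list mentioned above.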

In order to test the fix, some test code was added too:

a) debug.cpp: changed "test_error_handler()" to a more generic
"controlled_crash(int how)", which can be called at arbitrary places, not
only at initialization time. "test_error_handler()" still exists and just
calls "controlled_crash(ErrorHandlerTest)", so its behaviour did not change.

b) expanded controlled_crash():
  - added option 14, a guaranteed crash with a SIGSEGV at a predefined
address, which is printed out and can later be tested against. I realize
that this is somewhat redundant with options 12 and 13, but here the crash
is guaranteed and happens at a non-null address which should turn up in the
hs-err file (to check that the hs-err file is correct).
  - added option 15, a guaranteed crash with a SIGILL at a predefined
instruction address. Here, the point is to get a real-world SIGILL (not
just raising it) at a known, non-null pc.
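
For illustration only, the idea behind option 14 (crash at a known non-null
address and check that the reported fault address matches) could look like
the sketch below. It is not the debug.cpp code: the address 1024, the
function names and the fork-based check are assumptions for this example,
and it relies on a POSIX system where that low address is unmapped.

```cpp
#include <cassert>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical predefined fault address. It is volatile so the compiler
// cannot make assumptions about the pointer derived from it.
static volatile uintptr_t crash_address = 1024;

// Compares the fault address reported by the OS with the expected one.
// _exit() is async-signal-safe; exit code 0 means the addresses match.
static void check_fault_address(int sig, siginfo_t* info, void* ucontext) {
  (void)sig; (void)ucontext;
  _exit(reinterpret_cast<uintptr_t>(info->si_addr) == crash_address ? 0 : 1);
}

// Forks a child which crashes with a store to crash_address.
// Returns 0 if the child saw a SIGSEGV at the expected address, 1 if the
// address differed, 2 if no fault happened, and -1 on any other outcome.
static int crash_in_child_and_check() {
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    struct sigaction sa = {};
    sa.sa_sigaction = check_fault_address;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
    *reinterpret_cast<volatile int*>(crash_address) = 1;  // expected to fault
    _exit(2);  // only reached if the store did not fault
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running the crash in a child process keeps the caller alive, so the result
can be checked programmatically instead of by reading an hs-err file.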

c) added a parameter "-XX:TestCrashDuringErrorHandler=<n>", which works the
same way as "-XX:ErrorHandlerTest=<n>". This parameter is used to trigger
controlled crashes inside the error handler. That way secondary error
handling can be tested.

(a)-(c) allow us to test the fixes manually, for example:

java -XX:ErrorHandlerTest=15 -XX:TestCrashDuringErrorHandler=14

causes a SIGILL during initialization, and a secondary SIGSEGV inside error
handling. This demonstrates the effect of the bug. Without the fix, the VM
will abort right away without finishing the hs-err file.

--

I am in the process of writing some jtreg tests, but I would like to put
those into a separate change. This is because there are more fixes to error
reporting in our pipeline and I'd like to bundle the jtreg tests in one
change.

Kind Regards,

Thomas Stuefe