RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Wed Nov 26 01:29:28 UTC 2014

Hi Thomas,

A few quick comments as I need to think more about this:

- On Solaris we use the UI threads API (thr_*), not pthreads
- In debug.cpp, for the SIGILL case, can you define the all-zero case as
the default so we only need to add platform-specific definitions where all
zeroes doesn't work? I really hate seeing all that CPU selection in
shared code. :(
- Style nit: please use i++ rather than i ++
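
To make the SIGILL suggestion concrete, here is a rough sketch of the
"all zeroes unless overridden" shape. The macro and function names are
hypothetical (they are not in the webrev); the idea is only that a
platform header supplies a value where an all-zero word does not fault,
and shared code never mentions a CPU:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro name. A platform-specific header would define it
// only on CPUs where an all-zero instruction word is not illegal.
#ifndef ILLEGAL_INSTRUCTION_WORD
#define ILLEGAL_INSTRUCTION_WORD 0u  // shared default: all zeroes
#endif

// Returns the instruction word the crash test would plant and execute
// to provoke a real SIGILL (hypothetical helper).
inline std::uint32_t illegal_instruction_word() {
    return ILLEGAL_INSTRUCTION_WORD;
}
```

With no platform override in effect, illegal_instruction_word() yields 0.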

Aside: we should eradicate the use of sigprocmask and replace it with the
thread-specific version.
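
For reference, a minimal sketch of what the thread-specific call looks
like with pthreads (on Solaris the UI-threads counterpart would be
thr_sigsetmask). POSIX leaves sigprocmask unspecified in a
multi-threaded process, which is the motivation. The helper name is
made up for illustration:

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Blocks `sig` for the calling thread only, checks that it is now part
// of this thread's mask, then restores the previous mask.
// Returns true if the signal was observed as blocked.
bool blocks_for_calling_thread(int sig) {
    sigset_t to_block, previous, current;
    sigemptyset(&to_block);
    sigaddset(&to_block, sig);
    sigemptyset(&current);

    // Affects only the calling thread's signal mask.
    if (pthread_sigmask(SIG_BLOCK, &to_block, &previous) != 0) return false;

    // A null `set` argument means "query only"; `how` is ignored.
    if (pthread_sigmask(SIG_SETMASK, nullptr, &current) != 0) return false;
    bool blocked = (sigismember(&current, sig) == 1);

    // Put the original mask back so the caller is unaffected.
    pthread_sigmask(SIG_SETMASK, &previous, nullptr);
    return blocked;
}
```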

Getting back to the "thinking more about this" ... If a synchronous 
signal is blocked at the time it is generated then it should remain 
pending on the thread (POSIX spec) but that doesn't tell us what the 
thread will then do - retry the faulting instruction? Become 
unschedulable? So I can easily imagine that a hang or process 
termination may result. In that sense unblocking those signals whilst 
handling the initial signal may well allow the error reporting process 
to continue further. But I'm unclear exactly how this plays out:

- synchronous signal encountered
- crash_handler invoked
- VMError::report_and_die executes
- secondary signal encountered
- crash_handler invoked again
- VMError::report_and_die executes again and sees the recursion and 
returns (ignoring abort due to excessive recursive errors)

Is that right? So we actually return from the crash_handler? Because 
this puts us in undefined territory according to POSIX:

"The behavior of a process is undefined after it returns normally from a 
signal-catching function for a SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal 
that was not generated by kill(), sigqueue(), or raise()."

On top of that you also have the issue that error reporting does a whole 
bunch of things that are not async-signal-safe so we can easily 
encounter hangs or aborts.

But we're dying anyway so I guess none of this really matters. If 
re-enabling these signals allows error reporting to progress further in 
some cases then that is a win.
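
To illustrate the masking effect, here is a small sketch, not the
HotSpot code. It assumes the handlers are installed with a full sa_mask
(my reading of "blocks all signals when installing signal handlers") and
uses raise() with SIGUSR1/SIGUSR2 as harmless stand-ins for the primary
and secondary signals. Because raise()-generated signals can simply stay
pending, it shows only the blocking/unblocking mechanics, not what a
real blocked fault would do. All names are hypothetical:

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Flags touched from signal handlers, hence volatile sig_atomic_t.
static volatile sig_atomic_t g_in_primary = 0;
static volatile sig_atomic_t g_unblock_in_handler = 0;
static volatile sig_atomic_t g_secondary_ran_nested = 0;

static void secondary_handler(int) {
    // Record whether we were entered while the primary handler was active.
    if (g_in_primary) g_secondary_ran_nested = 1;
}

static void primary_handler(int) {
    g_in_primary = 1;
    if (g_unblock_in_handler) {
        // The proposed fix in miniature: unblock the other signal first.
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGUSR2);
        pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
    }
    raise(SIGUSR2);  // stand-in for a secondary error during reporting
    g_in_primary = 0;
}

// Returns true if the secondary handler ran inside the primary handler.
bool secondary_runs_inside_primary(bool unblock) {
    g_secondary_ran_nested = 0;
    g_unblock_in_handler = unblock ? 1 : 0;

    struct sigaction sa = {};
    sa.sa_flags = 0;
    sigfillset(&sa.sa_mask);  // everything is blocked while a handler runs
    sa.sa_handler = primary_handler;
    sigaction(SIGUSR1, &sa, nullptr);
    sa.sa_handler = secondary_handler;
    sigaction(SIGUSR2, &sa, nullptr);

    raise(SIGUSR1);
    return g_secondary_ran_nested != 0;
}
```

Without the unblock, SIGUSR2 is delivered only after the primary handler
has returned; with it, the secondary handler runs nested inside the
primary one, which is what lets reporting carry on to the next step.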

Cheers,
David

On 26/11/2014 12:12 AM, Thomas Stüfe wrote:
> Hi all,
>
> I'd like to contribute a fix to error handling to improve stability of
> error reporting.
>
>
> Bug Report:
> https://bugs.openjdk.java.net/browse/JDK-8065895
>
>
> Webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.00/
>
>
> Problem:
>
> When a synchronous error signal happens during error reporting, and the
> signal is different from the original signal which triggered error
> reporting, the VM may die or hang (depending on the platform). This causes
> empty or almost-empty hs-err files.
>
> Example: we first crash with a SIGILL (e.g. in compiled code), then a
> SIGSEGV happens while printing the stack trace.
>
> Secondary error handling should catch the SIGSEGV and continue error
> reporting with the next step. But that does not work in this case.
>
> Causes:
>    - hotspot blocks all signals when installing signal handlers. Within the
> secondary signal handler, only the original signal gets unblocked; the rest
> remain blocked. If a different synchronous error signal then happens, it is
> still blocked, and the OS would terminate the process right away because
> there is no way to defer synchronous error signals.
>    - when installing signal handlers for secondary error handling, only
> signal handlers for SIGBUS and SIGSEGV were added, but more signals may
> happen during error handling (we saw SIGTRAP, SIGILL, etc.).
>
> Fix:
> a secondary signal handler is now installed for all synchronous error
> signals (these are now kept in a list which is easily expandable in
> vmError_<os>.cpp). All those signals are unblocked.
>
> In order to test the fix, some test code was added too:
>
> a) debug.cpp: changed "test_error_handler()" to a more generic
> "controlled_crash(int how)", which can be called at arbitrary places, not
> only at initialization time. "test_error_handler()" still exists and just
> calls "controlled_crash(ErrorHandlerTest)", so its behaviour did not change.
>
> b) expand controlled_crash():
>    - added option 14, a guaranteed crash with a SIGSEGV at a predefined
> address, which is printed out and can later be tested against. I realize
> that this is a bit redundant with option 12 or 13, but the crash is
> guaranteed and it crashes with a non-null address which should turn up in
> the hs-err file (to check that the hs-err file is correct).
>    - added option 15, a guaranteed crash with a SIGILL at a predefined
> instruction address. Here, the point is to get a real-world SIGILL (not
> just raising it) at a known non-null pc.
>
> c) Add a parameter "-XX:TestCrashDuringErrorHandler=<n>", which works the
> same as "-XX:ErrorHandlerTest=<n>". This parameter is used to trigger
> controlled crashes inside the error handler. That way secondary error
> handling can be tested.
>
> (a)-(c) allow us to test the fixes manually, for example:
>
> java -XX:ErrorHandlerTest=15  -XX:TestCrashDuringErrorHandler=14
>
> causes a SIGILL during initialization, and a secondary SIGSEGV inside error
> handling. This demonstrates the effect of the bug. Without the fix, the VM
> will abort right away without finishing the hs-err file.
>
> --
>
> I am in the process of writing some JTreg Tests, but I would like to put
> those into a separate change. This is because there are more fixes to error
> reporting in our pipeline and I'd like to bundle the jtreg tests in one
> change.
>
> Kind Regards,
>
> Thomas Stuefe
>