RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Wed Nov 26 07:06:52 UTC 2014

Hi David,

thanks for looking at this. Here is the updated webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.01/

See my comments below.

On Wed, Nov 26, 2014 at 2:29 AM, David Holmes <david.holmes at oracle.com>
wrote:

> Hi Thomas,
>
> A few quick comments as I need to think more about this:
>
> - On Solaris we use the UI thread API thr_* not pthreads
>

Fixed, now I use thr_sigsetmask() (though both sigprocmask and
pthread_sigmask seemed to work too)

> - In debug.cpp for the SIGILL can you define the  all zero case as a
> default so we only need to add platform specific definitions when all
> zeroes doesn't work. I really hate seeing all that CPU selection in shared
> code. :(
>

Agreed and fixed, moved the CPU-specific sections into CPU-specific files.

> - Style nit: please use i++ rather than i ++
>
>
Fixed.

Aside: we should eradicate the use of sigprocmask and replace with the
> thread specific version.
>
>
Agree. Though I never saw any errors stemming from the use of
sigprocmask(). According to POSIX, sigprocmask() is undefined in
multithreaded environment, and I guess most OSes just default to
pthread_sigmask.

> Getting back to the "thinking more about this" ... If a synchronous signal
> is blocked at the time it is generated then it should remain pending on the
> thread (POSIX spec) but that doesn't tell us what the thread will then do -
> retry the faulting instruction? Become unschedulable? So I can easily
> imagine that a hang or process termination may result.

This is exactly what happens, but it is actually covered by POSIX, see doc
on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS
signals are generated while they are blocked, the result is undefined,
unless the signal was generated by the *kill*()
<http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
function, the *sigqueue*()
<http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>
function, or the *raise*()
<http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
function."

In reality, process usually aborts abnormally with the default action for
the signal, e.g. printing out "Illegal Instruction". On MacOS, we hang
(until the Watcherthread finally kills the VM). On old AIXes, we die
without a trace.

This also can be easily tried out by removing SIGILL from the list of
signals in vmError_<os>.cpp and executing:

java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15

which will crash first with a SIGSEGV, then in error handling with a
secondary SIGILL. This will interrupt error reporting and kill or hang the
process.

In that sense unblocking those signals whilst handling the initial signal
> may well allow the error reporting process to continue further. But I'm
> unclear exactly how this plays out:
>
> - synchronous signal encountered
> - crash_handler invoked
>
- VMError::report_and_die executes
> - secondary signal encountered
>
- crash_handler invoked again
>

almost: not again, different signal handler now. First signal was handled
by "JVM_handle_<os>_signal()"

> - VMError::report_and_die executes again and sees the recursion and
> returns (ignoring abort due to excessive recursive errors)
>
>
No..

Is that right? So we actually return from the crash_handler?

Oh, but we dont return. VMError::report_and_die() will just create a new
frame and re-execute VMError::report() of the first VMError object. Which
then will continue with the next STEP. We never return, for each secondary
error signal a new frame is created.

This all happens in VMError::report_and_die:
-> first error ? anchor VMError object in a static variable and execute
VMError::report()
-> secondary error?
   -> different thread? just sleep forever
   -> same thread? new frame, re-enter VMError::report(). Once done, abort.

I always found that rather neat, but in fact that is not our invention but
Sun's :) Anyway, my fix does not change this behaviour for better or worse,
it only makes it usable for more cases.

> Because this puts us in undefined territory according to POSIX:
>
> "The behavior of a process is undefined after it returns normally from a
> signal-catching function for a SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal
> that was not generated by kill(), sigqueue(), or raise()."
>
> true, but we dont return...

> On top of that you also have the issue that error reporting does a whole
> bunch of things that are not async-signal-safe so we can easily encounter
> hangs or aborts.
>
> But we're dying anyway so I guess none of this really matters. If
> re-enabling these signals allows error reporting to progress further in
> some cases then that is a win.
>
>
Actually, this covers a lot of cases, mostly because SIGSEGV during error
reporting is common, so if the original error was not SIGSEGV, but e.g.
SIGILL, this would always result in broken hs-err files.

The back story is that at SAP, we rely heavily on the hs-err files. They
are our main tool for support, because working with cores is often not
feasible. So, we put a lot of work in making error reporting reliable
across all platforms. This is also covered by many tests which crash the VM
in exciting ways and check the hs-err files for completeness.

Kind Regards, Thomas

> Cheers,
> David
>
>
> On 26/11/2014 12:12 AM, Thomas Stüfe wrote:
>
>> Hi all,
>>
>> I'd like to contribute a fix to error handling to improve stability of
>> error reporting.
>>
>>
>> Bug Report:
>> https://bugs.openjdk.java.net/browse/JDK-8065895
>>
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.00/
>>
>>
>> Problem:
>>
>> When a synchronous error signal happens during error reporting, and the
>> signal is different from the original signal which triggered error
>> reporting, VM may die or hang (depends on platform). This causes empty or
>> almost-empty hs-err files.
>>
>> Example: we first crash with a SIGILL (e.g in compiled code), then a
>> SIGSEGV happens when printing stack trace.
>>
>> Secondary error handling should catch the SIGSEGV and continue error
>> reporting with the next step. But that does not work in this case.
>>
>> Causes:
>>    - hotspot blocks all signals when installing signal handlers. Within
>> the
>> secondary signal handler, only the original signal gets unblocked, the
>> rest
>> remained blocked. If another synchronous error signal happens, it is still
>> blocked. If the second signal is a synchronous signal, the OS would
>> terminate the process right away because there is no way to defer
>> synchronous error signals.
>>    - when installing signal handlers for secondary error handling, only
>> signal handlers for SIGBUS and SIGSEGV were added; but more signals may
>> happen during error handling (we saw SIGTRAP, SIGILL, ..etc).
>>
>> Fix:
>> secondary signal handler is installed for all synchronous error signals
>> (which is now a list and easily expandable in vmError_<os>.cpp). All those
>> signals are unblocked.
>>
>> In order to test the fix, some test code was added too:
>>
>> a) debug.cpp: changed "test_error_handler()" to a more generic
>> "controlled_crash(int how)", which can be called at arbitrary places, not
>> only at initialization time. "test_error_handler()" still exists and just
>> calls "controlled_crash(ErrorHandlerTest)", so its behaviour did not
>> change.
>>
>> b) expand controlled_crash():
>>    - added option 14, a guaranteed crash with a SIGSEGV at a predefined
>> address, which is printed out and can later be tested against. Note that I
>> realize that this is a bit redundant to option 12 or 13, but the crash is
>> guaranteed and it crashes with a not-null address which should turn up in
>> hs-err file (to check that hs-err file is correct).
>>    - added option 15, a guaranteed crash with a SIGILL at a predefined
>> instruction address. Here, the point is to get a real-world SIGILL (not
>> just raising it) at a not-null known pc.
>>
>> c) Add a parameter "-XX:TestCrashDuringErrorHandler=<n>", which works the
>> same as "-XX:ErrorHandlerTest=<n>". This parameter is used to trigger
>> controlled crashes inside the error handler. That way secondary error
>> handling can be tested.
>>
>> (a)-(c) allow us to test the fixes manually, for example:
>>
>> java -XX:ErrorHandlerTest=15  -XX:TestCrashDuringErrorHandler=14
>>
>> causes a SIGILL during initialization, and a secondary SIGSEGV inside
>> error
>> handling. This demonstrates the effect of the bug. Without the fix, the VM
>> will abort right away without finishing the hs-err file.
>>
>> --
>>
>> I am in the process of writing some JTreg Tests, but I would like to put
>> those into a separate change. This is because there are more fixes to
>> error
>> reporting in our pipeline and I'd like to bundle the jtreg tests in one
>> change.
>>
>> Kind Regards,
>>
>> Thomas Stuefe
>>
>>