RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Wed Nov 26 09:31:42 UTC 2014

Hi Thomas,

On 26/11/2014 5:06 PM, Thomas Stüfe wrote:
> Hi David,
>
> thanks for looking at this. Here is the updated webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.01/
>
> See my comments below.
>
> On Wed, Nov 26, 2014 at 2:29 AM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
>     Hi Thomas,
>
>     A few quick comments as I need to think more about this:
>
>     - On Solaris we use the UI thread API thr_* not pthreads
>
>
> Fixed, now I use thr_sigsetmask() (though both sigprocmask and
> pthread_sigmask seemed to work too)

Thanks. They are interchangeable semantically but for consistency I 
prefer not to mix them.

>     - In debug.cpp for the SIGILL can you define the  all zero case as a
>     default so we only need to add platform specific definitions when
>     all zeroes doesn't work. I really hate seeing all that CPU selection
>     in shared code. :(
>
>
> Agreed and fixed, moved the CPU-specific sections into CPU-specific files.

I'd really like to see a way to share the all-zeroes case so that we 
don't need to add platform specific code unnecessarily.

>     - Style nit: please use i++ rather than i ++
>
>
> Fixed.
>
>     Aside: we should eradicate the use of sigprocmask and replace with
>     the thread specific version.
>
>
> Agree. Though I never saw any errors stemming from the use of
> sigprocmask(). According to POSIX, sigprocmask() is undefined in
> multithreaded environment, and I guess most OSes just default to
> pthread_sigmask.

Yes "probably" works okay but I hate to see us using something with 
undefined semantics. That's future clean up though.

>     Getting back to the "thinking more about this" ... If a synchronous
>     signal is blocked at the time it is generated then it should remain
>     pending on the thread (POSIX spec) but that doesn't tell us what the
>     thread will then do - retry the faulting instruction? Become
>     unschedulable? So I can easily imagine that a hang or process
>     termination may result.
>
>
> This is exactly what happens, but it is actually covered by POSIX, see
> doc on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or
> SIGBUS signals are generated while they are blocked, the result is
> undefined, unless the signal was generated by the /kill/()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
> function, the /sigqueue/()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>
> function, or the /raise/()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
> function."

Thanks - I managed to miss that part even though I found the other part 
about the signal handling function returning. :(

> In reality, process usually aborts abnormally with the default action
> for the signal, e.g. printing out "Illegal Instruction". On MacOS, we
> hang (until the Watcherthread finally kills the VM). On old AIXes, we
> die without a trace.
>
> This also can be easily tried out by removing SIGILL from the list of
> signals in vmError_<os>.cpp and executing:
>
> java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>
> which will crash first with a SIGSEGV, then in error handling with a
> secondary SIGILL. This will interrupt error reporting and kill or hang
> the process.
>
>
>     In that sense unblocking those signals whilst handling the initial
>     signal may well allow the error reporting process to continue
>     further. But I'm unclear exactly how this plays out:
>
>     - synchronous signal encountered
>     - crash_handler invoked
>
>     - VMError::report_and_die executes
>     - secondary signal encountered
>
>     - crash_handler invoked again
>
>
> almost: not again, different signal handler now. First signal was
> handled by "JVM_handle_<os>_signal()"

Ah missed that - thanks - not that it makes much difference :)

>     - VMError::report_and_die executes again and sees the recursion and
>     returns (ignoring abort due to excessive recursive errors)
>
>
> No..
>
>     Is that right? So we actually return from the crash_handler?
>
>
> Oh, but we dont return. VMError::report_and_die() will just create a new
> frame and re-execute VMError::report() of the first VMError object.
> Which then will continue with the next STEP. We never return, for each
> secondary error signal a new frame is created.

I had trouble tracing through exactly what might happen on the recursive 
call to report_and_die. I see know that report comes from:

     staticBufferStream sbs(buffer, O_BUFLEN, &log);
     first_error->report(&sbs);
     first_error->_current_step = 0;         // reset current_step
     first_error->_current_step_info = "";   // reset current_step string

so the second time through we will call report and _current_step should 
indicate where to start executing from.

> This all happens in VMError::report_and_die:
> -> first error ? anchor VMError object in a static variable and execute
> VMError::report()
> -> secondary error?
>     -> different thread? just sleep forever
>     -> same thread? new frame, re-enter VMError::report(). Once done, abort.
>
> I always found that rather neat, but in fact that is not our invention
> but Sun's :) Anyway, my fix does not change this behaviour for better or
> worse, it only makes it usable for more cases.
>
>     Because this puts us in undefined territory according to POSIX:
>
>     "The behavior of a process is undefined after it returns normally
>     from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or
>     SIGSEGV signal that was not generated by kill(), sigqueue(), or
>     raise()."
>
> true, but we dont return...
>
>     On top of that you also have the issue that error reporting does a
>     whole bunch of things that are not async-signal-safe so we can
>     easily encounter hangs or aborts.
>
>     But we're dying anyway so I guess none of this really matters. If
>     re-enabling these signals allows error reporting to progress further
>     in some cases then that is a win.
>
>
> Actually, this covers a lot of cases, mostly because SIGSEGV during
> error reporting is common, so if the original error was not SIGSEGV, but
> e.g. SIGILL, this would always result in broken hs-err files.
>
> The back story is that at SAP, we rely heavily on the hs-err files. They
> are our main tool for support, because working with cores is often not
> feasible. So, we put a lot of work in making error reporting reliable
> across all platforms. This is also covered by many tests which crash the
> VM in exciting ways and check the hs-err files for completeness.

OK. Modulo the cpu specific SIGILL part everything else seems fine.

Thanks,
David

> Kind Regards, Thomas
>
>     Cheers,
>     David
>
>
>     On 26/11/2014 12:12 AM, Thomas Stüfe wrote:
>
>         Hi all,
>
>         I'd like to contribute a fix to error handling to improve
>         stability of
>         error reporting.
>
>
>         Bug Report:
>         https://bugs.openjdk.java.net/__browse/JDK-8065895
>         <https://bugs.openjdk.java.net/browse/JDK-8065895>
>
>
>         Webrev:
>         http://cr.openjdk.java.net/~__stuefe/webrevs/8065895/webrev.__00/ <http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.00/>
>
>
>         Problem:
>
>         When a synchronous error signal happens during error reporting,
>         and the
>         signal is different from the original signal which triggered error
>         reporting, VM may die or hang (depends on platform). This causes
>         empty or
>         almost-empty hs-err files.
>
>         Example: we first crash with a SIGILL (e.g in compiled code), then a
>         SIGSEGV happens when printing stack trace.
>
>         Secondary error handling should catch the SIGSEGV and continue error
>         reporting with the next step. But that does not work in this case.
>
>         Causes:
>             - hotspot blocks all signals when installing signal
>         handlers. Within the
>         secondary signal handler, only the original signal gets
>         unblocked, the rest
>         remained blocked. If another synchronous error signal happens,
>         it is still
>         blocked. If the second signal is a synchronous signal, the OS would
>         terminate the process right away because there is no way to defer
>         synchronous error signals.
>             - when installing signal handlers for secondary error
>         handling, only
>         signal handlers for SIGBUS and SIGSEGV were added; but more
>         signals may
>         happen during error handling (we saw SIGTRAP, SIGILL, ..etc).
>
>         Fix:
>         secondary signal handler is installed for all synchronous error
>         signals
>         (which is now a list and easily expandable in vmError_<os>.cpp).
>         All those
>         signals are unblocked.
>
>         In order to test the fix, some test code was added too:
>
>         a) debug.cpp: changed "test_error_handler()" to a more generic
>         "controlled_crash(int how)", which can be called at arbitrary
>         places, not
>         only at initialization time. "test_error_handler()" still exists
>         and just
>         calls "controlled_crash(__ErrorHandlerTest)", so its behaviour
>         did not change.
>
>         b) expand controlled_crash():
>             - added option 14, a guaranteed crash with a SIGSEGV at a
>         predefined
>         address, which is printed out and can later be tested against.
>         Note that I
>         realize that this is a bit redundant to option 12 or 13, but the
>         crash is
>         guaranteed and it crashes with a not-null address which should
>         turn up in
>         hs-err file (to check that hs-err file is correct).
>             - added option 15, a guaranteed crash with a SIGILL at a
>         predefined
>         instruction address. Here, the point is to get a real-world
>         SIGILL (not
>         just raising it) at a not-null known pc.
>
>         c) Add a parameter "-XX:__TestCrashDuringErrorHandler=<__n>",
>         which works the
>         same as "-XX:ErrorHandlerTest=<n>". This parameter is used to
>         trigger
>         controlled crashes inside the error handler. That way secondary
>         error
>         handling can be tested.
>
>         (a)-(c) allow us to test the fixes manually, for example:
>
>         java -XX:ErrorHandlerTest=15  -XX:__TestCrashDuringErrorHandler=14
>
>         causes a SIGILL during initialization, and a secondary SIGSEGV
>         inside error
>         handling. This demonstrates the effect of the bug. Without the
>         fix, the VM
>         will abort right away without finishing the hs-err file.
>
>         --
>
>         I am in the process of writing some JTreg Tests, but I would
>         like to put
>         those into a separate change. This is because there are more
>         fixes to error
>         reporting in our pipeline and I'd like to bundle the jtreg tests
>         in one
>         change.
>
>         Kind Regards,
>
>         Thomas Stuefe
>
>