RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
David Holmes
david.holmes at oracle.com
Wed Nov 26 09:31:42 UTC 2014
Hi Thomas,
On 26/11/2014 5:06 PM, Thomas Stüfe wrote:
> Hi David,
>
> thanks for looking at this. Here is the updated webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.01/
>
> See my comments below.
>
> On Wed, Nov 26, 2014 at 2:29 AM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
> Hi Thomas,
>
> A few quick comments as I need to think more about this:
>
> - On Solaris we use the UI thread API thr_* not pthreads
>
>
> Fixed, now I use thr_sigsetmask() (though both sigprocmask and
> pthread_sigmask seemed to work too)
Thanks. They are interchangeable semantically but for consistency I
prefer not to mix them.
> - In debug.cpp for the SIGILL can you define the all zero case as a
> default so we only need to add platform specific definitions when
> all zeroes doesn't work. I really hate seeing all that CPU selection
> in shared code. :(
>
>
> Agreed and fixed, moved the CPU-specific sections into CPU-specific files.
I'd really like to see a way to share the all-zeroes case so that we
don't need to add platform specific code unnecessarily.
> - Style nit: please use i++ rather than i ++
>
>
> Fixed.
>
> Aside: we should eradicate the use of sigprocmask and replace with
> the thread specific version.
>
>
> Agree. Though I never saw any errors stemming from the use of
> sigprocmask(). According to POSIX, sigprocmask() is undefined in
> multithreaded environment, and I guess most OSes just default to
> pthread_sigmask.
Yes "probably" works okay but I hate to see us using something with
undefined semantics. That's future clean up though.
> Getting back to the "thinking more about this" ... If a synchronous
> signal is blocked at the time it is generated then it should remain
> pending on the thread (POSIX spec) but that doesn't tell us what the
> thread will then do - retry the faulting instruction? Become
> unschedulable? So I can easily imagine that a hang or process
> termination may result.
>
>
> This is exactly what happens, but it is actually covered by POSIX, see
> doc on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or
> SIGBUS signals are generated while they are blocked, the result is
> undefined, unless the signal was generated by the /kill/()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
> function, the /sigqueue/()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>
> function, or the /raise/()
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
> function."
Thanks - I managed to miss that part even though I found the other part
about the signal handling function returning. :(
> In reality, process usually aborts abnormally with the default action
> for the signal, e.g. printing out "Illegal Instruction". On MacOS, we
> hang (until the Watcherthread finally kills the VM). On old AIXes, we
> die without a trace.
>
> This also can be easily tried out by removing SIGILL from the list of
> signals in vmError_<os>.cpp and executing:
>
> java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>
> which will crash first with a SIGSEGV, then in error handling with a
> secondary SIGILL. This will interrupt error reporting and kill or hang
> the process.
>
>
> In that sense unblocking those signals whilst handling the initial
> signal may well allow the error reporting process to continue
> further. But I'm unclear exactly how this plays out:
>
> - synchronous signal encountered
> - crash_handler invoked
>
> - VMError::report_and_die executes
> - secondary signal encountered
>
> - crash_handler invoked again
>
>
> almost: not again, different signal handler now. First signal was
> handled by "JVM_handle_<os>_signal()"
Ah missed that - thanks - not that it makes much difference :)
> - VMError::report_and_die executes again and sees the recursion and
> returns (ignoring abort due to excessive recursive errors)
>
>
> No..
>
> Is that right? So we actually return from the crash_handler?
>
>
> Oh, but we dont return. VMError::report_and_die() will just create a new
> frame and re-execute VMError::report() of the first VMError object.
> Which then will continue with the next STEP. We never return, for each
> secondary error signal a new frame is created.
I had trouble tracing through exactly what might happen on the recursive
call to report_and_die. I see know that report comes from:
staticBufferStream sbs(buffer, O_BUFLEN, &log);
first_error->report(&sbs);
first_error->_current_step = 0; // reset current_step
first_error->_current_step_info = ""; // reset current_step string
so the second time through we will call report and _current_step should
indicate where to start executing from.
> This all happens in VMError::report_and_die:
> -> first error ? anchor VMError object in a static variable and execute
> VMError::report()
> -> secondary error?
> -> different thread? just sleep forever
> -> same thread? new frame, re-enter VMError::report(). Once done, abort.
>
> I always found that rather neat, but in fact that is not our invention
> but Sun's :) Anyway, my fix does not change this behaviour for better or
> worse, it only makes it usable for more cases.
>
> Because this puts us in undefined territory according to POSIX:
>
> "The behavior of a process is undefined after it returns normally
> from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or
> SIGSEGV signal that was not generated by kill(), sigqueue(), or
> raise()."
>
> true, but we dont return...
>
> On top of that you also have the issue that error reporting does a
> whole bunch of things that are not async-signal-safe so we can
> easily encounter hangs or aborts.
>
> But we're dying anyway so I guess none of this really matters. If
> re-enabling these signals allows error reporting to progress further
> in some cases then that is a win.
>
>
> Actually, this covers a lot of cases, mostly because SIGSEGV during
> error reporting is common, so if the original error was not SIGSEGV, but
> e.g. SIGILL, this would always result in broken hs-err files.
>
> The back story is that at SAP, we rely heavily on the hs-err files. They
> are our main tool for support, because working with cores is often not
> feasible. So, we put a lot of work in making error reporting reliable
> across all platforms. This is also covered by many tests which crash the
> VM in exciting ways and check the hs-err files for completeness.
OK. Modulo the cpu specific SIGILL part everything else seems fine.
Thanks,
David
> Kind Regards, Thomas
>
> Cheers,
> David
>
>
> On 26/11/2014 12:12 AM, Thomas Stüfe wrote:
>
> Hi all,
>
> I'd like to contribute a fix to error handling to improve
> stability of
> error reporting.
>
>
> Bug Report:
> https://bugs.openjdk.java.net/__browse/JDK-8065895
> <https://bugs.openjdk.java.net/browse/JDK-8065895>
>
>
> Webrev:
> http://cr.openjdk.java.net/~__stuefe/webrevs/8065895/webrev.__00/ <http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.00/>
>
>
> Problem:
>
> When a synchronous error signal happens during error reporting,
> and the
> signal is different from the original signal which triggered error
> reporting, VM may die or hang (depends on platform). This causes
> empty or
> almost-empty hs-err files.
>
> Example: we first crash with a SIGILL (e.g in compiled code), then a
> SIGSEGV happens when printing stack trace.
>
> Secondary error handling should catch the SIGSEGV and continue error
> reporting with the next step. But that does not work in this case.
>
> Causes:
> - hotspot blocks all signals when installing signal
> handlers. Within the
> secondary signal handler, only the original signal gets
> unblocked, the rest
> remained blocked. If another synchronous error signal happens,
> it is still
> blocked. If the second signal is a synchronous signal, the OS would
> terminate the process right away because there is no way to defer
> synchronous error signals.
> - when installing signal handlers for secondary error
> handling, only
> signal handlers for SIGBUS and SIGSEGV were added; but more
> signals may
> happen during error handling (we saw SIGTRAP, SIGILL, ..etc).
>
> Fix:
> secondary signal handler is installed for all synchronous error
> signals
> (which is now a list and easily expandable in vmError_<os>.cpp).
> All those
> signals are unblocked.
>
> In order to test the fix, some test code was added too:
>
> a) debug.cpp: changed "test_error_handler()" to a more generic
> "controlled_crash(int how)", which can be called at arbitrary
> places, not
> only at initialization time. "test_error_handler()" still exists
> and just
> calls "controlled_crash(__ErrorHandlerTest)", so its behaviour
> did not change.
>
> b) expand controlled_crash():
> - added option 14, a guaranteed crash with a SIGSEGV at a
> predefined
> address, which is printed out and can later be tested against.
> Note that I
> realize that this is a bit redundant to option 12 or 13, but the
> crash is
> guaranteed and it crashes with a not-null address which should
> turn up in
> hs-err file (to check that hs-err file is correct).
> - added option 15, a guaranteed crash with a SIGILL at a
> predefined
> instruction address. Here, the point is to get a real-world
> SIGILL (not
> just raising it) at a not-null known pc.
>
> c) Add a parameter "-XX:__TestCrashDuringErrorHandler=<__n>",
> which works the
> same as "-XX:ErrorHandlerTest=<n>". This parameter is used to
> trigger
> controlled crashes inside the error handler. That way secondary
> error
> handling can be tested.
>
> (a)-(c) allow us to test the fixes manually, for example:
>
> java -XX:ErrorHandlerTest=15 -XX:__TestCrashDuringErrorHandler=14
>
> causes a SIGILL during initialization, and a secondary SIGSEGV
> inside error
> handling. This demonstrates the effect of the bug. Without the
> fix, the VM
> will abort right away without finishing the hs-err file.
>
> --
>
> I am in the process of writing some JTreg Tests, but I would
> like to put
> those into a separate change. This is because there are more
> fixes to error
> reporting in our pipeline and I'd like to bundle the jtreg tests
> in one
> change.
>
> Kind Regards,
>
> Thomas Stuefe
>
>
More information about the hotspot-runtime-dev
mailing list