RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
David Holmes
david.holmes at oracle.com
Thu Nov 27 00:49:06 UTC 2014
On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
> Hi David,
>
> here you go: http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/
>
> Reverted SIGILL-generating function back to its original form, plus the
> folding of the 000 case.
Thanks Thomas! While we are awaiting a second reviewer I will test this
out internally. It may take a day or two sorry.
David
> I only can guess what your closed platforms are, but if it is ARM, I
> believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.
>
> Kind regards, Thomas
>
>
> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
> On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>
> Hi David,
> ...
>
> - In debug.cpp for the SIGILL can you define the
> all zero
> case as a
> default so we only need to add platform specific
> definitions when
> all zeroes doesn't work. I really hate seeing all
> that CPU
> selection
> in shared code. :(
>
>
> Agreed and fixed, moved the CPU-specific sections into
> CPU-specific files.
>
>
> I'd really like to see a way to share the all-zeroes case
> so that we
> don't need to add platform specific code unnecessarily.
>
>
> sooo.. back to the original code then, just with the #ifdef,
> just with
> the zero-cases all folded in into the #else path? Or do you prefer
> something else?
>
>
> Elsewhere there is a pattern of defining per-platform values that
> can override the shared definition. eg:
>
> #ifndef HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
> Foo XXX = ...; //shared/default initalization
> #endif
>
> but this assumes a platform specific header has already been
> included that can do:
>
> #define HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
> Foo XXX = ... ; // platform specific initialization
>
> But that is not the case for debug.hpp.
>
> So I guess folding the zero-case into the else path is the best we
> can do. However I'm assuming the zero case will work for our
> internal platforms ... if it doesn't then we'd have to pollute the
> shared code with info for the closed platforms. :(
>
> David
> -----
>
>
> - Style nit: please use i++ rather than i ++
>
>
> Fixed.
>
> Aside: we should eradicate the use of sigprocmask and
> replace with
> the thread specific version.
>
>
> Agree. Though I never saw any errors stemming from the
> use of
> sigprocmask(). According to POSIX, sigprocmask() is
> undefined in
> multithreaded environment, and I guess most OSes just
> default to
> pthread_sigmask.
>
>
> Yes "probably" works okay but I hate to see us using
> something with
> undefined semantics. That's future clean up though.
>
>
> We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask()
> instead of
> sigprocmask. Works fine. We can port this to the OpenJDK.
>
> Getting back to the "thinking more about this" ...
> If a
> synchronous
> signal is blocked at the time it is generated then it
> should remain
> pending on the thread (POSIX spec) but that
> doesn't tell us
> what the
> thread will then do - retry the faulting
> instruction? Become
> unschedulable? So I can easily imagine that a hang
> or process
> termination may result.
>
>
> This is exactly what happens, but it is actually covered by
> POSIX, see
> doc on pthread_sigmask: "If any of the SIGFPE, SIGILL,
> SIGSEGV, or
> SIGBUS signals are generated while they are blocked,
> the result is
> undefined, unless the signal was generated by the /kill/()
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>
> function, the /sigqueue/()
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>
>
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>>>
> function, or the /raise/()
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>
>
>
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>
> function."
>
>
> Thanks - I managed to miss that part even though I found
> the other
> part about the signal handling function returning. :(
>
>
> It is well hidden, I found it by accident :) To me it looks like
> they
> kept it intentionally vague, to not block platforms where those
> signals
> could be somehow dealt with automatically? Hard to see though
> how this
> would work.
>
>
>
> In reality, process usually aborts abnormally with the
> default
> action
> for the signal, e.g. printing out "Illegal Instruction". On
> MacOS, we
> hang (until the Watcherthread finally kills the VM). On old
> AIXes, we
> die without a trace.
>
> This also can be easily tried out by removing SIGILL
> from the
> list of
> signals in vmError_<os>.cpp and executing:
>
> java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>
> which will crash first with a SIGSEGV, then in error
> handling with a
> secondary SIGILL. This will interrupt error reporting
> and kill
> or hang
> the process.
>
>
> In that sense unblocking those signals whilst
> handling the
> initial
> signal may well allow the error reporting process
> to continue
> further. But I'm unclear exactly how this plays out:
>
> - synchronous signal encountered
> - crash_handler invoked
>
> - VMError::report_and_die executes
> - secondary signal encountered
>
> - crash_handler invoked again
>
>
> almost: not again, different signal handler now. First
> signal was
> handled by "JVM_handle_<os>_signal()"
>
>
> Ah missed that - thanks - not that it makes much difference :)
>
>
> I just like nitpicking :)
>
> - VMError::report_and_die executes again and sees the
> recursion and
> returns (ignoring abort due to excessive recursive
> errors)
>
>
> No..
>
> Is that right? So we actually return from the
> crash_handler?
>
>
> Oh, but we dont return. VMError::report_and_die() will just
> create a new
> frame and re-execute VMError::report() of the first
> VMError object.
> Which then will continue with the next STEP. We never
> return,
> for each
> secondary error signal a new frame is created.
>
>
> I had trouble tracing through exactly what might happen on the
> recursive call to report_and_die. I see know that report
> comes from:
>
> staticBufferStream sbs(buffer, O_BUFLEN, &log);
> first_error->report(&sbs);
> first_error->_current_step = 0; // reset
> current_step
> first_error->_current_step_____info = ""; // reset
> current_step
>
> string
>
> so the second time through we will call report and
> _current_step
> should indicate where to start executing from.
>
>
> Exactly. There is also a catch, in that the stack usage goes up. Not
> endlessly, it is limited by the number of error reporting steps.
> The more stack VmError::report() does cost, the less well this
> works,
> especially in stack overflow scenarios.
>
> Which is why we extended SafeFetch and enabled it for the use in the
> error handler, which will be one of the the next patches I'd like to
> port to the OpenJDK, once this one is thru.
>
>
> This all happens in VMError::report_and_die:
> -> first error ? anchor VMError object in a static
> variable and
> execute
> VMError::report()
> -> secondary error?
> -> different thread? just sleep forever
> -> same thread? new frame, re-enter
> VMError::report(). Once
> done, abort.
>
> I always found that rather neat, but in fact that is
> not our
> invention
> but Sun's :) Anyway, my fix does not change this
> behaviour for
> better or
> worse, it only makes it usable for more cases.
>
> Because this puts us in undefined territory
> according to POSIX:
>
> "The behavior of a process is undefined after it
> returns
> normally
> from a signal-catching function for a SIGBUS, SIGFPE,
> SIGILL, or
> SIGSEGV signal that was not generated by kill(),
> sigqueue(), or
> raise()."
>
> true, but we dont return...
>
> On top of that you also have the issue that error
> reporting
> does a
> whole bunch of things that are not
> async-signal-safe so we can
> easily encounter hangs or aborts.
>
> But we're dying anyway so I guess none of this really
> matters. If
> re-enabling these signals allows error reporting to
> progress further
> in some cases then that is a win.
>
>
> Actually, this covers a lot of cases, mostly because
> SIGSEGV during
> error reporting is common, so if the original error was not
> SIGSEGV, but
> e.g. SIGILL, this would always result in broken hs-err
> files.
>
> The back story is that at SAP, we rely heavily on the
> hs-err
> files. They
> are our main tool for support, because working with
> cores is
> often not
> feasible. So, we put a lot of work in making error
> reporting
> reliable
> across all platforms. This is also covered by many
> tests which
> crash the
> VM in exciting ways and check the hs-err files for
> completeness.
>
>
> OK. Modulo the cpu specific SIGILL part everything else
> seems fine.
>
> Great. just tell me how you want that part.
>
> Kind regards, Thomas
>
> Thanks,
> David
>
>
More information about the hotspot-runtime-dev
mailing list