RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Wed Nov 26 12:02:38 UTC 2014

On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
> Hi David,
> ...
>
>              - In debug.cpp for the SIGILL can you define the  all zero
>         case as a
>              default so we only need to add platform specific
>         definitions when
>              all zeroes doesn't work. I really hate seeing all that CPU
>         selection
>              in shared code. :(
>
>
>         Agreed and fixed, moved the CPU-specific sections into
>         CPU-specific files.
>
>
>     I'd really like to see a way to share the all-zeroes case so that we
>     don't need to add platform specific code unnecessarily.
>
>
> sooo.. back to the original code then, just with the #ifdef, just with
> the zero-cases all folded in into the #else path? Or do you prefer
> something else?

Elsewhere there is a pattern of defining per-platform values that can 
override the shared definition. eg:

#ifndef HAS_SPECIAL_PLATFORM_VALUE_FOR_XXXX
   Foo XXX = ...;  //shared/default initalization
#endif

but this assumes a platform specific header has already been included 
that can do:

#define HAS_SPECIAL_PLATFORM_VALUE_FOR_XXXX
Foo XXX = ... ; // platform specific initialization

But that is not the case for debug.hpp.

So I guess folding the zero-case into the else path is the best we can 
do. However I'm assuming the zero case will work for our internal 
platforms ... if it doesn't then we'd have to pollute the shared code 
with info for the closed platforms. :(

David
-----

>
>              - Style nit: please use i++ rather than i ++
>
>
>         Fixed.
>
>              Aside: we should eradicate the use of sigprocmask and
>         replace with
>              the thread specific version.
>
>
>         Agree. Though I never saw any errors stemming from the use of
>         sigprocmask(). According to POSIX, sigprocmask() is undefined in
>         multithreaded environment, and I guess most OSes just default to
>         pthread_sigmask.
>
>
>     Yes "probably" works okay but I hate to see us using something with
>     undefined semantics. That's future clean up though.
>
>
> We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask() instead of
> sigprocmask. Works fine. We can port this to the OpenJDK.
>
>              Getting back to the "thinking more about this" ... If a
>         synchronous
>              signal is blocked at the time it is generated then it
>         should remain
>              pending on the thread (POSIX spec) but that doesn't tell us
>         what the
>              thread will then do - retry the faulting instruction? Become
>              unschedulable? So I can easily imagine that a hang or process
>              termination may result.
>
>
>         This is exactly what happens, but it is actually covered by
>         POSIX, see
>         doc on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or
>         SIGBUS signals are generated while they are blocked, the result is
>         undefined, unless the signal was generated by the /kill/()
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
>         <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>
>         function, the /sigqueue/()
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html
>         <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>>
>         function, or the /raise/()
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>         <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>
>         function."
>
>
>     Thanks - I managed to miss that part even though I found the other
>     part about the signal handling function returning. :(
>
>
> It is well hidden, I found it by accident :) To me it looks like they
> kept it intentionally vague, to not block platforms where those signals
> could be somehow dealt with automatically? Hard to see though how this
> would work.
>
>
>
>         In reality, process usually aborts abnormally with the default
>         action
>         for the signal, e.g. printing out "Illegal Instruction". On
>         MacOS, we
>         hang (until the Watcherthread finally kills the VM). On old
>         AIXes, we
>         die without a trace.
>
>         This also can be easily tried out by removing SIGILL from the
>         list of
>         signals in vmError_<os>.cpp and executing:
>
>         java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>
>         which will crash first with a SIGSEGV, then in error handling with a
>         secondary SIGILL. This will interrupt error reporting and kill
>         or hang
>         the process.
>
>
>              In that sense unblocking those signals whilst handling the
>         initial
>              signal may well allow the error reporting process to continue
>              further. But I'm unclear exactly how this plays out:
>
>              - synchronous signal encountered
>              - crash_handler invoked
>
>              - VMError::report_and_die executes
>              - secondary signal encountered
>
>              - crash_handler invoked again
>
>
>         almost: not again, different signal handler now. First signal was
>         handled by "JVM_handle_<os>_signal()"
>
>
>     Ah missed that - thanks - not that it makes much difference :)
>
>
> I just like nitpicking :)
>
>              - VMError::report_and_die executes again and sees the
>         recursion and
>              returns (ignoring abort due to excessive recursive errors)
>
>
>         No..
>
>              Is that right? So we actually return from the crash_handler?
>
>
>         Oh, but we dont return. VMError::report_and_die() will just
>         create a new
>         frame and re-execute VMError::report() of the first VMError object.
>         Which then will continue with the next STEP. We never return,
>         for each
>         secondary error signal a new frame is created.
>
>
>     I had trouble tracing through exactly what might happen on the
>     recursive call to report_and_die. I see know that report comes from:
>
>          staticBufferStream sbs(buffer, O_BUFLEN, &log);
>          first_error->report(&sbs);
>          first_error->_current_step = 0;         // reset current_step
>          first_error->_current_step___info = "";   // reset current_step
>     string
>
>     so the second time through we will call report and _current_step
>     should indicate where to start executing from.
>
>
> Exactly. There is also a catch, in that the stack usage goes up. Not
> endlessly, it is limited by the number of error reporting steps.
> The more stack VmError::report() does cost, the less well this works,
> especially in stack overflow scenarios.
>
> Which is why we extended SafeFetch and enabled it for the use in the
> error handler, which will be one of the the next patches I'd like to
> port to the OpenJDK, once this one is thru.
>
>
>         This all happens in VMError::report_and_die:
>         -> first error ? anchor VMError object in a static variable and
>         execute
>         VMError::report()
>         -> secondary error?
>              -> different thread? just sleep forever
>              -> same thread? new frame, re-enter VMError::report(). Once
>         done, abort.
>
>         I always found that rather neat, but in fact that is not our
>         invention
>         but Sun's :) Anyway, my fix does not change this behaviour for
>         better or
>         worse, it only makes it usable for more cases.
>
>              Because this puts us in undefined territory according to POSIX:
>
>              "The behavior of a process is undefined after it returns
>         normally
>              from a signal-catching function for a SIGBUS, SIGFPE,
>         SIGILL, or
>              SIGSEGV signal that was not generated by kill(), sigqueue(), or
>              raise()."
>
>         true, but we dont return...
>
>              On top of that you also have the issue that error reporting
>         does a
>              whole bunch of things that are not async-signal-safe so we can
>              easily encounter hangs or aborts.
>
>              But we're dying anyway so I guess none of this really
>         matters. If
>              re-enabling these signals allows error reporting to
>         progress further
>              in some cases then that is a win.
>
>
>         Actually, this covers a lot of cases, mostly because SIGSEGV during
>         error reporting is common, so if the original error was not
>         SIGSEGV, but
>         e.g. SIGILL, this would always result in broken hs-err files.
>
>         The back story is that at SAP, we rely heavily on the hs-err
>         files. They
>         are our main tool for support, because working with cores is
>         often not
>         feasible. So, we put a lot of work in making error reporting
>         reliable
>         across all platforms. This is also covered by many tests which
>         crash the
>         VM in exciting ways and check the hs-err files for completeness.
>
>
>     OK. Modulo the cpu specific SIGILL part everything else seems fine.
>
> Great. just tell me how you want that part.
>
> Kind regards, Thomas
>
>     Thanks,
>     David
>