RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Wed Nov 26 13:33:27 UTC 2014

Hi David,

here you go: http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/

Reverted SIGILL-generating function back to its original form, plus the
folding of the 000 case.

I only can guess what your closed platforms are, but if it is ARM, I
believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.

Kind regards, Thomas

On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com>
wrote:

> On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>
>> Hi David,
>> ...
>>
>>              - In debug.cpp for the SIGILL can you define the  all zero
>>         case as a
>>              default so we only need to add platform specific
>>         definitions when
>>              all zeroes doesn't work. I really hate seeing all that CPU
>>         selection
>>              in shared code. :(
>>
>>
>>         Agreed and fixed, moved the CPU-specific sections into
>>         CPU-specific files.
>>
>>
>>     I'd really like to see a way to share the all-zeroes case so that we
>>     don't need to add platform specific code unnecessarily.
>>
>>
>> sooo.. back to the original code then, just with the #ifdef, just with
>> the zero-cases all folded in into the #else path? Or do you prefer
>> something else?
>>
>
> Elsewhere there is a pattern of defining per-platform values that can
> override the shared definition. eg:
>
> #ifndef HAS_SPECIAL_PLATFORM_VALUE_FOR_XXXX
>   Foo XXX = ...;  //shared/default initalization
> #endif
>
> but this assumes a platform specific header has already been included that
> can do:
>
> #define HAS_SPECIAL_PLATFORM_VALUE_FOR_XXXX
> Foo XXX = ... ; // platform specific initialization
>
> But that is not the case for debug.hpp.
>
> So I guess folding the zero-case into the else path is the best we can do.
> However I'm assuming the zero case will work for our internal platforms ...
> if it doesn't then we'd have to pollute the shared code with info for the
> closed platforms. :(
>
> David
> -----
>
>
>>              - Style nit: please use i++ rather than i ++
>>
>>
>>         Fixed.
>>
>>              Aside: we should eradicate the use of sigprocmask and
>>         replace with
>>              the thread specific version.
>>
>>
>>         Agree. Though I never saw any errors stemming from the use of
>>         sigprocmask(). According to POSIX, sigprocmask() is undefined in
>>         multithreaded environment, and I guess most OSes just default to
>>         pthread_sigmask.
>>
>>
>>     Yes "probably" works okay but I hate to see us using something with
>>     undefined semantics. That's future clean up though.
>>
>>
>> We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask() instead of
>> sigprocmask. Works fine. We can port this to the OpenJDK.
>>
>>              Getting back to the "thinking more about this" ... If a
>>         synchronous
>>              signal is blocked at the time it is generated then it
>>         should remain
>>              pending on the thread (POSIX spec) but that doesn't tell us
>>         what the
>>              thread will then do - retry the faulting instruction? Become
>>              unschedulable? So I can easily imagine that a hang or process
>>              termination may result.
>>
>>
>>         This is exactly what happens, but it is actually covered by
>>         POSIX, see
>>         doc on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or
>>         SIGBUS signals are generated while they are blocked, the result is
>>         undefined, unless the signal was generated by the /kill/()
>>         <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html
>>         <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/kill.html>>
>>         function, the /sigqueue/()
>>         <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html
>>         <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/sigqueue.html>>
>>         function, or the /raise/()
>>         <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html
>>
>>         <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/raise.html>>
>>         function."
>>
>>
>>     Thanks - I managed to miss that part even though I found the other
>>     part about the signal handling function returning. :(
>>
>>
>> It is well hidden, I found it by accident :) To me it looks like they
>> kept it intentionally vague, to not block platforms where those signals
>> could be somehow dealt with automatically? Hard to see though how this
>> would work.
>>
>>
>>
>>         In reality, process usually aborts abnormally with the default
>>         action
>>         for the signal, e.g. printing out "Illegal Instruction". On
>>         MacOS, we
>>         hang (until the Watcherthread finally kills the VM). On old
>>         AIXes, we
>>         die without a trace.
>>
>>         This also can be easily tried out by removing SIGILL from the
>>         list of
>>         signals in vmError_<os>.cpp and executing:
>>
>>         java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>>
>>         which will crash first with a SIGSEGV, then in error handling
>> with a
>>         secondary SIGILL. This will interrupt error reporting and kill
>>         or hang
>>         the process.
>>
>>
>>              In that sense unblocking those signals whilst handling the
>>         initial
>>              signal may well allow the error reporting process to continue
>>              further. But I'm unclear exactly how this plays out:
>>
>>              - synchronous signal encountered
>>              - crash_handler invoked
>>
>>              - VMError::report_and_die executes
>>              - secondary signal encountered
>>
>>              - crash_handler invoked again
>>
>>
>>         almost: not again, different signal handler now. First signal was
>>         handled by "JVM_handle_<os>_signal()"
>>
>>
>>     Ah missed that - thanks - not that it makes much difference :)
>>
>>
>> I just like nitpicking :)
>>
>>              - VMError::report_and_die executes again and sees the
>>         recursion and
>>              returns (ignoring abort due to excessive recursive errors)
>>
>>
>>         No..
>>
>>              Is that right? So we actually return from the crash_handler?
>>
>>
>>         Oh, but we dont return. VMError::report_and_die() will just
>>         create a new
>>         frame and re-execute VMError::report() of the first VMError
>> object.
>>         Which then will continue with the next STEP. We never return,
>>         for each
>>         secondary error signal a new frame is created.
>>
>>
>>     I had trouble tracing through exactly what might happen on the
>>     recursive call to report_and_die. I see know that report comes from:
>>
>>          staticBufferStream sbs(buffer, O_BUFLEN, &log);
>>          first_error->report(&sbs);
>>          first_error->_current_step = 0;         // reset current_step
>>          first_error->_current_step___info = "";   // reset current_step
>>
>>     string
>>
>>     so the second time through we will call report and _current_step
>>     should indicate where to start executing from.
>>
>>
>> Exactly. There is also a catch, in that the stack usage goes up. Not
>> endlessly, it is limited by the number of error reporting steps.
>> The more stack VmError::report() does cost, the less well this works,
>> especially in stack overflow scenarios.
>>
>> Which is why we extended SafeFetch and enabled it for the use in the
>> error handler, which will be one of the the next patches I'd like to
>> port to the OpenJDK, once this one is thru.
>>
>>
>>         This all happens in VMError::report_and_die:
>>         -> first error ? anchor VMError object in a static variable and
>>         execute
>>         VMError::report()
>>         -> secondary error?
>>              -> different thread? just sleep forever
>>              -> same thread? new frame, re-enter VMError::report(). Once
>>         done, abort.
>>
>>         I always found that rather neat, but in fact that is not our
>>         invention
>>         but Sun's :) Anyway, my fix does not change this behaviour for
>>         better or
>>         worse, it only makes it usable for more cases.
>>
>>              Because this puts us in undefined territory according to
>> POSIX:
>>
>>              "The behavior of a process is undefined after it returns
>>         normally
>>              from a signal-catching function for a SIGBUS, SIGFPE,
>>         SIGILL, or
>>              SIGSEGV signal that was not generated by kill(), sigqueue(),
>> or
>>              raise()."
>>
>>         true, but we dont return...
>>
>>              On top of that you also have the issue that error reporting
>>         does a
>>              whole bunch of things that are not async-signal-safe so we
>> can
>>              easily encounter hangs or aborts.
>>
>>              But we're dying anyway so I guess none of this really
>>         matters. If
>>              re-enabling these signals allows error reporting to
>>         progress further
>>              in some cases then that is a win.
>>
>>
>>         Actually, this covers a lot of cases, mostly because SIGSEGV
>> during
>>         error reporting is common, so if the original error was not
>>         SIGSEGV, but
>>         e.g. SIGILL, this would always result in broken hs-err files.
>>
>>         The back story is that at SAP, we rely heavily on the hs-err
>>         files. They
>>         are our main tool for support, because working with cores is
>>         often not
>>         feasible. So, we put a lot of work in making error reporting
>>         reliable
>>         across all platforms. This is also covered by many tests which
>>         crash the
>>         VM in exciting ways and check the hs-err files for completeness.
>>
>>
>>     OK. Modulo the cpu specific SIGILL part everything else seems fine.
>>
>> Great. just tell me how you want that part.
>>
>> Kind regards, Thomas
>>
>>     Thanks,
>>     David
>>
>>