RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

David Holmes david.holmes at oracle.com
Thu Nov 27 00:49:06 UTC 2014


On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
> Hi David,
>
> here you go: http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/
>
> Reverted SIGILL-generating function back to its original form, plus the
> folding of the 000 case.

Thanks Thomas! While we are awaiting a second reviewer I will test this 
out internally. It may take a day or two sorry.

David

> I only can guess what your closed platforms are, but if it is ARM, I
> believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.
>
> Kind regards, Thomas
>
>
> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
>     On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>
>         Hi David,
>         ...
>
>                       - In debug.cpp for the SIGILL can you define the
>         all zero
>                  case as a
>                       default so we only need to add platform specific
>                  definitions when
>                       all zeroes doesn't work. I really hate seeing all
>         that CPU
>                  selection
>                       in shared code. :(
>
>
>                  Agreed and fixed, moved the CPU-specific sections into
>                  CPU-specific files.
>
>
>              I'd really like to see a way to share the all-zeroes case
>         so that we
>              don't need to add platform specific code unnecessarily.
>
>
>         sooo.. back to the original code then, just with the #ifdef,
>         just with
>         the zero-cases all folded in into the #else path? Or do you prefer
>         something else?
>
>
>     Elsewhere there is a pattern of defining per-platform values that
>     can override the shared definition. eg:
>
>     #ifndef HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>        Foo XXX = ...;  //shared/default initalization
>     #endif
>
>     but this assumes a platform specific header has already been
>     included that can do:
>
>     #define HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>     Foo XXX = ... ; // platform specific initialization
>
>     But that is not the case for debug.hpp.
>
>     So I guess folding the zero-case into the else path is the best we
>     can do. However I'm assuming the zero case will work for our
>     internal platforms ... if it doesn't then we'd have to pollute the
>     shared code with info for the closed platforms. :(
>
>     David
>     -----
>
>
>                       - Style nit: please use i++ rather than i ++
>
>
>                  Fixed.
>
>                       Aside: we should eradicate the use of sigprocmask and
>                  replace with
>                       the thread specific version.
>
>
>                  Agree. Though I never saw any errors stemming from the
>         use of
>                  sigprocmask(). According to POSIX, sigprocmask() is
>         undefined in
>                  multithreaded environment, and I guess most OSes just
>         default to
>                  pthread_sigmask.
>
>
>              Yes "probably" works okay but I hate to see us using
>         something with
>              undefined semantics. That's future clean up though.
>
>
>         We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask()
>         instead of
>         sigprocmask. Works fine. We can port this to the OpenJDK.
>
>                       Getting back to the "thinking more about this" ...
>         If a
>                  synchronous
>                       signal is blocked at the time it is generated then it
>                  should remain
>                       pending on the thread (POSIX spec) but that
>         doesn't tell us
>                  what the
>                       thread will then do - retry the faulting
>         instruction? Become
>                       unschedulable? So I can easily imagine that a hang
>         or process
>                       termination may result.
>
>
>                  This is exactly what happens, but it is actually covered by
>                  POSIX, see
>                  doc on pthread_sigmask: "If any of the SIGFPE, SIGILL,
>         SIGSEGV, or
>                  SIGBUS signals are generated while they are blocked,
>         the result is
>                  undefined, unless the signal was generated by the /kill/()
>
>         <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
>         <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>
>                  function, the /sigqueue/()
>
>         <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>
>
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html
>         <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>>>
>                  function, or the /raise/()
>
>         <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>
>
>
>         <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>         <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>
>                  function."
>
>
>              Thanks - I managed to miss that part even though I found
>         the other
>              part about the signal handling function returning. :(
>
>
>         It is well hidden, I found it by accident :) To me it looks like
>         they
>         kept it intentionally vague, to not block platforms where those
>         signals
>         could be somehow dealt with automatically? Hard to see though
>         how this
>         would work.
>
>
>
>                  In reality, process usually aborts abnormally with the
>         default
>                  action
>                  for the signal, e.g. printing out "Illegal Instruction". On
>                  MacOS, we
>                  hang (until the Watcherthread finally kills the VM). On old
>                  AIXes, we
>                  die without a trace.
>
>                  This also can be easily tried out by removing SIGILL
>         from the
>                  list of
>                  signals in vmError_<os>.cpp and executing:
>
>                  java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>
>                  which will crash first with a SIGSEGV, then in error
>         handling with a
>                  secondary SIGILL. This will interrupt error reporting
>         and kill
>                  or hang
>                  the process.
>
>
>                       In that sense unblocking those signals whilst
>         handling the
>                  initial
>                       signal may well allow the error reporting process
>         to continue
>                       further. But I'm unclear exactly how this plays out:
>
>                       - synchronous signal encountered
>                       - crash_handler invoked
>
>                       - VMError::report_and_die executes
>                       - secondary signal encountered
>
>                       - crash_handler invoked again
>
>
>                  almost: not again, different signal handler now. First
>         signal was
>                  handled by "JVM_handle_<os>_signal()"
>
>
>              Ah missed that - thanks - not that it makes much difference :)
>
>
>         I just like nitpicking :)
>
>                       - VMError::report_and_die executes again and sees the
>                  recursion and
>                       returns (ignoring abort due to excessive recursive
>         errors)
>
>
>                  No..
>
>                       Is that right? So we actually return from the
>         crash_handler?
>
>
>                  Oh, but we dont return. VMError::report_and_die() will just
>                  create a new
>                  frame and re-execute VMError::report() of the first
>         VMError object.
>                  Which then will continue with the next STEP. We never
>         return,
>                  for each
>                  secondary error signal a new frame is created.
>
>
>              I had trouble tracing through exactly what might happen on the
>              recursive call to report_and_die. I see know that report
>         comes from:
>
>                   staticBufferStream sbs(buffer, O_BUFLEN, &log);
>                   first_error->report(&sbs);
>                   first_error->_current_step = 0;         // reset
>         current_step
>                   first_error->_current_step_____info = "";   // reset
>         current_step
>
>              string
>
>              so the second time through we will call report and
>         _current_step
>              should indicate where to start executing from.
>
>
>         Exactly. There is also a catch, in that the stack usage goes up. Not
>         endlessly, it is limited by the number of error reporting steps.
>         The more stack VmError::report() does cost, the less well this
>         works,
>         especially in stack overflow scenarios.
>
>         Which is why we extended SafeFetch and enabled it for the use in the
>         error handler, which will be one of the the next patches I'd like to
>         port to the OpenJDK, once this one is thru.
>
>
>                  This all happens in VMError::report_and_die:
>                  -> first error ? anchor VMError object in a static
>         variable and
>                  execute
>                  VMError::report()
>                  -> secondary error?
>                       -> different thread? just sleep forever
>                       -> same thread? new frame, re-enter
>         VMError::report(). Once
>                  done, abort.
>
>                  I always found that rather neat, but in fact that is
>         not our
>                  invention
>                  but Sun's :) Anyway, my fix does not change this
>         behaviour for
>                  better or
>                  worse, it only makes it usable for more cases.
>
>                       Because this puts us in undefined territory
>         according to POSIX:
>
>                       "The behavior of a process is undefined after it
>         returns
>                  normally
>                       from a signal-catching function for a SIGBUS, SIGFPE,
>                  SIGILL, or
>                       SIGSEGV signal that was not generated by kill(),
>         sigqueue(), or
>                       raise()."
>
>                  true, but we dont return...
>
>                       On top of that you also have the issue that error
>         reporting
>                  does a
>                       whole bunch of things that are not
>         async-signal-safe so we can
>                       easily encounter hangs or aborts.
>
>                       But we're dying anyway so I guess none of this really
>                  matters. If
>                       re-enabling these signals allows error reporting to
>                  progress further
>                       in some cases then that is a win.
>
>
>                  Actually, this covers a lot of cases, mostly because
>         SIGSEGV during
>                  error reporting is common, so if the original error was not
>                  SIGSEGV, but
>                  e.g. SIGILL, this would always result in broken hs-err
>         files.
>
>                  The back story is that at SAP, we rely heavily on the
>         hs-err
>                  files. They
>                  are our main tool for support, because working with
>         cores is
>                  often not
>                  feasible. So, we put a lot of work in making error
>         reporting
>                  reliable
>                  across all platforms. This is also covered by many
>         tests which
>                  crash the
>                  VM in exciting ways and check the hs-err files for
>         completeness.
>
>
>              OK. Modulo the cpu specific SIGILL part everything else
>         seems fine.
>
>         Great. just tell me how you want that part.
>
>         Kind regards, Thomas
>
>              Thanks,
>              David
>
>


More information about the hotspot-runtime-dev mailing list