RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Thu Nov 27 09:01:05 UTC 2014

On 27/11/2014 5:36 PM, Thomas Stüfe wrote:
> Unfortunately, I cannot test it, as I have no ARM environment. The best
> I can come up with without testing is this:
> http://stackoverflow.com/questions/16081618/programmatically-cause-undefined-instruction-exception

The issue is how to handle this? Put ifdefs for ARM in the open code? 
Revert to your per-platform solution? Some other variation? Or do we 
just not care if we can't trigger SIGILL on ARM? Though I'd like to hear 
from the AARCH64 folk too.

David

> Kind regards, Thomas
>
> On Thu, Nov 27, 2014 at 6:18 AM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
>     On 27/11/2014 10:49 AM, David Holmes wrote:
>
>         On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>
>             Hi David,
>
>             here you go:
>             http://cr.openjdk.java.net/~__stuefe/webrevs/8065895/webrev.__02/
>             <http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/>
>
>             Reverted SIGILL-generating function back to its original
>             form, plus the
>             folding of the 000 case.
>
>
>         Thanks Thomas! While we are awaiting a second reviewer I will
>         test this
>         out internally. It may take a day or two sorry.
>
>
>     Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV
>     instead:
>
>     will jump to PC 0xb6fb1000, which should cause a SIGILL.
>     #
>     # A fatal error has been detected by the Java Runtime Environment:
>     #
>     #  SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280
>
>     If I read the ARM architecture manual correctly all zeroes will map
>     to a conditional AND instruction (Ref A8.6.12 AND(register))
>
>     David
>
>
>         David
>
>             I only can guess what your closed platforms are, but if it
>             is ARM, I
>             believe opcodes 0-31 are undefined. For ia64, 0 is undefined
>             as well.
>
>             Kind regards, Thomas
>
>
>             On Wed, Nov 26, 2014 at 1:02 PM, David Holmes
>             <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>             <mailto:david.holmes at oracle.__com
>             <mailto:david.holmes at oracle.com>>> wrote:
>
>                  On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>
>                      Hi David,
>                      ...
>
>                                    - In debug.cpp for the SIGILL can you
>             define the
>                      all zero
>                               case as a
>                                    default so we only need to add
>             platform specific
>                               definitions when
>                                    all zeroes doesn't work. I really
>             hate seeing all
>                      that CPU
>                               selection
>                                    in shared code. :(
>
>
>                               Agreed and fixed, moved the CPU-specific
>             sections into
>                               CPU-specific files.
>
>
>                           I'd really like to see a way to share the
>             all-zeroes case
>                      so that we
>                           don't need to add platform specific code
>             unnecessarily.
>
>
>                      sooo.. back to the original code then, just with
>             the #ifdef,
>                      just with
>                      the zero-cases all folded in into the #else path?
>             Or do you
>             prefer
>                      something else?
>
>
>                  Elsewhere there is a pattern of defining per-platform
>             values that
>                  can override the shared definition. eg:
>
>                  #ifndef HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
>                     Foo XXX = ...;  //shared/default initalization
>                  #endif
>
>                  but this assumes a platform specific header has already
>             been
>                  included that can do:
>
>                  #define HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
>                  Foo XXX = ... ; // platform specific initialization
>
>                  But that is not the case for debug.hpp.
>
>                  So I guess folding the zero-case into the else path is
>             the best we
>                  can do. However I'm assuming the zero case will work
>             for our
>                  internal platforms ... if it doesn't then we'd have to
>             pollute the
>                  shared code with info for the closed platforms. :(
>
>                  David
>                  -----
>
>
>                                    - Style nit: please use i++ rather
>             than i ++
>
>
>                               Fixed.
>
>                                    Aside: we should eradicate the use of
>             sigprocmask and
>                               replace with
>                                    the thread specific version.
>
>
>                               Agree. Though I never saw any errors
>             stemming from the
>                      use of
>                               sigprocmask(). According to POSIX,
>             sigprocmask() is
>                      undefined in
>                               multithreaded environment, and I guess
>             most OSes just
>                      default to
>                               pthread_sigmask.
>
>
>                           Yes "probably" works okay but I hate to see us
>             using
>                      something with
>                           undefined semantics. That's future clean up
>             though.
>
>
>                      We (SAP JVM) already use pthread_sigmask() /
>             thr_sigsetmask()
>                      instead of
>                      sigprocmask. Works fine. We can port this to the
>             OpenJDK.
>
>                                    Getting back to the "thinking more
>             about this" ...
>                      If a
>                               synchronous
>                                    signal is blocked at the time it is
>             generated
>             then it
>                               should remain
>                                    pending on the thread (POSIX spec)
>             but that
>                      doesn't tell us
>                               what the
>                                    thread will then do - retry the faulting
>                      instruction? Become
>                                    unschedulable? So I can easily
>             imagine that a hang
>                      or process
>                                    termination may result.
>
>
>                               This is exactly what happens, but it is
>             actually
>             covered by
>                               POSIX, see
>                               doc on pthread_sigmask: "If any of the
>             SIGFPE, SIGILL,
>                      SIGSEGV, or
>                               SIGBUS signals are generated while they
>             are blocked,
>                      the result is
>                               undefined, unless the signal was generated
>             by the
>             /kill/()
>
>
>             <http://pubs.opengroup.org/______onlinepubs/009695399/______functions/kill.html
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html>
>
>
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>>
>
>
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
>             <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>>
>                               function, the /sigqueue/()
>
>
>             <http://pubs.opengroup.org/______onlinepubs/009695399/______functions/sigqueue.html
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html>
>
>
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>>
>
>
>
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>
>
>
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html
>             <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>>>>
>
>                               function, or the /raise/()
>
>
>             <http://pubs.opengroup.org/______onlinepubs/009695399/______functions/raise.html
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html>
>
>
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>>
>
>
>
>             <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>
>
>             <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>             <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>>
>                               function."
>
>
>                           Thanks - I managed to miss that part even
>             though I found
>                      the other
>                           part about the signal handling function
>             returning. :(
>
>
>                      It is well hidden, I found it by accident :) To me
>             it looks like
>                      they
>                      kept it intentionally vague, to not block platforms
>             where those
>                      signals
>                      could be somehow dealt with automatically? Hard to
>             see though
>                      how this
>                      would work.
>
>
>
>                               In reality, process usually aborts
>             abnormally with the
>                      default
>                               action
>                               for the signal, e.g. printing out "Illegal
>             Instruction". On
>                               MacOS, we
>                               hang (until the Watcherthread finally
>             kills the VM).
>             On old
>                               AIXes, we
>                               die without a trace.
>
>                               This also can be easily tried out by
>             removing SIGILL
>                      from the
>                               list of
>                               signals in vmError_<os>.cpp and executing:
>
>                               java -XX:ErrorHandlerTest=14
>             -XX:TestCrashInErrorHandler=15
>
>                               which will crash first with a SIGSEGV,
>             then in error
>                      handling with a
>                               secondary SIGILL. This will interrupt
>             error reporting
>                      and kill
>                               or hang
>                               the process.
>
>
>                                    In that sense unblocking those
>             signals whilst
>                      handling the
>                               initial
>                                    signal may well allow the error
>             reporting process
>                      to continue
>                                    further. But I'm unclear exactly how
>             this plays
>             out:
>
>                                    - synchronous signal encountered
>                                    - crash_handler invoked
>
>                                    - VMError::report_and_die executes
>                                    - secondary signal encountered
>
>                                    - crash_handler invoked again
>
>
>                               almost: not again, different signal
>             handler now. First
>                      signal was
>                               handled by "JVM_handle_<os>_signal()"
>
>
>                           Ah missed that - thanks - not that it makes much
>             difference :)
>
>
>                      I just like nitpicking :)
>
>                                    - VMError::report_and_die executes
>             again and
>             sees the
>                               recursion and
>                                    returns (ignoring abort due to
>             excessive recursive
>                      errors)
>
>
>                               No..
>
>                                    Is that right? So we actually return
>             from the
>                      crash_handler?
>
>
>                               Oh, but we dont return.
>             VMError::report_and_die()
>             will just
>                               create a new
>                               frame and re-execute VMError::report() of
>             the first
>                      VMError object.
>                               Which then will continue with the next
>             STEP. We never
>                      return,
>                               for each
>                               secondary error signal a new frame is created.
>
>
>                           I had trouble tracing through exactly what
>             might happen
>             on the
>                           recursive call to report_and_die. I see know
>             that report
>                      comes from:
>
>                                staticBufferStream sbs(buffer, O_BUFLEN,
>             &log);
>                                first_error->report(&sbs);
>                                first_error->_current_step = 0;
>               // reset
>                      current_step
>                                first_error->_current_step_______info =
>             "";   // reset
>                      current_step
>
>                           string
>
>                           so the second time through we will call report and
>                      _current_step
>                           should indicate where to start executing from.
>
>
>                      Exactly. There is also a catch, in that the stack
>             usage goes
>             up. Not
>                      endlessly, it is limited by the number of error
>             reporting steps.
>                      The more stack VmError::report() does cost, the
>             less well this
>                      works,
>                      especially in stack overflow scenarios.
>
>                      Which is why we extended SafeFetch and enabled it
>             for the use
>             in the
>                      error handler, which will be one of the the next
>             patches I'd
>             like to
>                      port to the OpenJDK, once this one is thru.
>
>
>                               This all happens in VMError::report_and_die:
>                               -> first error ? anchor VMError object in
>             a static
>                      variable and
>                               execute
>                               VMError::report()
>                               -> secondary error?
>                                    -> different thread? just sleep forever
>                                    -> same thread? new frame, re-enter
>                      VMError::report(). Once
>                               done, abort.
>
>                               I always found that rather neat, but in
>             fact that is
>                      not our
>                               invention
>                               but Sun's :) Anyway, my fix does not
>             change this
>                      behaviour for
>                               better or
>                               worse, it only makes it usable for more cases.
>
>                                    Because this puts us in undefined
>             territory
>                      according to POSIX:
>
>                                    "The behavior of a process is
>             undefined after it
>                      returns
>                               normally
>                                    from a signal-catching function for a
>             SIGBUS,
>             SIGFPE,
>                               SIGILL, or
>                                    SIGSEGV signal that was not generated
>             by kill(),
>                      sigqueue(), or
>                                    raise()."
>
>                               true, but we dont return...
>
>                                    On top of that you also have the
>             issue that error
>                      reporting
>                               does a
>                                    whole bunch of things that are not
>                      async-signal-safe so we can
>                                    easily encounter hangs or aborts.
>
>                                    But we're dying anyway so I guess
>             none of this
>             really
>                               matters. If
>                                    re-enabling these signals allows
>             error reporting to
>                               progress further
>                                    in some cases then that is a win.
>
>
>                               Actually, this covers a lot of cases,
>             mostly because
>                      SIGSEGV during
>                               error reporting is common, so if the
>             original error
>             was not
>                               SIGSEGV, but
>                               e.g. SIGILL, this would always result in
>             broken hs-err
>                      files.
>
>                               The back story is that at SAP, we rely
>             heavily on the
>                      hs-err
>                               files. They
>                               are our main tool for support, because
>             working with
>                      cores is
>                               often not
>                               feasible. So, we put a lot of work in
>             making error
>                      reporting
>                               reliable
>                               across all platforms. This is also covered
>             by many
>                      tests which
>                               crash the
>                               VM in exciting ways and check the hs-err
>             files for
>                      completeness.
>
>
>                           OK. Modulo the cpu specific SIGILL part
>             everything else
>                      seems fine.
>
>                      Great. just tell me how you want that part.
>
>                      Kind regards, Thomas
>
>                           Thanks,
>                           David
>
>
>