RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Thomas Stüfe thomas.stuefe at gmail.com
Thu Nov 27 09:27:02 UTC 2014


I am preparing a jtreg tests which would fail if no SIGILL is produced. A
real SIGILL is needed to make the test meaningful, although I guess a fake
SIGILL (kill() or raise()) would make the test pass too. Which could be a
workaround for the time being.

I could live with either #ifdef in shared code - shared code already
contains lots of #ifdef ARM - or with cpu-specific files; I also could add
the debug_<cpu>.hpp files needed for your solution.

Kind Regards, Thomas

On Thu, Nov 27, 2014 at 10:01 AM, David Holmes <david.holmes at oracle.com>
wrote:

> On 27/11/2014 5:36 PM, Thomas Stüfe wrote:
>
>> Unfortunately, I cannot test it, as I have no ARM environment. The best
>> I can come up with without testing is this:
>> http://stackoverflow.com/questions/16081618/programmatically-cause-
>> undefined-instruction-exception
>>
>
> The issue is how to handle this? Put ifdefs for ARM in the open code?
> Revert to your per-platform solution? Some other variation? Or do we just
> not care if we can't trigger SIGILL on ARM? Though I'd like to hear from
> the AARCH64 folk too.
>
> David
>
>  Kind regards, Thomas
>>
>> On Thu, Nov 27, 2014 at 6:18 AM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>>     On 27/11/2014 10:49 AM, David Holmes wrote:
>>
>>         On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>>
>>             Hi David,
>>
>>             here you go:
>>             http://cr.openjdk.java.net/~__stuefe/webrevs/8065895/webrev.
>> __02/
>>
>>             <http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.
>> 02/>
>>
>>             Reverted SIGILL-generating function back to its original
>>             form, plus the
>>             folding of the 000 case.
>>
>>
>>         Thanks Thomas! While we are awaiting a second reviewer I will
>>         test this
>>         out internally. It may take a day or two sorry.
>>
>>
>>     Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV
>>     instead:
>>
>>     will jump to PC 0xb6fb1000, which should cause a SIGILL.
>>     #
>>     # A fatal error has been detected by the Java Runtime Environment:
>>     #
>>     #  SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280
>>
>>     If I read the ARM architecture manual correctly all zeroes will map
>>     to a conditional AND instruction (Ref A8.6.12 AND(register))
>>
>>     David
>>
>>
>>         David
>>
>>             I only can guess what your closed platforms are, but if it
>>             is ARM, I
>>             believe opcodes 0-31 are undefined. For ia64, 0 is undefined
>>             as well.
>>
>>             Kind regards, Thomas
>>
>>
>>             On Wed, Nov 26, 2014 at 1:02 PM, David Holmes
>>             <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>             <mailto:david.holmes at oracle.__com
>>
>>             <mailto:david.holmes at oracle.com>>> wrote:
>>
>>                  On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>>
>>                      Hi David,
>>                      ...
>>
>>                                    - In debug.cpp for the SIGILL can you
>>             define the
>>                      all zero
>>                               case as a
>>                                    default so we only need to add
>>             platform specific
>>                               definitions when
>>                                    all zeroes doesn't work. I really
>>             hate seeing all
>>                      that CPU
>>                               selection
>>                                    in shared code. :(
>>
>>
>>                               Agreed and fixed, moved the CPU-specific
>>             sections into
>>                               CPU-specific files.
>>
>>
>>                           I'd really like to see a way to share the
>>             all-zeroes case
>>                      so that we
>>                           don't need to add platform specific code
>>             unnecessarily.
>>
>>
>>                      sooo.. back to the original code then, just with
>>             the #ifdef,
>>                      just with
>>                      the zero-cases all folded in into the #else path?
>>             Or do you
>>             prefer
>>                      something else?
>>
>>
>>                  Elsewhere there is a pattern of defining per-platform
>>             values that
>>                  can override the shared definition. eg:
>>
>>                  #ifndef HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
>>                     Foo XXX = ...;  //shared/default initalization
>>                  #endif
>>
>>                  but this assumes a platform specific header has already
>>             been
>>                  included that can do:
>>
>>                  #define HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
>>
>>                  Foo XXX = ... ; // platform specific initialization
>>
>>                  But that is not the case for debug.hpp.
>>
>>                  So I guess folding the zero-case into the else path is
>>             the best we
>>                  can do. However I'm assuming the zero case will work
>>             for our
>>                  internal platforms ... if it doesn't then we'd have to
>>             pollute the
>>                  shared code with info for the closed platforms. :(
>>
>>                  David
>>                  -----
>>
>>
>>                                    - Style nit: please use i++ rather
>>             than i ++
>>
>>
>>                               Fixed.
>>
>>                                    Aside: we should eradicate the use of
>>             sigprocmask and
>>                               replace with
>>                                    the thread specific version.
>>
>>
>>                               Agree. Though I never saw any errors
>>             stemming from the
>>                      use of
>>                               sigprocmask(). According to POSIX,
>>             sigprocmask() is
>>                      undefined in
>>                               multithreaded environment, and I guess
>>             most OSes just
>>                      default to
>>                               pthread_sigmask.
>>
>>
>>                           Yes "probably" works okay but I hate to see us
>>             using
>>                      something with
>>                           undefined semantics. That's future clean up
>>             though.
>>
>>
>>                      We (SAP JVM) already use pthread_sigmask() /
>>             thr_sigsetmask()
>>                      instead of
>>                      sigprocmask. Works fine. We can port this to the
>>             OpenJDK.
>>
>>                                    Getting back to the "thinking more
>>             about this" ...
>>                      If a
>>                               synchronous
>>                                    signal is blocked at the time it is
>>             generated
>>             then it
>>                               should remain
>>                                    pending on the thread (POSIX spec)
>>             but that
>>                      doesn't tell us
>>                               what the
>>                                    thread will then do - retry the
>> faulting
>>                      instruction? Become
>>                                    unschedulable? So I can easily
>>             imagine that a hang
>>                      or process
>>                                    termination may result.
>>
>>
>>                               This is exactly what happens, but it is
>>             actually
>>             covered by
>>                               POSIX, see
>>                               doc on pthread_sigmask: "If any of the
>>             SIGFPE, SIGILL,
>>                      SIGSEGV, or
>>                               SIGBUS signals are generated while they
>>             are blocked,
>>                      the result is
>>                               undefined, unless the signal was generated
>>             by the
>>             /kill/()
>>
>>
>>             <http://pubs.opengroup.org/______onlinepubs/009695399/______
>> functions/kill.html
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/kill.html>
>>
>>
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/kill.html
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html>>
>>
>>
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/kill.html
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html>
>>
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html
>>             <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/kill.html>>>>
>>                               function, the /sigqueue/()
>>
>>
>>             <http://pubs.opengroup.org/______onlinepubs/009695399/______
>> functions/sigqueue.html
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/sigqueue.html>
>>
>>
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/sigqueue.html
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html>>
>>
>>
>>
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/sigqueue.html
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html>
>>
>>
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html
>>             <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/sigqueue.html>>>>
>>
>>                               function, or the /raise/()
>>
>>
>>             <http://pubs.opengroup.org/______onlinepubs/009695399/______
>> functions/raise.html
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/raise.html>
>>
>>
>>
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/raise.html
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html>>
>>
>>
>>
>>             <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/raise.html
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html>
>>
>>             <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html
>>             <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/raise.html>>>>
>>                               function."
>>
>>
>>                           Thanks - I managed to miss that part even
>>             though I found
>>                      the other
>>                           part about the signal handling function
>>             returning. :(
>>
>>
>>                      It is well hidden, I found it by accident :) To me
>>             it looks like
>>                      they
>>                      kept it intentionally vague, to not block platforms
>>             where those
>>                      signals
>>                      could be somehow dealt with automatically? Hard to
>>             see though
>>                      how this
>>                      would work.
>>
>>
>>
>>                               In reality, process usually aborts
>>             abnormally with the
>>                      default
>>                               action
>>                               for the signal, e.g. printing out "Illegal
>>             Instruction". On
>>                               MacOS, we
>>                               hang (until the Watcherthread finally
>>             kills the VM).
>>             On old
>>                               AIXes, we
>>                               die without a trace.
>>
>>                               This also can be easily tried out by
>>             removing SIGILL
>>                      from the
>>                               list of
>>                               signals in vmError_<os>.cpp and executing:
>>
>>                               java -XX:ErrorHandlerTest=14
>>             -XX:TestCrashInErrorHandler=15
>>
>>                               which will crash first with a SIGSEGV,
>>             then in error
>>                      handling with a
>>                               secondary SIGILL. This will interrupt
>>             error reporting
>>                      and kill
>>                               or hang
>>                               the process.
>>
>>
>>                                    In that sense unblocking those
>>             signals whilst
>>                      handling the
>>                               initial
>>                                    signal may well allow the error
>>             reporting process
>>                      to continue
>>                                    further. But I'm unclear exactly how
>>             this plays
>>             out:
>>
>>                                    - synchronous signal encountered
>>                                    - crash_handler invoked
>>
>>                                    - VMError::report_and_die executes
>>                                    - secondary signal encountered
>>
>>                                    - crash_handler invoked again
>>
>>
>>                               almost: not again, different signal
>>             handler now. First
>>                      signal was
>>                               handled by "JVM_handle_<os>_signal()"
>>
>>
>>                           Ah missed that - thanks - not that it makes much
>>             difference :)
>>
>>
>>                      I just like nitpicking :)
>>
>>                                    - VMError::report_and_die executes
>>             again and
>>             sees the
>>                               recursion and
>>                                    returns (ignoring abort due to
>>             excessive recursive
>>                      errors)
>>
>>
>>                               No..
>>
>>                                    Is that right? So we actually return
>>             from the
>>                      crash_handler?
>>
>>
>>                               Oh, but we dont return.
>>             VMError::report_and_die()
>>             will just
>>                               create a new
>>                               frame and re-execute VMError::report() of
>>             the first
>>                      VMError object.
>>                               Which then will continue with the next
>>             STEP. We never
>>                      return,
>>                               for each
>>                               secondary error signal a new frame is
>> created.
>>
>>
>>                           I had trouble tracing through exactly what
>>             might happen
>>             on the
>>                           recursive call to report_and_die. I see know
>>             that report
>>                      comes from:
>>
>>                                staticBufferStream sbs(buffer, O_BUFLEN,
>>             &log);
>>                                first_error->report(&sbs);
>>                                first_error->_current_step = 0;
>>               // reset
>>                      current_step
>>                                first_error->_current_step_______info =
>>
>>             "";   // reset
>>                      current_step
>>
>>                           string
>>
>>                           so the second time through we will call report
>> and
>>                      _current_step
>>                           should indicate where to start executing from.
>>
>>
>>                      Exactly. There is also a catch, in that the stack
>>             usage goes
>>             up. Not
>>                      endlessly, it is limited by the number of error
>>             reporting steps.
>>                      The more stack VmError::report() does cost, the
>>             less well this
>>                      works,
>>                      especially in stack overflow scenarios.
>>
>>                      Which is why we extended SafeFetch and enabled it
>>             for the use
>>             in the
>>                      error handler, which will be one of the the next
>>             patches I'd
>>             like to
>>                      port to the OpenJDK, once this one is thru.
>>
>>
>>                               This all happens in VMError::report_and_die:
>>                               -> first error ? anchor VMError object in
>>             a static
>>                      variable and
>>                               execute
>>                               VMError::report()
>>                               -> secondary error?
>>                                    -> different thread? just sleep forever
>>                                    -> same thread? new frame, re-enter
>>                      VMError::report(). Once
>>                               done, abort.
>>
>>                               I always found that rather neat, but in
>>             fact that is
>>                      not our
>>                               invention
>>                               but Sun's :) Anyway, my fix does not
>>             change this
>>                      behaviour for
>>                               better or
>>                               worse, it only makes it usable for more
>> cases.
>>
>>                                    Because this puts us in undefined
>>             territory
>>                      according to POSIX:
>>
>>                                    "The behavior of a process is
>>             undefined after it
>>                      returns
>>                               normally
>>                                    from a signal-catching function for a
>>             SIGBUS,
>>             SIGFPE,
>>                               SIGILL, or
>>                                    SIGSEGV signal that was not generated
>>             by kill(),
>>                      sigqueue(), or
>>                                    raise()."
>>
>>                               true, but we dont return...
>>
>>                                    On top of that you also have the
>>             issue that error
>>                      reporting
>>                               does a
>>                                    whole bunch of things that are not
>>                      async-signal-safe so we can
>>                                    easily encounter hangs or aborts.
>>
>>                                    But we're dying anyway so I guess
>>             none of this
>>             really
>>                               matters. If
>>                                    re-enabling these signals allows
>>             error reporting to
>>                               progress further
>>                                    in some cases then that is a win.
>>
>>
>>                               Actually, this covers a lot of cases,
>>             mostly because
>>                      SIGSEGV during
>>                               error reporting is common, so if the
>>             original error
>>             was not
>>                               SIGSEGV, but
>>                               e.g. SIGILL, this would always result in
>>             broken hs-err
>>                      files.
>>
>>                               The back story is that at SAP, we rely
>>             heavily on the
>>                      hs-err
>>                               files. They
>>                               are our main tool for support, because
>>             working with
>>                      cores is
>>                               often not
>>                               feasible. So, we put a lot of work in
>>             making error
>>                      reporting
>>                               reliable
>>                               across all platforms. This is also covered
>>             by many
>>                      tests which
>>                               crash the
>>                               VM in exciting ways and check the hs-err
>>             files for
>>                      completeness.
>>
>>
>>                           OK. Modulo the cpu specific SIGILL part
>>             everything else
>>                      seems fine.
>>
>>                      Great. just tell me how you want that part.
>>
>>                      Kind regards, Thomas
>>
>>                           Thanks,
>>                           David
>>
>>
>>
>>


More information about the hotspot-runtime-dev mailing list