RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Thomas Stüfe thomas.stuefe at gmail.com
Thu Nov 27 07:36:44 UTC 2014


Unfortunately, I cannot test it, as I have no ARM environment. The best I
can come up with without testing is this:
http://stackoverflow.com/questions/16081618/programmatically-cause-undefined-instruction-exception

Kind regards, Thomas

On Thu, Nov 27, 2014 at 6:18 AM, David Holmes <david.holmes at oracle.com>
wrote:

> On 27/11/2014 10:49 AM, David Holmes wrote:
>
>> On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>>
>>> Hi David,
>>>
>>> here you go:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/
>>>
>>> Reverted SIGILL-generating function back to its original form, plus the
>>> folding of the 000 case.
>>>
>>
>> Thanks Thomas! While we are awaiting a second reviewer I will test this
>> out internally. It may take a day or two sorry.
>>
>
> Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV instead:
>
> will jump to PC 0xb6fb1000, which should cause a SIGILL.
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280
>
> If I read the ARM architecture manual correctly all zeroes will map to a
> conditional AND instruction (Ref A8.6.12 AND(register))
>
> David
>
>
>  David
>>
>>  I only can guess what your closed platforms are, but if it is ARM, I
>>> believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.
>>>
>>> Kind regards, Thomas
>>>
>>>
>>> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>>     On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>>>
>>>         Hi David,
>>>         ...
>>>
>>>                       - In debug.cpp for the SIGILL can you define the
>>>         all zero
>>>                  case as a
>>>                       default so we only need to add platform specific
>>>                  definitions when
>>>                       all zeroes doesn't work. I really hate seeing all
>>>         that CPU
>>>                  selection
>>>                       in shared code. :(
>>>
>>>
>>>                  Agreed and fixed, moved the CPU-specific sections into
>>>                  CPU-specific files.
>>>
>>>
>>>              I'd really like to see a way to share the all-zeroes case
>>>         so that we
>>>              don't need to add platform specific code unnecessarily.
>>>
>>>
>>>         sooo.. back to the original code then, just with the #ifdef,
>>>         just with
>>>         the zero-cases all folded in into the #else path? Or do you
>>> prefer
>>>         something else?
>>>
>>>
>>>     Elsewhere there is a pattern of defining per-platform values that
>>>     can override the shared definition. eg:
>>>
>>>     #ifndef HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>>>        Foo XXX = ...;  //shared/default initalization
>>>     #endif
>>>
>>>     but this assumes a platform specific header has already been
>>>     included that can do:
>>>
>>>     #define HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>>>     Foo XXX = ... ; // platform specific initialization
>>>
>>>     But that is not the case for debug.hpp.
>>>
>>>     So I guess folding the zero-case into the else path is the best we
>>>     can do. However I'm assuming the zero case will work for our
>>>     internal platforms ... if it doesn't then we'd have to pollute the
>>>     shared code with info for the closed platforms. :(
>>>
>>>     David
>>>     -----
>>>
>>>
>>>                       - Style nit: please use i++ rather than i ++
>>>
>>>
>>>                  Fixed.
>>>
>>>                       Aside: we should eradicate the use of
>>> sigprocmask and
>>>                  replace with
>>>                       the thread specific version.
>>>
>>>
>>>                  Agree. Though I never saw any errors stemming from the
>>>         use of
>>>                  sigprocmask(). According to POSIX, sigprocmask() is
>>>         undefined in
>>>                  multithreaded environment, and I guess most OSes just
>>>         default to
>>>                  pthread_sigmask.
>>>
>>>
>>>              Yes "probably" works okay but I hate to see us using
>>>         something with
>>>              undefined semantics. That's future clean up though.
>>>
>>>
>>>         We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask()
>>>         instead of
>>>         sigprocmask. Works fine. We can port this to the OpenJDK.
>>>
>>>                       Getting back to the "thinking more about this" ...
>>>         If a
>>>                  synchronous
>>>                       signal is blocked at the time it is generated
>>> then it
>>>                  should remain
>>>                       pending on the thread (POSIX spec) but that
>>>         doesn't tell us
>>>                  what the
>>>                       thread will then do - retry the faulting
>>>         instruction? Become
>>>                       unschedulable? So I can easily imagine that a hang
>>>         or process
>>>                       termination may result.
>>>
>>>
>>>                  This is exactly what happens, but it is actually
>>> covered by
>>>                  POSIX, see
>>>                  doc on pthread_sigmask: "If any of the SIGFPE, SIGILL,
>>>         SIGSEGV, or
>>>                  SIGBUS signals are generated while they are blocked,
>>>         the result is
>>>                  undefined, unless the signal was generated by the
>>> /kill/()
>>>
>>>
>>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>>> functions/kill.html
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
>>>
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>
>>>                  function, the /sigqueue/()
>>>
>>>
>>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>>> functions/sigqueue.html
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>>> functions/sigqueue.html>
>>>
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>>> functions/sigqueue.html
>>>
>>>
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html
>>> >>>
>>>
>>>                  function, or the /raise/()
>>>
>>>
>>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>>> functions/raise.html
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>>> >
>>>
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>>>
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>
>>>                  function."
>>>
>>>
>>>              Thanks - I managed to miss that part even though I found
>>>         the other
>>>              part about the signal handling function returning. :(
>>>
>>>
>>>         It is well hidden, I found it by accident :) To me it looks like
>>>         they
>>>         kept it intentionally vague, to not block platforms where those
>>>         signals
>>>         could be somehow dealt with automatically? Hard to see though
>>>         how this
>>>         would work.
>>>
>>>
>>>
>>>                  In reality, process usually aborts abnormally with the
>>>         default
>>>                  action
>>>                  for the signal, e.g. printing out "Illegal
>>> Instruction". On
>>>                  MacOS, we
>>>                  hang (until the Watcherthread finally kills the VM).
>>> On old
>>>                  AIXes, we
>>>                  die without a trace.
>>>
>>>                  This also can be easily tried out by removing SIGILL
>>>         from the
>>>                  list of
>>>                  signals in vmError_<os>.cpp and executing:
>>>
>>>                  java -XX:ErrorHandlerTest=14
>>> -XX:TestCrashInErrorHandler=15
>>>
>>>                  which will crash first with a SIGSEGV, then in error
>>>         handling with a
>>>                  secondary SIGILL. This will interrupt error reporting
>>>         and kill
>>>                  or hang
>>>                  the process.
>>>
>>>
>>>                       In that sense unblocking those signals whilst
>>>         handling the
>>>                  initial
>>>                       signal may well allow the error reporting process
>>>         to continue
>>>                       further. But I'm unclear exactly how this plays
>>> out:
>>>
>>>                       - synchronous signal encountered
>>>                       - crash_handler invoked
>>>
>>>                       - VMError::report_and_die executes
>>>                       - secondary signal encountered
>>>
>>>                       - crash_handler invoked again
>>>
>>>
>>>                  almost: not again, different signal handler now. First
>>>         signal was
>>>                  handled by "JVM_handle_<os>_signal()"
>>>
>>>
>>>              Ah missed that - thanks - not that it makes much
>>> difference :)
>>>
>>>
>>>         I just like nitpicking :)
>>>
>>>                       - VMError::report_and_die executes again and
>>> sees the
>>>                  recursion and
>>>                       returns (ignoring abort due to excessive recursive
>>>         errors)
>>>
>>>
>>>                  No..
>>>
>>>                       Is that right? So we actually return from the
>>>         crash_handler?
>>>
>>>
>>>                  Oh, but we dont return. VMError::report_and_die()
>>> will just
>>>                  create a new
>>>                  frame and re-execute VMError::report() of the first
>>>         VMError object.
>>>                  Which then will continue with the next STEP. We never
>>>         return,
>>>                  for each
>>>                  secondary error signal a new frame is created.
>>>
>>>
>>>              I had trouble tracing through exactly what might happen
>>> on the
>>>              recursive call to report_and_die. I see know that report
>>>         comes from:
>>>
>>>                   staticBufferStream sbs(buffer, O_BUFLEN, &log);
>>>                   first_error->report(&sbs);
>>>                   first_error->_current_step = 0;         // reset
>>>         current_step
>>>                   first_error->_current_step_____info = "";   // reset
>>>         current_step
>>>
>>>              string
>>>
>>>              so the second time through we will call report and
>>>         _current_step
>>>              should indicate where to start executing from.
>>>
>>>
>>>         Exactly. There is also a catch, in that the stack usage goes
>>> up. Not
>>>         endlessly, it is limited by the number of error reporting steps.
>>>         The more stack VmError::report() does cost, the less well this
>>>         works,
>>>         especially in stack overflow scenarios.
>>>
>>>         Which is why we extended SafeFetch and enabled it for the use
>>> in the
>>>         error handler, which will be one of the the next patches I'd
>>> like to
>>>         port to the OpenJDK, once this one is thru.
>>>
>>>
>>>                  This all happens in VMError::report_and_die:
>>>                  -> first error ? anchor VMError object in a static
>>>         variable and
>>>                  execute
>>>                  VMError::report()
>>>                  -> secondary error?
>>>                       -> different thread? just sleep forever
>>>                       -> same thread? new frame, re-enter
>>>         VMError::report(). Once
>>>                  done, abort.
>>>
>>>                  I always found that rather neat, but in fact that is
>>>         not our
>>>                  invention
>>>                  but Sun's :) Anyway, my fix does not change this
>>>         behaviour for
>>>                  better or
>>>                  worse, it only makes it usable for more cases.
>>>
>>>                       Because this puts us in undefined territory
>>>         according to POSIX:
>>>
>>>                       "The behavior of a process is undefined after it
>>>         returns
>>>                  normally
>>>                       from a signal-catching function for a SIGBUS,
>>> SIGFPE,
>>>                  SIGILL, or
>>>                       SIGSEGV signal that was not generated by kill(),
>>>         sigqueue(), or
>>>                       raise()."
>>>
>>>                  true, but we dont return...
>>>
>>>                       On top of that you also have the issue that error
>>>         reporting
>>>                  does a
>>>                       whole bunch of things that are not
>>>         async-signal-safe so we can
>>>                       easily encounter hangs or aborts.
>>>
>>>                       But we're dying anyway so I guess none of this
>>> really
>>>                  matters. If
>>>                       re-enabling these signals allows error reporting to
>>>                  progress further
>>>                       in some cases then that is a win.
>>>
>>>
>>>                  Actually, this covers a lot of cases, mostly because
>>>         SIGSEGV during
>>>                  error reporting is common, so if the original error
>>> was not
>>>                  SIGSEGV, but
>>>                  e.g. SIGILL, this would always result in broken hs-err
>>>         files.
>>>
>>>                  The back story is that at SAP, we rely heavily on the
>>>         hs-err
>>>                  files. They
>>>                  are our main tool for support, because working with
>>>         cores is
>>>                  often not
>>>                  feasible. So, we put a lot of work in making error
>>>         reporting
>>>                  reliable
>>>                  across all platforms. This is also covered by many
>>>         tests which
>>>                  crash the
>>>                  VM in exciting ways and check the hs-err files for
>>>         completeness.
>>>
>>>
>>>              OK. Modulo the cpu specific SIGILL part everything else
>>>         seems fine.
>>>
>>>         Great. just tell me how you want that part.
>>>
>>>         Kind regards, Thomas
>>>
>>>              Thanks,
>>>              David
>>>
>>>
>>>


More information about the hotspot-runtime-dev mailing list