RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Thu Nov 27 05:18:15 UTC 2014

On 27/11/2014 10:49 AM, David Holmes wrote:
> On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>> Hi David,
>>
>> here you go:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/
>>
>> Reverted SIGILL-generating function back to its original form, plus the
>> folding of the 000 case.
>
> Thanks Thomas! While we are awaiting a second reviewer I will test this
> out internally. It may take a day or two sorry.

Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV instead:

will jump to PC 0xb6fb1000, which should cause a SIGILL.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280

If I read the ARM architecture manual correctly all zeroes will map to a 
conditional AND instruction (Ref A8.6.12 AND(register))

David

> David
>
>> I only can guess what your closed platforms are, but if it is ARM, I
>> believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.
>>
>> Kind regards, Thomas
>>
>>
>> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>>     On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>>
>>         Hi David,
>>         ...
>>
>>                       - In debug.cpp for the SIGILL can you define the
>>         all zero
>>                  case as a
>>                       default so we only need to add platform specific
>>                  definitions when
>>                       all zeroes doesn't work. I really hate seeing all
>>         that CPU
>>                  selection
>>                       in shared code. :(
>>
>>
>>                  Agreed and fixed, moved the CPU-specific sections into
>>                  CPU-specific files.
>>
>>
>>              I'd really like to see a way to share the all-zeroes case
>>         so that we
>>              don't need to add platform specific code unnecessarily.
>>
>>
>>         sooo.. back to the original code then, just with the #ifdef,
>>         just with
>>         the zero-cases all folded in into the #else path? Or do you
>> prefer
>>         something else?
>>
>>
>>     Elsewhere there is a pattern of defining per-platform values that
>>     can override the shared definition. eg:
>>
>>     #ifndef HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>>        Foo XXX = ...;  //shared/default initalization
>>     #endif
>>
>>     but this assumes a platform specific header has already been
>>     included that can do:
>>
>>     #define HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>>     Foo XXX = ... ; // platform specific initialization
>>
>>     But that is not the case for debug.hpp.
>>
>>     So I guess folding the zero-case into the else path is the best we
>>     can do. However I'm assuming the zero case will work for our
>>     internal platforms ... if it doesn't then we'd have to pollute the
>>     shared code with info for the closed platforms. :(
>>
>>     David
>>     -----
>>
>>
>>                       - Style nit: please use i++ rather than i ++
>>
>>
>>                  Fixed.
>>
>>                       Aside: we should eradicate the use of
>> sigprocmask and
>>                  replace with
>>                       the thread specific version.
>>
>>
>>                  Agree. Though I never saw any errors stemming from the
>>         use of
>>                  sigprocmask(). According to POSIX, sigprocmask() is
>>         undefined in
>>                  multithreaded environment, and I guess most OSes just
>>         default to
>>                  pthread_sigmask.
>>
>>
>>              Yes "probably" works okay but I hate to see us using
>>         something with
>>              undefined semantics. That's future clean up though.
>>
>>
>>         We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask()
>>         instead of
>>         sigprocmask. Works fine. We can port this to the OpenJDK.
>>
>>                       Getting back to the "thinking more about this" ...
>>         If a
>>                  synchronous
>>                       signal is blocked at the time it is generated
>> then it
>>                  should remain
>>                       pending on the thread (POSIX spec) but that
>>         doesn't tell us
>>                  what the
>>                       thread will then do - retry the faulting
>>         instruction? Become
>>                       unschedulable? So I can easily imagine that a hang
>>         or process
>>                       termination may result.
>>
>>
>>                  This is exactly what happens, but it is actually
>> covered by
>>                  POSIX, see
>>                  doc on pthread_sigmask: "If any of the SIGFPE, SIGILL,
>>         SIGSEGV, or
>>                  SIGBUS signals are generated while they are blocked,
>>         the result is
>>                  undefined, unless the signal was generated by the
>> /kill/()
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
>>
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>
>>                  function, the /sigqueue/()
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>
>>
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html
>>
>>
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>>>
>>
>>                  function, or the /raise/()
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>
>>
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>>
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>
>>                  function."
>>
>>
>>              Thanks - I managed to miss that part even though I found
>>         the other
>>              part about the signal handling function returning. :(
>>
>>
>>         It is well hidden, I found it by accident :) To me it looks like
>>         they
>>         kept it intentionally vague, to not block platforms where those
>>         signals
>>         could be somehow dealt with automatically? Hard to see though
>>         how this
>>         would work.
>>
>>
>>
>>                  In reality, process usually aborts abnormally with the
>>         default
>>                  action
>>                  for the signal, e.g. printing out "Illegal
>> Instruction". On
>>                  MacOS, we
>>                  hang (until the Watcherthread finally kills the VM).
>> On old
>>                  AIXes, we
>>                  die without a trace.
>>
>>                  This also can be easily tried out by removing SIGILL
>>         from the
>>                  list of
>>                  signals in vmError_<os>.cpp and executing:
>>
>>                  java -XX:ErrorHandlerTest=14
>> -XX:TestCrashInErrorHandler=15
>>
>>                  which will crash first with a SIGSEGV, then in error
>>         handling with a
>>                  secondary SIGILL. This will interrupt error reporting
>>         and kill
>>                  or hang
>>                  the process.
>>
>>
>>                       In that sense unblocking those signals whilst
>>         handling the
>>                  initial
>>                       signal may well allow the error reporting process
>>         to continue
>>                       further. But I'm unclear exactly how this plays
>> out:
>>
>>                       - synchronous signal encountered
>>                       - crash_handler invoked
>>
>>                       - VMError::report_and_die executes
>>                       - secondary signal encountered
>>
>>                       - crash_handler invoked again
>>
>>
>>                  almost: not again, different signal handler now. First
>>         signal was
>>                  handled by "JVM_handle_<os>_signal()"
>>
>>
>>              Ah missed that - thanks - not that it makes much
>> difference :)
>>
>>
>>         I just like nitpicking :)
>>
>>                       - VMError::report_and_die executes again and
>> sees the
>>                  recursion and
>>                       returns (ignoring abort due to excessive recursive
>>         errors)
>>
>>
>>                  No..
>>
>>                       Is that right? So we actually return from the
>>         crash_handler?
>>
>>
>>                  Oh, but we dont return. VMError::report_and_die()
>> will just
>>                  create a new
>>                  frame and re-execute VMError::report() of the first
>>         VMError object.
>>                  Which then will continue with the next STEP. We never
>>         return,
>>                  for each
>>                  secondary error signal a new frame is created.
>>
>>
>>              I had trouble tracing through exactly what might happen
>> on the
>>              recursive call to report_and_die. I see know that report
>>         comes from:
>>
>>                   staticBufferStream sbs(buffer, O_BUFLEN, &log);
>>                   first_error->report(&sbs);
>>                   first_error->_current_step = 0;         // reset
>>         current_step
>>                   first_error->_current_step_____info = "";   // reset
>>         current_step
>>
>>              string
>>
>>              so the second time through we will call report and
>>         _current_step
>>              should indicate where to start executing from.
>>
>>
>>         Exactly. There is also a catch, in that the stack usage goes
>> up. Not
>>         endlessly, it is limited by the number of error reporting steps.
>>         The more stack VmError::report() does cost, the less well this
>>         works,
>>         especially in stack overflow scenarios.
>>
>>         Which is why we extended SafeFetch and enabled it for the use
>> in the
>>         error handler, which will be one of the the next patches I'd
>> like to
>>         port to the OpenJDK, once this one is thru.
>>
>>
>>                  This all happens in VMError::report_and_die:
>>                  -> first error ? anchor VMError object in a static
>>         variable and
>>                  execute
>>                  VMError::report()
>>                  -> secondary error?
>>                       -> different thread? just sleep forever
>>                       -> same thread? new frame, re-enter
>>         VMError::report(). Once
>>                  done, abort.
>>
>>                  I always found that rather neat, but in fact that is
>>         not our
>>                  invention
>>                  but Sun's :) Anyway, my fix does not change this
>>         behaviour for
>>                  better or
>>                  worse, it only makes it usable for more cases.
>>
>>                       Because this puts us in undefined territory
>>         according to POSIX:
>>
>>                       "The behavior of a process is undefined after it
>>         returns
>>                  normally
>>                       from a signal-catching function for a SIGBUS,
>> SIGFPE,
>>                  SIGILL, or
>>                       SIGSEGV signal that was not generated by kill(),
>>         sigqueue(), or
>>                       raise()."
>>
>>                  true, but we dont return...
>>
>>                       On top of that you also have the issue that error
>>         reporting
>>                  does a
>>                       whole bunch of things that are not
>>         async-signal-safe so we can
>>                       easily encounter hangs or aborts.
>>
>>                       But we're dying anyway so I guess none of this
>> really
>>                  matters. If
>>                       re-enabling these signals allows error reporting to
>>                  progress further
>>                       in some cases then that is a win.
>>
>>
>>                  Actually, this covers a lot of cases, mostly because
>>         SIGSEGV during
>>                  error reporting is common, so if the original error
>> was not
>>                  SIGSEGV, but
>>                  e.g. SIGILL, this would always result in broken hs-err
>>         files.
>>
>>                  The back story is that at SAP, we rely heavily on the
>>         hs-err
>>                  files. They
>>                  are our main tool for support, because working with
>>         cores is
>>                  often not
>>                  feasible. So, we put a lot of work in making error
>>         reporting
>>                  reliable
>>                  across all platforms. This is also covered by many
>>         tests which
>>                  crash the
>>                  VM in exciting ways and check the hs-err files for
>>         completeness.
>>
>>
>>              OK. Modulo the cpu specific SIGILL part everything else
>>         seems fine.
>>
>>         Great. just tell me how you want that part.
>>
>>         Kind regards, Thomas
>>
>>              Thanks,
>>              David
>>
>>