RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Nov 27 07:36:44 UTC 2014
Unfortunately, I cannot test it, as I have no ARM environment. The best I
can come up with without testing is this:
http://stackoverflow.com/questions/16081618/programmatically-cause-undefined-instruction-exception
Kind regards, Thomas
On Thu, Nov 27, 2014 at 6:18 AM, David Holmes <david.holmes at oracle.com>
wrote:
> On 27/11/2014 10:49 AM, David Holmes wrote:
>
>> On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>>
>>> Hi David,
>>>
>>> here you go:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/
>>>
>>> Reverted SIGILL-generating function back to its original form, plus the
>>> folding of the 000 case.
>>>
>>
>> Thanks Thomas! While we are awaiting a second reviewer I will test this
>> out internally. It may take a day or two sorry.
>>
>
> Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV instead:
>
> will jump to PC 0xb6fb1000, which should cause a SIGILL.
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280
>
> If I read the ARM architecture manual correctly all zeroes will map to a
> conditional AND instruction (Ref A8.6.12 AND(register))
>
> David
>
>
> David
>>
>> I only can guess what your closed platforms are, but if it is ARM, I
>>> believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.
>>>
>>> Kind regards, Thomas
>>>
>>>
>>> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>> On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>>>
>>> Hi David,
>>> ...
>>>
>>> - In debug.cpp for the SIGILL can you define the
>>> all zero
>>> case as a
>>> default so we only need to add platform specific
>>> definitions when
>>> all zeroes doesn't work. I really hate seeing all
>>> that CPU
>>> selection
>>> in shared code. :(
>>>
>>>
>>> Agreed and fixed, moved the CPU-specific sections into
>>> CPU-specific files.
>>>
>>>
>>> I'd really like to see a way to share the all-zeroes case
>>> so that we
>>> don't need to add platform specific code unnecessarily.
>>>
>>>
>>> sooo.. back to the original code then, just with the #ifdef,
>>> just with
>>> the zero-cases all folded in into the #else path? Or do you
>>> prefer
>>> something else?
>>>
>>>
>>> Elsewhere there is a pattern of defining per-platform values that
>>> can override the shared definition. eg:
>>>
>>> #ifndef HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>>> Foo XXX = ...; //shared/default initalization
>>> #endif
>>>
>>> but this assumes a platform specific header has already been
>>> included that can do:
>>>
>>> #define HAS_SPECIAL_PLATFORM_VALUE___FOR_XXXX
>>> Foo XXX = ... ; // platform specific initialization
>>>
>>> But that is not the case for debug.hpp.
>>>
>>> So I guess folding the zero-case into the else path is the best we
>>> can do. However I'm assuming the zero case will work for our
>>> internal platforms ... if it doesn't then we'd have to pollute the
>>> shared code with info for the closed platforms. :(
>>>
>>> David
>>> -----
>>>
>>>
>>> - Style nit: please use i++ rather than i ++
>>>
>>>
>>> Fixed.
>>>
>>> Aside: we should eradicate the use of
>>> sigprocmask and
>>> replace with
>>> the thread specific version.
>>>
>>>
>>> Agree. Though I never saw any errors stemming from the
>>> use of
>>> sigprocmask(). According to POSIX, sigprocmask() is
>>> undefined in
>>> multithreaded environment, and I guess most OSes just
>>> default to
>>> pthread_sigmask.
>>>
>>>
>>> Yes "probably" works okay but I hate to see us using
>>> something with
>>> undefined semantics. That's future clean up though.
>>>
>>>
>>> We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask()
>>> instead of
>>> sigprocmask. Works fine. We can port this to the OpenJDK.
>>>
>>> Getting back to the "thinking more about this" ...
>>> If a
>>> synchronous
>>> signal is blocked at the time it is generated
>>> then it
>>> should remain
>>> pending on the thread (POSIX spec) but that
>>> doesn't tell us
>>> what the
>>> thread will then do - retry the faulting
>>> instruction? Become
>>> unschedulable? So I can easily imagine that a hang
>>> or process
>>> termination may result.
>>>
>>>
>>> This is exactly what happens, but it is actually
>>> covered by
>>> POSIX, see
>>> doc on pthread_sigmask: "If any of the SIGFPE, SIGILL,
>>> SIGSEGV, or
>>> SIGBUS signals are generated while they are blocked,
>>> the result is
>>> undefined, unless the signal was generated by the
>>> /kill/()
>>>
>>>
>>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>>> functions/kill.html
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
>>>
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>
>>> function, the /sigqueue/()
>>>
>>>
>>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>>> functions/sigqueue.html
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>>> functions/sigqueue.html>
>>>
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>>> functions/sigqueue.html
>>>
>>>
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html
>>> >>>
>>>
>>> function, or the /raise/()
>>>
>>>
>>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>>> functions/raise.html
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>>> >
>>>
>>>
>>>
>>> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
>>>
>>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>
>>> function."
>>>
>>>
>>> Thanks - I managed to miss that part even though I found
>>> the other
>>> part about the signal handling function returning. :(
>>>
>>>
>>> It is well hidden, I found it by accident :) To me it looks like
>>> they
>>> kept it intentionally vague, to not block platforms where those
>>> signals
>>> could be somehow dealt with automatically? Hard to see though
>>> how this
>>> would work.
>>>
>>>
>>>
>>> In reality, process usually aborts abnormally with the
>>> default
>>> action
>>> for the signal, e.g. printing out "Illegal
>>> Instruction". On
>>> MacOS, we
>>> hang (until the Watcherthread finally kills the VM).
>>> On old
>>> AIXes, we
>>> die without a trace.
>>>
>>> This also can be easily tried out by removing SIGILL
>>> from the
>>> list of
>>> signals in vmError_<os>.cpp and executing:
>>>
>>> java -XX:ErrorHandlerTest=14
>>> -XX:TestCrashInErrorHandler=15
>>>
>>> which will crash first with a SIGSEGV, then in error
>>> handling with a
>>> secondary SIGILL. This will interrupt error reporting
>>> and kill
>>> or hang
>>> the process.
>>>
>>>
>>> In that sense unblocking those signals whilst
>>> handling the
>>> initial
>>> signal may well allow the error reporting process
>>> to continue
>>> further. But I'm unclear exactly how this plays
>>> out:
>>>
>>> - synchronous signal encountered
>>> - crash_handler invoked
>>>
>>> - VMError::report_and_die executes
>>> - secondary signal encountered
>>>
>>> - crash_handler invoked again
>>>
>>>
>>> almost: not again, different signal handler now. First
>>> signal was
>>> handled by "JVM_handle_<os>_signal()"
>>>
>>>
>>> Ah missed that - thanks - not that it makes much
>>> difference :)
>>>
>>>
>>> I just like nitpicking :)
>>>
>>> - VMError::report_and_die executes again and
>>> sees the
>>> recursion and
>>> returns (ignoring abort due to excessive recursive
>>> errors)
>>>
>>>
>>> No..
>>>
>>> Is that right? So we actually return from the
>>> crash_handler?
>>>
>>>
>>> Oh, but we dont return. VMError::report_and_die()
>>> will just
>>> create a new
>>> frame and re-execute VMError::report() of the first
>>> VMError object.
>>> Which then will continue with the next STEP. We never
>>> return,
>>> for each
>>> secondary error signal a new frame is created.
>>>
>>>
>>> I had trouble tracing through exactly what might happen
>>> on the
>>> recursive call to report_and_die. I see know that report
>>> comes from:
>>>
>>> staticBufferStream sbs(buffer, O_BUFLEN, &log);
>>> first_error->report(&sbs);
>>> first_error->_current_step = 0; // reset
>>> current_step
>>> first_error->_current_step_____info = ""; // reset
>>> current_step
>>>
>>> string
>>>
>>> so the second time through we will call report and
>>> _current_step
>>> should indicate where to start executing from.
>>>
>>>
>>> Exactly. There is also a catch, in that the stack usage goes
>>> up. Not
>>> endlessly, it is limited by the number of error reporting steps.
>>> The more stack VmError::report() does cost, the less well this
>>> works,
>>> especially in stack overflow scenarios.
>>>
>>> Which is why we extended SafeFetch and enabled it for the use
>>> in the
>>> error handler, which will be one of the the next patches I'd
>>> like to
>>> port to the OpenJDK, once this one is thru.
>>>
>>>
>>> This all happens in VMError::report_and_die:
>>> -> first error ? anchor VMError object in a static
>>> variable and
>>> execute
>>> VMError::report()
>>> -> secondary error?
>>> -> different thread? just sleep forever
>>> -> same thread? new frame, re-enter
>>> VMError::report(). Once
>>> done, abort.
>>>
>>> I always found that rather neat, but in fact that is
>>> not our
>>> invention
>>> but Sun's :) Anyway, my fix does not change this
>>> behaviour for
>>> better or
>>> worse, it only makes it usable for more cases.
>>>
>>> Because this puts us in undefined territory
>>> according to POSIX:
>>>
>>> "The behavior of a process is undefined after it
>>> returns
>>> normally
>>> from a signal-catching function for a SIGBUS,
>>> SIGFPE,
>>> SIGILL, or
>>> SIGSEGV signal that was not generated by kill(),
>>> sigqueue(), or
>>> raise()."
>>>
>>> true, but we dont return...
>>>
>>> On top of that you also have the issue that error
>>> reporting
>>> does a
>>> whole bunch of things that are not
>>> async-signal-safe so we can
>>> easily encounter hangs or aborts.
>>>
>>> But we're dying anyway so I guess none of this
>>> really
>>> matters. If
>>> re-enabling these signals allows error reporting to
>>> progress further
>>> in some cases then that is a win.
>>>
>>>
>>> Actually, this covers a lot of cases, mostly because
>>> SIGSEGV during
>>> error reporting is common, so if the original error
>>> was not
>>> SIGSEGV, but
>>> e.g. SIGILL, this would always result in broken hs-err
>>> files.
>>>
>>> The back story is that at SAP, we rely heavily on the
>>> hs-err
>>> files. They
>>> are our main tool for support, because working with
>>> cores is
>>> often not
>>> feasible. So, we put a lot of work in making error
>>> reporting
>>> reliable
>>> across all platforms. This is also covered by many
>>> tests which
>>> crash the
>>> VM in exciting ways and check the hs-err files for
>>> completeness.
>>>
>>>
>>> OK. Modulo the cpu specific SIGILL part everything else
>>> seems fine.
>>>
>>> Great. just tell me how you want that part.
>>>
>>> Kind regards, Thomas
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>
More information about the hotspot-runtime-dev
mailing list