RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Nov 27 09:27:02 UTC 2014
I am preparing a jtreg tests which would fail if no SIGILL is produced. A
real SIGILL is needed to make the test meaningful, although I guess a fake
SIGILL (kill() or raise()) would make the test pass too. Which could be a
workaround for the time being.
I could live with either #ifdef in shared code - shared code already
contains lots of #ifdef ARM - or with cpu-specific files; I also could add
the debug_<cpu>.hpp files needed for your solution.
Kind Regards, Thomas
On Thu, Nov 27, 2014 at 10:01 AM, David Holmes <david.holmes at oracle.com>
wrote:
> On 27/11/2014 5:36 PM, Thomas Stüfe wrote:
>
>> Unfortunately, I cannot test it, as I have no ARM environment. The best
>> I can come up with without testing is this:
>> http://stackoverflow.com/questions/16081618/programmatically-cause-
>> undefined-instruction-exception
>>
>
> The issue is how to handle this? Put ifdefs for ARM in the open code?
> Revert to your per-platform solution? Some other variation? Or do we just
> not care if we can't trigger SIGILL on ARM? Though I'd like to hear from
> the AARCH64 folk too.
>
> David
>
> Kind regards, Thomas
>>
>> On Thu, Nov 27, 2014 at 6:18 AM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>> On 27/11/2014 10:49 AM, David Holmes wrote:
>>
>> On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>>
>> Hi David,
>>
>> here you go:
>> http://cr.openjdk.java.net/~__stuefe/webrevs/8065895/webrev.
>> __02/
>>
>> <http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.
>> 02/>
>>
>> Reverted SIGILL-generating function back to its original
>> form, plus the
>> folding of the 000 case.
>>
>>
>> Thanks Thomas! While we are awaiting a second reviewer I will
>> test this
>> out internally. It may take a day or two sorry.
>>
>>
>> Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV
>> instead:
>>
>> will jump to PC 0xb6fb1000, which should cause a SIGILL.
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280
>>
>> If I read the ARM architecture manual correctly all zeroes will map
>> to a conditional AND instruction (Ref A8.6.12 AND(register))
>>
>> David
>>
>>
>> David
>>
>> I only can guess what your closed platforms are, but if it
>> is ARM, I
>> believe opcodes 0-31 are undefined. For ia64, 0 is undefined
>> as well.
>>
>> Kind regards, Thomas
>>
>>
>> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes
>> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.__com
>>
>> <mailto:david.holmes at oracle.com>>> wrote:
>>
>> On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>>
>> Hi David,
>> ...
>>
>> - In debug.cpp for the SIGILL can you
>> define the
>> all zero
>> case as a
>> default so we only need to add
>> platform specific
>> definitions when
>> all zeroes doesn't work. I really
>> hate seeing all
>> that CPU
>> selection
>> in shared code. :(
>>
>>
>> Agreed and fixed, moved the CPU-specific
>> sections into
>> CPU-specific files.
>>
>>
>> I'd really like to see a way to share the
>> all-zeroes case
>> so that we
>> don't need to add platform specific code
>> unnecessarily.
>>
>>
>> sooo.. back to the original code then, just with
>> the #ifdef,
>> just with
>> the zero-cases all folded in into the #else path?
>> Or do you
>> prefer
>> something else?
>>
>>
>> Elsewhere there is a pattern of defining per-platform
>> values that
>> can override the shared definition. eg:
>>
>> #ifndef HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
>> Foo XXX = ...; //shared/default initalization
>> #endif
>>
>> but this assumes a platform specific header has already
>> been
>> included that can do:
>>
>> #define HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
>>
>> Foo XXX = ... ; // platform specific initialization
>>
>> But that is not the case for debug.hpp.
>>
>> So I guess folding the zero-case into the else path is
>> the best we
>> can do. However I'm assuming the zero case will work
>> for our
>> internal platforms ... if it doesn't then we'd have to
>> pollute the
>> shared code with info for the closed platforms. :(
>>
>> David
>> -----
>>
>>
>> - Style nit: please use i++ rather
>> than i ++
>>
>>
>> Fixed.
>>
>> Aside: we should eradicate the use of
>> sigprocmask and
>> replace with
>> the thread specific version.
>>
>>
>> Agree. Though I never saw any errors
>> stemming from the
>> use of
>> sigprocmask(). According to POSIX,
>> sigprocmask() is
>> undefined in
>> multithreaded environment, and I guess
>> most OSes just
>> default to
>> pthread_sigmask.
>>
>>
>> Yes "probably" works okay but I hate to see us
>> using
>> something with
>> undefined semantics. That's future clean up
>> though.
>>
>>
>> We (SAP JVM) already use pthread_sigmask() /
>> thr_sigsetmask()
>> instead of
>> sigprocmask. Works fine. We can port this to the
>> OpenJDK.
>>
>> Getting back to the "thinking more
>> about this" ...
>> If a
>> synchronous
>> signal is blocked at the time it is
>> generated
>> then it
>> should remain
>> pending on the thread (POSIX spec)
>> but that
>> doesn't tell us
>> what the
>> thread will then do - retry the
>> faulting
>> instruction? Become
>> unschedulable? So I can easily
>> imagine that a hang
>> or process
>> termination may result.
>>
>>
>> This is exactly what happens, but it is
>> actually
>> covered by
>> POSIX, see
>> doc on pthread_sigmask: "If any of the
>> SIGFPE, SIGILL,
>> SIGSEGV, or
>> SIGBUS signals are generated while they
>> are blocked,
>> the result is
>> undefined, unless the signal was generated
>> by the
>> /kill/()
>>
>>
>> <http://pubs.opengroup.org/______onlinepubs/009695399/______
>> functions/kill.html
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/kill.html>
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/kill.html
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html>>
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/kill.html
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html
>> <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/kill.html>>>>
>> function, the /sigqueue/()
>>
>>
>> <http://pubs.opengroup.org/______onlinepubs/009695399/______
>> functions/sigqueue.html
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/sigqueue.html>
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/sigqueue.html
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html>>
>>
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/sigqueue.html
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html>
>>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html
>> <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/sigqueue.html>>>>
>>
>> function, or the /raise/()
>>
>>
>> <http://pubs.opengroup.org/______onlinepubs/009695399/______
>> functions/raise.html
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/raise.html>
>>
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/raise.html
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html>>
>>
>>
>>
>> <http://pubs.opengroup.org/____onlinepubs/009695399/____
>> functions/raise.html
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html>
>>
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html
>> <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/raise.html>>>>
>> function."
>>
>>
>> Thanks - I managed to miss that part even
>> though I found
>> the other
>> part about the signal handling function
>> returning. :(
>>
>>
>> It is well hidden, I found it by accident :) To me
>> it looks like
>> they
>> kept it intentionally vague, to not block platforms
>> where those
>> signals
>> could be somehow dealt with automatically? Hard to
>> see though
>> how this
>> would work.
>>
>>
>>
>> In reality, process usually aborts
>> abnormally with the
>> default
>> action
>> for the signal, e.g. printing out "Illegal
>> Instruction". On
>> MacOS, we
>> hang (until the Watcherthread finally
>> kills the VM).
>> On old
>> AIXes, we
>> die without a trace.
>>
>> This also can be easily tried out by
>> removing SIGILL
>> from the
>> list of
>> signals in vmError_<os>.cpp and executing:
>>
>> java -XX:ErrorHandlerTest=14
>> -XX:TestCrashInErrorHandler=15
>>
>> which will crash first with a SIGSEGV,
>> then in error
>> handling with a
>> secondary SIGILL. This will interrupt
>> error reporting
>> and kill
>> or hang
>> the process.
>>
>>
>> In that sense unblocking those
>> signals whilst
>> handling the
>> initial
>> signal may well allow the error
>> reporting process
>> to continue
>> further. But I'm unclear exactly how
>> this plays
>> out:
>>
>> - synchronous signal encountered
>> - crash_handler invoked
>>
>> - VMError::report_and_die executes
>> - secondary signal encountered
>>
>> - crash_handler invoked again
>>
>>
>> almost: not again, different signal
>> handler now. First
>> signal was
>> handled by "JVM_handle_<os>_signal()"
>>
>>
>> Ah missed that - thanks - not that it makes much
>> difference :)
>>
>>
>> I just like nitpicking :)
>>
>> - VMError::report_and_die executes
>> again and
>> sees the
>> recursion and
>> returns (ignoring abort due to
>> excessive recursive
>> errors)
>>
>>
>> No..
>>
>> Is that right? So we actually return
>> from the
>> crash_handler?
>>
>>
>> Oh, but we dont return.
>> VMError::report_and_die()
>> will just
>> create a new
>> frame and re-execute VMError::report() of
>> the first
>> VMError object.
>> Which then will continue with the next
>> STEP. We never
>> return,
>> for each
>> secondary error signal a new frame is
>> created.
>>
>>
>> I had trouble tracing through exactly what
>> might happen
>> on the
>> recursive call to report_and_die. I see know
>> that report
>> comes from:
>>
>> staticBufferStream sbs(buffer, O_BUFLEN,
>> &log);
>> first_error->report(&sbs);
>> first_error->_current_step = 0;
>> // reset
>> current_step
>> first_error->_current_step_______info =
>>
>> ""; // reset
>> current_step
>>
>> string
>>
>> so the second time through we will call report
>> and
>> _current_step
>> should indicate where to start executing from.
>>
>>
>> Exactly. There is also a catch, in that the stack
>> usage goes
>> up. Not
>> endlessly, it is limited by the number of error
>> reporting steps.
>> The more stack VmError::report() does cost, the
>> less well this
>> works,
>> especially in stack overflow scenarios.
>>
>> Which is why we extended SafeFetch and enabled it
>> for the use
>> in the
>> error handler, which will be one of the the next
>> patches I'd
>> like to
>> port to the OpenJDK, once this one is thru.
>>
>>
>> This all happens in VMError::report_and_die:
>> -> first error ? anchor VMError object in
>> a static
>> variable and
>> execute
>> VMError::report()
>> -> secondary error?
>> -> different thread? just sleep forever
>> -> same thread? new frame, re-enter
>> VMError::report(). Once
>> done, abort.
>>
>> I always found that rather neat, but in
>> fact that is
>> not our
>> invention
>> but Sun's :) Anyway, my fix does not
>> change this
>> behaviour for
>> better or
>> worse, it only makes it usable for more
>> cases.
>>
>> Because this puts us in undefined
>> territory
>> according to POSIX:
>>
>> "The behavior of a process is
>> undefined after it
>> returns
>> normally
>> from a signal-catching function for a
>> SIGBUS,
>> SIGFPE,
>> SIGILL, or
>> SIGSEGV signal that was not generated
>> by kill(),
>> sigqueue(), or
>> raise()."
>>
>> true, but we dont return...
>>
>> On top of that you also have the
>> issue that error
>> reporting
>> does a
>> whole bunch of things that are not
>> async-signal-safe so we can
>> easily encounter hangs or aborts.
>>
>> But we're dying anyway so I guess
>> none of this
>> really
>> matters. If
>> re-enabling these signals allows
>> error reporting to
>> progress further
>> in some cases then that is a win.
>>
>>
>> Actually, this covers a lot of cases,
>> mostly because
>> SIGSEGV during
>> error reporting is common, so if the
>> original error
>> was not
>> SIGSEGV, but
>> e.g. SIGILL, this would always result in
>> broken hs-err
>> files.
>>
>> The back story is that at SAP, we rely
>> heavily on the
>> hs-err
>> files. They
>> are our main tool for support, because
>> working with
>> cores is
>> often not
>> feasible. So, we put a lot of work in
>> making error
>> reporting
>> reliable
>> across all platforms. This is also covered
>> by many
>> tests which
>> crash the
>> VM in exciting ways and check the hs-err
>> files for
>> completeness.
>>
>>
>> OK. Modulo the cpu specific SIGILL part
>> everything else
>> seems fine.
>>
>> Great. just tell me how you want that part.
>>
>> Kind regards, Thomas
>>
>> Thanks,
>> David
>>
>>
>>
>>
More information about the hotspot-runtime-dev
mailing list