RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
Thomas Stüfe
thomas.stuefe at gmail.com
Wed Nov 26 13:33:27 UTC 2014
Hi David,
here you go: http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/
Reverted SIGILL-generating function back to its original form, plus the
folding of the 000 case.
I only can guess what your closed platforms are, but if it is ARM, I
believe opcodes 0-31 are undefined. For ia64, 0 is undefined as well.
Kind regards, Thomas
On Wed, Nov 26, 2014 at 1:02 PM, David Holmes <david.holmes at oracle.com>
wrote:
> On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>
>> Hi David,
>> ...
>>
>> - In debug.cpp for the SIGILL can you define the all zero
>> case as a
>> default so we only need to add platform specific
>> definitions when
>> all zeroes doesn't work. I really hate seeing all that CPU
>> selection
>> in shared code. :(
>>
>>
>> Agreed and fixed, moved the CPU-specific sections into
>> CPU-specific files.
>>
>>
>> I'd really like to see a way to share the all-zeroes case so that we
>> don't need to add platform specific code unnecessarily.
>>
>>
>> sooo.. back to the original code then, just with the #ifdef, just with
>> the zero-cases all folded in into the #else path? Or do you prefer
>> something else?
>>
>
> Elsewhere there is a pattern of defining per-platform values that can
> override the shared definition. eg:
>
> #ifndef HAS_SPECIAL_PLATFORM_VALUE_FOR_XXXX
> Foo XXX = ...; //shared/default initalization
> #endif
>
> but this assumes a platform specific header has already been included that
> can do:
>
> #define HAS_SPECIAL_PLATFORM_VALUE_FOR_XXXX
> Foo XXX = ... ; // platform specific initialization
>
> But that is not the case for debug.hpp.
>
> So I guess folding the zero-case into the else path is the best we can do.
> However I'm assuming the zero case will work for our internal platforms ...
> if it doesn't then we'd have to pollute the shared code with info for the
> closed platforms. :(
>
> David
> -----
>
>
>> - Style nit: please use i++ rather than i ++
>>
>>
>> Fixed.
>>
>> Aside: we should eradicate the use of sigprocmask and
>> replace with
>> the thread specific version.
>>
>>
>> Agree. Though I never saw any errors stemming from the use of
>> sigprocmask(). According to POSIX, sigprocmask() is undefined in
>> multithreaded environment, and I guess most OSes just default to
>> pthread_sigmask.
>>
>>
>> Yes "probably" works okay but I hate to see us using something with
>> undefined semantics. That's future clean up though.
>>
>>
>> We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask() instead of
>> sigprocmask. Works fine. We can port this to the OpenJDK.
>>
>> Getting back to the "thinking more about this" ... If a
>> synchronous
>> signal is blocked at the time it is generated then it
>> should remain
>> pending on the thread (POSIX spec) but that doesn't tell us
>> what the
>> thread will then do - retry the faulting instruction? Become
>> unschedulable? So I can easily imagine that a hang or process
>> termination may result.
>>
>>
>> This is exactly what happens, but it is actually covered by
>> POSIX, see
>> doc on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or
>> SIGBUS signals are generated while they are blocked, the result is
>> undefined, unless the signal was generated by the /kill/()
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/kill.html
>> <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/kill.html>>
>> function, the /sigqueue/()
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/sigqueue.html
>> <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/sigqueue.html>>
>> function, or the /raise/()
>> <http://pubs.opengroup.org/__onlinepubs/009695399/__
>> functions/raise.html
>>
>> <http://pubs.opengroup.org/onlinepubs/009695399/
>> functions/raise.html>>
>> function."
>>
>>
>> Thanks - I managed to miss that part even though I found the other
>> part about the signal handling function returning. :(
>>
>>
>> It is well hidden, I found it by accident :) To me it looks like they
>> kept it intentionally vague, to not block platforms where those signals
>> could be somehow dealt with automatically? Hard to see though how this
>> would work.
>>
>>
>>
>> In reality, process usually aborts abnormally with the default
>> action
>> for the signal, e.g. printing out "Illegal Instruction". On
>> MacOS, we
>> hang (until the Watcherthread finally kills the VM). On old
>> AIXes, we
>> die without a trace.
>>
>> This also can be easily tried out by removing SIGILL from the
>> list of
>> signals in vmError_<os>.cpp and executing:
>>
>> java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>>
>> which will crash first with a SIGSEGV, then in error handling
>> with a
>> secondary SIGILL. This will interrupt error reporting and kill
>> or hang
>> the process.
>>
>>
>> In that sense unblocking those signals whilst handling the
>> initial
>> signal may well allow the error reporting process to continue
>> further. But I'm unclear exactly how this plays out:
>>
>> - synchronous signal encountered
>> - crash_handler invoked
>>
>> - VMError::report_and_die executes
>> - secondary signal encountered
>>
>> - crash_handler invoked again
>>
>>
>> almost: not again, different signal handler now. First signal was
>> handled by "JVM_handle_<os>_signal()"
>>
>>
>> Ah missed that - thanks - not that it makes much difference :)
>>
>>
>> I just like nitpicking :)
>>
>> - VMError::report_and_die executes again and sees the
>> recursion and
>> returns (ignoring abort due to excessive recursive errors)
>>
>>
>> No..
>>
>> Is that right? So we actually return from the crash_handler?
>>
>>
>> Oh, but we dont return. VMError::report_and_die() will just
>> create a new
>> frame and re-execute VMError::report() of the first VMError
>> object.
>> Which then will continue with the next STEP. We never return,
>> for each
>> secondary error signal a new frame is created.
>>
>>
>> I had trouble tracing through exactly what might happen on the
>> recursive call to report_and_die. I see know that report comes from:
>>
>> staticBufferStream sbs(buffer, O_BUFLEN, &log);
>> first_error->report(&sbs);
>> first_error->_current_step = 0; // reset current_step
>> first_error->_current_step___info = ""; // reset current_step
>>
>> string
>>
>> so the second time through we will call report and _current_step
>> should indicate where to start executing from.
>>
>>
>> Exactly. There is also a catch, in that the stack usage goes up. Not
>> endlessly, it is limited by the number of error reporting steps.
>> The more stack VmError::report() does cost, the less well this works,
>> especially in stack overflow scenarios.
>>
>> Which is why we extended SafeFetch and enabled it for the use in the
>> error handler, which will be one of the the next patches I'd like to
>> port to the OpenJDK, once this one is thru.
>>
>>
>> This all happens in VMError::report_and_die:
>> -> first error ? anchor VMError object in a static variable and
>> execute
>> VMError::report()
>> -> secondary error?
>> -> different thread? just sleep forever
>> -> same thread? new frame, re-enter VMError::report(). Once
>> done, abort.
>>
>> I always found that rather neat, but in fact that is not our
>> invention
>> but Sun's :) Anyway, my fix does not change this behaviour for
>> better or
>> worse, it only makes it usable for more cases.
>>
>> Because this puts us in undefined territory according to
>> POSIX:
>>
>> "The behavior of a process is undefined after it returns
>> normally
>> from a signal-catching function for a SIGBUS, SIGFPE,
>> SIGILL, or
>> SIGSEGV signal that was not generated by kill(), sigqueue(),
>> or
>> raise()."
>>
>> true, but we dont return...
>>
>> On top of that you also have the issue that error reporting
>> does a
>> whole bunch of things that are not async-signal-safe so we
>> can
>> easily encounter hangs or aborts.
>>
>> But we're dying anyway so I guess none of this really
>> matters. If
>> re-enabling these signals allows error reporting to
>> progress further
>> in some cases then that is a win.
>>
>>
>> Actually, this covers a lot of cases, mostly because SIGSEGV
>> during
>> error reporting is common, so if the original error was not
>> SIGSEGV, but
>> e.g. SIGILL, this would always result in broken hs-err files.
>>
>> The back story is that at SAP, we rely heavily on the hs-err
>> files. They
>> are our main tool for support, because working with cores is
>> often not
>> feasible. So, we put a lot of work in making error reporting
>> reliable
>> across all platforms. This is also covered by many tests which
>> crash the
>> VM in exciting ways and check the hs-err files for completeness.
>>
>>
>> OK. Modulo the cpu specific SIGILL part everything else seems fine.
>>
>> Great. just tell me how you want that part.
>>
>> Kind regards, Thomas
>>
>> Thanks,
>> David
>>
>>
More information about the hotspot-runtime-dev
mailing list