RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process

Thomas Stüfe thomas.stuefe at gmail.com
Wed Nov 26 11:37:44 UTC 2014


Hi David,
...


>      - In debug.cpp for the SIGILL can you define the  all zero case as a
>>     default so we only need to add platform specific definitions when
>>     all zeroes doesn't work. I really hate seeing all that CPU selection
>>     in shared code. :(
>>
>>
>> Agreed and fixed, moved the CPU-specific sections into CPU-specific files.
>>
>
> I'd really like to see a way to share the all-zeroes case so that we don't
> need to add platform specific code unnecessarily.
>
>
sooo.. back to the original code then, just with the #ifdef, just with the
zero-cases all folded in into the #else path? Or do you prefer something
else?


>      - Style nit: please use i++ rather than i ++
>>
>>
>> Fixed.
>>
>>     Aside: we should eradicate the use of sigprocmask and replace with
>>     the thread specific version.
>>
>>
>> Agree. Though I never saw any errors stemming from the use of
>> sigprocmask(). According to POSIX, sigprocmask() is undefined in
>> multithreaded environment, and I guess most OSes just default to
>> pthread_sigmask.
>>
>
> Yes "probably" works okay but I hate to see us using something with
> undefined semantics. That's future clean up though.
>
>
We (SAP JVM) already use pthread_sigmask() / thr_sigsetmask() instead of
sigprocmask. Works fine. We can port this to the OpenJDK.



>      Getting back to the "thinking more about this" ... If a synchronous
>>     signal is blocked at the time it is generated then it should remain
>>     pending on the thread (POSIX spec) but that doesn't tell us what the
>>     thread will then do - retry the faulting instruction? Become
>>     unschedulable? So I can easily imagine that a hang or process
>>     termination may result.
>>
>>
>> This is exactly what happens, but it is actually covered by POSIX, see
>> doc on pthread_sigmask: "If any of the SIGFPE, SIGILL, SIGSEGV, or
>> SIGBUS signals are generated while they are blocked, the result is
>> undefined, unless the signal was generated by the /kill/()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>
>> function, the /sigqueue/()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>
>> function, or the /raise/()
>> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>
>> function."
>>
>
> Thanks - I managed to miss that part even though I found the other part
> about the signal handling function returning. :(


It is well hidden, I found it by accident :) To me it looks like they kept
it intentionally vague, to not block platforms where those signals could be
somehow dealt with automatically? Hard to see though how this would work.


>
>
>  In reality, process usually aborts abnormally with the default action
>> for the signal, e.g. printing out "Illegal Instruction". On MacOS, we
>> hang (until the Watcherthread finally kills the VM). On old AIXes, we
>> die without a trace.
>>
>> This also can be easily tried out by removing SIGILL from the list of
>> signals in vmError_<os>.cpp and executing:
>>
>> java -XX:ErrorHandlerTest=14 -XX:TestCrashInErrorHandler=15
>>
>> which will crash first with a SIGSEGV, then in error handling with a
>> secondary SIGILL. This will interrupt error reporting and kill or hang
>> the process.
>>
>>
>>     In that sense unblocking those signals whilst handling the initial
>>     signal may well allow the error reporting process to continue
>>     further. But I'm unclear exactly how this plays out:
>>
>>     - synchronous signal encountered
>>     - crash_handler invoked
>>
>>     - VMError::report_and_die executes
>>     - secondary signal encountered
>>
>>     - crash_handler invoked again
>>
>>
>> almost: not again, different signal handler now. First signal was
>> handled by "JVM_handle_<os>_signal()"
>>
>
> Ah missed that - thanks - not that it makes much difference :)
>
>
I just like nitpicking :)


>      - VMError::report_and_die executes again and sees the recursion and
>>     returns (ignoring abort due to excessive recursive errors)
>>
>>
>> No..
>>
>>     Is that right? So we actually return from the crash_handler?
>>
>>
>> Oh, but we dont return. VMError::report_and_die() will just create a new
>> frame and re-execute VMError::report() of the first VMError object.
>> Which then will continue with the next STEP. We never return, for each
>> secondary error signal a new frame is created.
>>
>
> I had trouble tracing through exactly what might happen on the recursive
> call to report_and_die. I see know that report comes from:
>
>     staticBufferStream sbs(buffer, O_BUFLEN, &log);
>     first_error->report(&sbs);
>     first_error->_current_step = 0;         // reset current_step
>     first_error->_current_step_info = "";   // reset current_step string
>
> so the second time through we will call report and _current_step should
> indicate where to start executing from.
>
>
Exactly. There is also a catch, in that the stack usage goes up. Not
endlessly, it is limited by the number of error reporting steps.
The more stack VmError::report() does cost, the less well this works,
especially in stack overflow scenarios.

Which is why we extended SafeFetch and enabled it for the use in the error
handler, which will be one of the the next patches I'd like to port to the
OpenJDK, once this one is thru.


>
>  This all happens in VMError::report_and_die:
>> -> first error ? anchor VMError object in a static variable and execute
>> VMError::report()
>> -> secondary error?
>>     -> different thread? just sleep forever
>>     -> same thread? new frame, re-enter VMError::report(). Once done,
>> abort.
>>
>> I always found that rather neat, but in fact that is not our invention
>> but Sun's :) Anyway, my fix does not change this behaviour for better or
>> worse, it only makes it usable for more cases.
>>
>>     Because this puts us in undefined territory according to POSIX:
>>
>>     "The behavior of a process is undefined after it returns normally
>>     from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or
>>     SIGSEGV signal that was not generated by kill(), sigqueue(), or
>>     raise()."
>>
>> true, but we dont return...
>>
>>     On top of that you also have the issue that error reporting does a
>>     whole bunch of things that are not async-signal-safe so we can
>>     easily encounter hangs or aborts.
>>
>>     But we're dying anyway so I guess none of this really matters. If
>>     re-enabling these signals allows error reporting to progress further
>>     in some cases then that is a win.
>>
>>
>> Actually, this covers a lot of cases, mostly because SIGSEGV during
>> error reporting is common, so if the original error was not SIGSEGV, but
>> e.g. SIGILL, this would always result in broken hs-err files.
>>
>> The back story is that at SAP, we rely heavily on the hs-err files. They
>> are our main tool for support, because working with cores is often not
>> feasible. So, we put a lot of work in making error reporting reliable
>> across all platforms. This is also covered by many tests which crash the
>> VM in exciting ways and check the hs-err files for completeness.
>>
>
> OK. Modulo the cpu specific SIGILL part everything else seems fine.
>
> Great. just tell me how you want that part.

Kind regards, Thomas


> Thanks,
> David
>


More information about the hotspot-runtime-dev mailing list