RFR: 8250637: UseOSErrorReporting times out (on Mac and Linux)

Sat Oct 24 10:25:35 UTC 2020

On Fri, 23 Oct 2020 16:20:54 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:

>>> Some issues:
>>> 
>>> * 0 is not a valid signal number
>> 
>> Right, I used to start at 1, but then copied the "for loop" code from another part of existing code for "NSIG" and forgot to fix the start value" I will fix that.
>> 
>>> * NSIG is obscure; may be ridiculously large (e.g. AIX I believe 1024) and still may not contain all signals for some platforms (e.g. real time signals on BSD (?) are not included)
>> 
>> 1. The fix will work for any NSIG value, why would its value be relevant here? That's the value we use anywhere else, so if it's wrong here, then it's wrong everywhere else and we have a bigger issue to deal here than just this fix.
>> 2. We are not accounting for real time signals on BSD anywhere else, so why do we need to do it here?
>> 
>>> * You now flatten here all signal handlers across the board. What if some third party app had installed their own. I would only reset those which had been originally installed for hotspot usage (beware ReduceSignalUsage).
>> 
>> The purpose of the "UseOSErrorReporting" flag is to let the process die, so that it can be handled by the OS. If we let third party app signal handlers jump in, we can't guarantee that "UseOSErrorReporting" does its job.
>> 
>>> Which brings me to the point that if we fail to re-raise the original error signal, now we run with signal handlers completely disabled :( ... see my original concerns.
>> 
>> I will loosely quote what David Holmes likes to say about signal handlers and hope I get it right: "this is a best effort, nothing is guaranteed in signal handlers". By that I mean yes, you are right that we can't guarantee correct behavior, but at this point of process lifecycle we caught the crash and produced hs_err log, anything else is just great if it works and no big loss if things go wrong now (how bad exactly can things go at this point?). From my experience, however, it works.
>
>> Thinking further, we probably want to just the default handler for the error signal that happened. And we also want to do this only for synchronous error signals (segv, sigill, sigbus, sigfpe), and possibly only if they are "real" (had not been sent by kill or pthread_kill).
> 
> Handling "UseOSErrorReporting" is the last thing we ask the process to do before letting it crash and get caught by the OS. I'm not sure we need to finesse here with which signals we want to flatten. This keeps the code simple, though I will add a comment to "PosixSignals::clear_signal_handlers()" describing the assumptions and expectations.

(resuming the discussion, which had been split between here and the JBS comment section):

To summarize my concerns stated in the JBS: I think returning from fatal error signal handling in the hope of retriggering the same fault can be a bit dangerous. 

If all we risk was more colorful crashes I'd see no issue here. We are already crashing, so what. But to my mind the worst thing which can happen are not crashes or hangs but data corruption. That is highly unlikely but cannot be ruled out.

First off, Posix states clearly that this is undefined behavior. Lets ignore this :-) and continue. 

Spurious crashes are not that rare. Consider this example: GC has a bug which temporarily leaves an oop in an invalid state. Our thread triggers a SEGV and enters signal handling. We write the hs-err file. Maybe three other threads also trigger SEGVs in the meantime. They enter error handling and are now indefinitely parked, never to return.

Now we return. The original signal has been consumed and taken from the signal queue.

The first issue I see is that we may now first process all deferred signals, before resuming normal work and returning to the original pc to re-trigger the original fault. Most signals we block while in the handler. When returning we unblock, and these signals then are processed. IIRC Posix does not state when this happens, but it may very well be we do this right now. Even if this does not cause problems it prolongs the time even further before we get to retrigger the original fault.

The concurrently running GC may have fixed the issue in the meantime. The oop is now valid, we do not re-crash. Now we are in some invalid state but continue to run. Also three of our threads have stopped working and are stuck in VMError. My worry with all this is that this undefined VM state can cause application problems in ways a normal crash cannot.

One of the more benign problems of failing to re-trigger the fault could be just more confusion: since the handlers are now all set to default, the next polling page access or stack bangs will tear down the VM. Your OS reporters will show a VM having crashed at places which are not real crashes.

I know this may all sound contrived. I probably can't dissuade you from doing it. No-one has to set the switch after all. 

------

The following remarks are the actual review:

1) Your patch does not work for asserts, which makes kind of sense. UseOSErrorReporting will just cause the VM to continue after an assert. Easy to reproduce with `java  -XX:ErrorHandlerTest=1 -XX:+UseOSErrorReporting` - we now just loop.
2) We should not do this for any signal other than synchronous error signals (`SIGSEGV`, `SIGILL`, `SIGFPE`, `SIGBUS`) where we can be sure that we re-crash right at the return pc.
3) Also, these signals should not have been raised with `kill()`/`raise()`/etc. In other words, they should be real faults.
4) We should not remove signal handlers for all signals. That may expose us to more problems. If the intent is to retrigger our fault, we only should install the default signal handler for that one fault. 

-------

I also tried your patch on my linux box. When I trigger a crash - either sending a signal with kill or generating this signal with `-XX:ErrorHandlerTest=xx` - with UseOsErrorReporting, I get a hs-err file, then a core dump, but the core seems to have nothing to do with the crash. Regardless of the original signal, it always shows a SIGSEV in some assert which to me indicates an assertion polling page access.

-------------

PR: https://git.openjdk.java.net/jdk/pull/813