RFR: 8250637: UseOSErrorReporting times out (on Mac and Linux)

Mon Oct 26 22:40:59 UTC 2020

On 27/10/2020 1:35 am, Gerard Ziemski wrote:
> On Mon, 26 Oct 2020 04:33:03 GMT, David Holmes <dholmes at openjdk.org> wrote:
> 
>> Hi Gerard,
>>
>> I think we have a fundamental problem here that UseOSErrorReporting was only ever intended for use on Windows. It simply allows VMError::report_and_die to return instead of actually making the VM "die". For Windows this means we can continue to propagate the Windows exception and thus allow Windows Error Reporting (WER) to take over. Whether this actually works correctly or not is a different matter.
>>
>> For non-Windows there is no pre-established alternative code path for report_and_die() returning.
>>
>> In the bug report you write:
>>
>>> On Mac/Linux it would look more like this:
>>> #1 catch signal in our handler
>>> #2 generate hs_err log
>>> #3 turn off our signal handler
>>> #4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated
>>
>> To me you are now inventing what UseOSErrorReporting should mean on non-Windows, and I don't agree with it. I don't think it should mean that we re-crash using the "default" signal response and consider that as using "OS error reporting". To me that is just not valid, especially when we cannot return from a signal handling context in many cases without incurring undefined behaviour. To me #4 is not a valid expectation as we have no way to know what will happen next if the signal handler returns. It would also be wrong to just continue execution after an assertion or guarantee fails.
>>
>> I'm assuming that the motivation here is that on macOS if we use the default signal handling modes then macOS will do its own error reporting? If so I would suggest that the right response may be to return from report_and_die (on macOS only) and then deliberately crash after restoring the default handler. Obviously that will change which "crash" the OS reports but that is likely to happen anyway as you cannot guarantee how you will crash after trying to continue (and this goes beyond our general "best effort" approaches in signal handling.)
>>
>> Beyond that I share Thomas's concerns about making sweeping changes to installed signal handlers.
>>
>> So my preferred approaches here would be:
>>
>> 1. Make UseOSErrorReporting Windows only; or
>> 2. Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.
> 
> hi David,
> 
> Many thanks for the review and finding the background info on the history of this issue.
> 
> How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail.

No, there is a semantic underpinning as to what it means for there to be
OS error reporting on a given platform. Windows has a nicely defined
model. Other platforms are not so nice. On macOS they really don't want
apps to attempt any kind of crash handling on their own. :)

> On Windows we forward the crash to the OS to handle it. In this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, but that does not mean it's not a meaningful way to forward the crash to the OS, if that's how the OS wants it. Please see this technical note from Apple, https://developer.apple.com/forums/thread/113742, where Apple suggests that the way to let macOS handle the crash is to:
> 
> "unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
> 
> That's how Apple suggests we do it for Mac.

That is a blog by an Apple developer giving some very general advice, 
and IMO lacking in some necessary detail. That quote above is in the 
context of answering:

  "Finally, there’s the question of how to exit from your signal handler."

The suggestion to "then return" hits UB for the synchronous error 
signals - a fact not mentioned in the blog entry. The assertion that:

"This will cause the crashed process to continue execution, crash again, 
... "

is a naive oversimplification. If you just seg-faulted doing a read from 
memory how can you continue execution? What does that mean when the read 
yielded no value? Will you just continue with a random value? Will the 
system try to re-execute the read and so crash again? Maybe it will 
crash again, maybe it won't. Maybe it will do something in the meantime 
that leads to totally unexpected behaviour (as Thomas previously 
described). Hence my suggestion that if you are going to attempt this 
path for macOS then you need to introduce the second crash so we know 
exactly what will happen. Returning from the original signal handler is 
not an option IMO.
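
As a rough illustration of that "second crash" idea, here is a minimal
sketch. It is not HotSpot code, and the names crash_handler and
run_child_and_get_term_signal are hypothetical. Instead of returning, the
handler restores SIG_DFL, unblocks the signal, and re-raises it, so the
process dies at a known point. The crash runs in a forked child purely so
the outcome can be observed:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler: stands in for "write the report, then crash on purpose". */
static void crash_handler(int sig) {
    /* Placeholder for the error report; write() is async-signal-safe. */
    static const char msg[] = "handler: report done, re-raising with SIG_DFL\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;

    /* Restore the default disposition so the OS sees an unhandled signal. */
    signal(sig, SIG_DFL);

    /* The signal is blocked while its handler runs; unblock it so the
       re-raised signal is delivered here rather than after a return. */
    sigset_t unblock;
    sigemptyset(&unblock);
    sigaddset(&unblock, sig);
    sigprocmask(SIG_UNBLOCK, &unblock, NULL);

    /* Deliberate second crash: we never return to the faulting code. */
    raise(sig);
    _exit(127); /* Only reached if the default action did not terminate us. */
}

/* Forks a child that installs the handler and raises `sig`.
   Returns the signal that terminated the child, 0 if it exited
   normally, or -1 on fork/wait failure. */
int run_child_and_get_term_signal(int sig) {
    assert(sig > 0);
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = crash_handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(sig, &sa, NULL) != 0) {
            _exit(126);
        }
        raise(sig);  /* Simulated fault; a real one would come from the CPU. */
        _exit(125);  /* Not reached if the handler re-raised successfully. */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling run_child_and_get_term_signal(SIGSEGV) from a test main() should
report that the child was killed by SIGSEGV under the default action. The
crash the OS records is the deliberate one, not the original fault, which
is the trade-off mentioned earlier in this thread.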

> I can limit the scope of this fix to just macOS here, like I was planning it for JDK-8237727 and worry about Linux in a different issue.

Yes please limit to macOS only. We should look at how to remove the flag 
from platforms where it has no well-defined meaning.

Thanks,
David
-----

> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/813
>