RFR: 8250637: UseOSErrorReporting times out (on Mac and Linux)

Tue Oct 27 16:08:21 UTC 2020

On Mon, 26 Oct 2020 15:32:49 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:

>> Hi Gerard,
>> 
>> I think we have a fundamental problem here that UseOSErrorReporting was only ever intended for use on Windows. It simply allows VMError::report_and_die to return instead of actually making the VM "die". For Windows this means we can continue to propagate the windows exception and thus allow Windows Error Reporting (WER) to take over. Whether this actually works correctly or not is a different matter.
>> 
>> For non-Windows there is no pre-established alternative code path for report_and_die() returning.
>> 
>> In the bug report you write:
>> 
>>> On Mac/Linux it would look more like this:
>>> 
>>> #1 catch signal in our handler
>>> #2 generate hs_err log
>>> #3 turn off our signal handler
>>> #4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated 
>>> 
>> 
>> To me you are now inventing what UseOSErrorReporting should mean on non-Windows, and I don't agree with it. I don't think it should mean that we re-crash using the "default" signal response and consider that as using "OS error reporting". To me that is just not valid, especially when we cannot return from a signal handling context in many cases without incurring undefined behaviour. To me #4 is not a valid expectation as we have no way to know what will happen next if the signal handler returns. It would also be wrong to just continue execution after an assertion or guarantee fails.
>> 
>> I'm assuming that the motivation here is that on macOS if we use the default signal handling modes then macOS will do its own error reporting? If so I would suggest that the right response may be to return from report_and_die (on macOS only) and then deliberately crash after restoring the default handler. Obviously that will change which "crash" the OS reports but that is likely to happen anyway as you cannot guarantee how you will crash after trying to continue (and this goes beyond our general "best effort" approaches in signal handling.)
>> 
>> Beyond that I share Thomas's concerns about making sweeping changes to installed signal handlers.
>> 
>> So my preferred approaches here would be:
>> 
>> 1. Make UseOSErrorReporting Windows only; or
>> 2. Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.
>> 
>> Thanks,
>> David
>
> 
> hi David,
> 
> Many thanks for the review and finding the background info on the history of this issue.
> 
> How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail.
> 
> On Windows we forward the crash to the OS to handle it. In this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, but that does not mean it's not a meaningful way to forward the crash to the OS, if that's how the OS wants it. Please see this technical note from Apple, https://developer.apple.com/forums/thread/113742 , where Apple suggests that the way to let macOS handle the crash is to:
> 
> "unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
> 
> That's how Apple suggests we do it for Mac.
> 
> I can limit the scope of this fix to just macOS here, as I was planning for JDK-8237727, and for Linux simply disable the flag for now and leave any more sophisticated fix for a follow-up issue. I do think, however, that on Linux anything would be better than a 2-minute hang.

> _Mailing list message from [David Holmes](mailto:david.holmes at oracle.com) on [hotspot-dev](mailto:hotspot-dev at openjdk.java.net):_
> 
> On 27/10/2020 1:35 am, Gerard Ziemski wrote:
> 
> > On Mon, 26 Oct 2020 04:33:03 GMT, David Holmes <dholmes at openjdk.org> wrote:
> > > Hi Gerard,
> > > I think we have a fundamental problem here that UseOSErrorReporting was only ever intended for use on Windows. It simply allows VMError::report_and_die to return instead of actually making the VM "die". For Windows this means we can continue to propagate the windows exception and thus allow Windows Error Reporting (WER) to take over. Whether this actually works correctly or not is a different matter.
> > > For non-Windows there is no pre-established alternative code path for report_and_die() returning.
> > > In the bug report you write:
> > > > On Mac/Linux it would look more like this:
> > > > #1 catch signal in our handler
> > > > #2 generate hs_err log
> > > > #3 turn off our signal handler
> > > > #4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated
> > > 
> > > 
> > > To me you are now inventing what UseOSErrorReporting should mean on non-Windows, and I don't agree with it. I don't think it should mean that we re-crash using the "default" signal response and consider that as using "OS error reporting". To me that is just not valid, especially when we cannot return from a signal handling context in many cases without incurring undefined behaviour. To me #4 is not a valid expectation as we have no way to know what will happen next if the signal handler returns. It would also be wrong to just continue execution after an assertion or guarantee fails.
> > > I'm assuming that the motivation here is that on macOS if we use the default signal handling modes then macOS will do its own error reporting? If so I would suggest that the right response may be to return from report_and_die (on macOS only) and then deliberately crash after restoring the default handler. Obviously that will change which "crash" the OS reports but that is likely to happen anyway as you cannot guarantee how you will crash after trying to continue (and this goes beyond our general "best effort" approaches in signal handling.)
> > > Beyond that I share Thomas's concerns about making sweeping changes to installed signal handlers.
> > > So my preferred approaches here would be:
> > > 1. Make UseOSErrorReporting Windows only; or
> > > 2. Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.
> > 
> > 
> > hi David,
> > Many thanks for the review and finding the background info on the history of this issue.
> > How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail.
> 
> No, there is a semantic underpinning as to what it means for there to be
> OS error reporting on a given platform. Windows has a nicely defined
> model. Other platforms not so nice. On macOS they really don't want apps
> to attempt any kind of crash handling on their own. :)
> 
> > On Windows we forward the crash to the OS to handle it, but just because in this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, does not mean it's not a meaningful way to forward it to OS, if that's how the OS wants it - please see this technical note from Apple https://developer.apple.com/forums/thread/113742 where Apple suggest the way to let the macOS handle the crash is to:
> > "unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
> > That's how Apple suggest we do it for Mac.
> 
> That is a blog by an Apple developer giving some very general advice,
> and IMO lacking in some necessary detail. That quote above is in the
> context of answering:
> 
> "Finally, there's the question of how to exit from your signal handler."
> 
> The suggestion to "then return" hits UB for the synchronous error
> signals - a fact not mentioned in the blog entry. The assertion that:
> 
> "This will cause the crashed process to continue execution, crash again,
> ... "
> 
> is a naive oversimplification. If you just seg-faulted doing a read from
> memory how can you continue execution?

My understanding is that we would not continue execution past the faulting instruction, but instead resume at it (with the same memory and register contents, unless our signal handler modified any of them), which would raise the same signal in the exact same frame, resulting in the exact same behavior. That's what my experimentation shows, and what I understood Apple's recommendation to be based on.

> What does that mean when the read
> yielded no value? Will you just continue with a random value? Will the
> system try to re-execute the read and so crash again? Maybe it will
> crash again, maybe it won't. Maybe it will do something in the meantime
> that leads to totally unexpected behaviour (as Thomas previously
> described). Hence my suggestion that if you are going to attempt this
> path for macOS then you need to introduce the second crash so we know
> exactly what will happen.

But that will show up as a different crash and might be confusing.

> Returning from the original signal handler is
> not an option IMO.

I think our differences of opinion all hinge on what happens when code returns from its signal handler:

#1 Does it resume and actually redo the exact same instruction (which this time may succeed)?
#2 Does it resume and raise the exact same signal (exhibiting the exact same behavior as the original)?

You and Thomas seem to believe that it's #1; I thought (based on https://developer.apple.com/forums/thread/113742 ) that it was more like #2.

I will continue this investigation in JDK-8237727.

Here I will not be as ambitious and will simply fix the problem at hand, i.e. address the 2-minute hang by disabling the option on macOS and Linux.

-------------

PR: https://git.openjdk.java.net/jdk/pull/813