RFR: 8250637: UseOSErrorReporting times out (on Mac and Linux)

Wed Oct 28 07:44:48 UTC 2020

<trimming>

On 28/10/2020 2:08 am, Gerard Ziemski wrote:
> On Mon, 26 Oct 2020 15:32:49 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:
>>> On Windows we forward the crash to the OS to handle it. In this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, but that does not mean it's not a meaningful way to forward the crash to the OS, if that's how the OS wants it. Please see this technical note from Apple, https://developer.apple.com/forums/thread/113742, where Apple suggests that the way to let macOS handle the crash is to:
>>> "unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
>>> That's how Apple suggests we do it for Mac.
>>
>> That is a blog by an Apple developer giving some very general advice,
>> and IMO lacking in some necessary detail. That quote above is in the
>> context of answering:
>>
>> "Finally, there's the question of how to exit from your signal handler."
>>
>> The suggestion to "then return" hits UB for the synchronous error
>> signals - a fact not mentioned in the blog entry. The assertion that:
>>
>> "This will cause the crashed process to continue execution, crash again,
>> ... "
>>
>> is a naive oversimplification. If you just seg-faulted doing a read from
>> memory how can you continue execution?
> 
> My understanding is that we would not continue execution past the seg-faulting instruction, but instead resume at that instruction (with the same memory/register contents, unless our signal handler modified any of that), which would cause the same signal to be raised at the exact same frame, resulting in the exact same behavior. That's what my experimentation shows and what I understood Apple's recommendation to be based on.
> 
>>   What does that mean when the read
>> yielded no value? Will you just continue with a random value? Will the
>> system try to re-execute the read and so crash again? Maybe it will
>> crash again, maybe it won't. Maybe it will do something in the meantime
>> that leads to totally unexpected behaviour (as Thomas previously
>> described). Hence my suggestion that if you are going to attempt this
>> path for macOS then you need to introduce the second crash so we know
>> exactly what will happen.
> 
> But that will show up as a different crash and might be confusing.
> 
>> Returning from the original signal handler is
>> not an option IMO.
> 
> I think our differences of opinion all hinge on what happens when code returns from its signal handler:
> 
> #1 Does it resume and actually redo the exact same instruction (which this time may succeed)?
> #2 Does it resume and raise the exact same signal (exhibiting the exact same behavior as the original)?
> 
> You and Thomas seem to believe that it's #1, I thought (based on https://developer.apple.com/forums/thread/113742 ) that it was more like #2.

My position was based purely on the POSIX specification that returning 
from a signal handler, for specific signals, leads to undefined 
behaviour. I had overlooked (thanks Thomas for flagging it!) the fact 
that we already utilise returning normally from signal handlers for a 
range of things - safepoint/handshake polls; implicit null pointer checks.

So I was looking for something more definitive from macOS that things 
would work as you suggest. And the sigaction manpage does seem to 
suggest that:

"The call to the handler is arranged so that if the signal handling 
routine returns normally the process will resume execution in the 
context from before the signal's delivery."

So, as Thomas discusses, the issue is not whether #1 or #2 is correct, as 
they both are; it just depends on the exact context of the original 
signal whether re-executing the failed instruction will fail again, or 
whether it could succeed. While I can imagine general scenarios where 
the instruction could now succeed, I don't know how realistic they are 
in the JVM context.
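
For concreteness, the behaviour under discussion can be sketched in C as 
below. This is only a sketch under stated assumptions, not the HotSpot 
code: the names crash_handler, child_body and crash_via_default_action 
are hypothetical, and the fault is provoked by a store to a PROT_NONE 
page in a forked child so that the caller survives to inspect the 
outcome. The handler records that it ran, resets the disposition to 
SIG_DFL and returns; the store is then re-executed in the same context.

```c
#define _DEFAULT_SOURCE  /* exposes sigaction/MAP_ANON under strict -std= on glibc */
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;  /* write end of a pipe, set in the child */

/* Handler: note that we ran, restore the default action, and return. */
static void crash_handler(int sig) {
    const char mark = 'h';
    ssize_t n = write(report_fd, &mark, 1);  /* async-signal-safe */
    (void)n;
    signal(sig, SIG_DFL);                    /* also async-signal-safe */
}

/* Child: install the handler, then store to an inaccessible page. */
static void child_body(int fd) {
    struct sigaction sa;
    report_fd = fd;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);  /* some platforms report SIGBUS here */

    volatile char *p = mmap(NULL, 4096, PROT_NONE,
                            MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) _exit(2);
    p[0] = 1;   /* faults; re-executed after the handler returns */
    _exit(0);   /* reached only if the retried store succeeded */
}

/* Returns the signal that killed the child (0 if it exited normally,
   -1 on setup failure). *handler_runs receives the number of times
   the handler ran before the child died. */
int crash_via_default_action(int *handler_runs) {
    int fds[2], status = 0, runs = 0;
    char buf[8];
    ssize_t n;

    if (pipe(fds) != 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        close(fds[0]);
        child_body(fds[1]);  /* never returns */
    }
    close(fds[1]);
    /* Read until EOF, i.e. until the child is gone. */
    while ((n = read(fds[0], buf, sizeof buf)) > 0) runs += (int)n;
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0) return -1;
    *handler_runs = runs;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The expected outcome is one handler invocation followed by termination 
from the same signal (SIGSEGV on Linux; macOS may report SIGBUS for a 
protection fault, hence both are handled). A case where the retried 
instruction succeeds would instead show up as a return value of 0.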

> I will continue this investigation in JDK-8237727
> 
> Here I will not be as ambitious and will simply fix the problem at hand, i.e. address the 2-minute hang by disabling the option for macOS and Linux.

Okay.

Thanks,
David
-----

> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/813
>