RFR: 8250637: UseOSErrorReporting times out (on Mac and Linux)

David Holmes dholmes at openjdk.java.net
Mon Oct 26 04:35:37 UTC 2020


On Thu, 22 Oct 2020 16:40:43 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:

> hi all,
> 
> Please review this fix for POSIX platforms, which addresses a time out that occurs while handling a crash with UseOSErrorReporting ON
> 
> The timeout was caused by the crash handling code, looping infinitively, because it incorrectly assumed that the signal handlers were reset to their defaults, while what really was happening was that the code was resetting the signal handlers to our default signal handler.
> 
> To avoid similar confusion in the future I did the following:
> 
> - renamed the VMError::reset_signal_handlers() to VMError:: rearm_signal_handlers()
> - introduced a new API VMError::clear_signal_handlers() which is implemented in PosixSignals
> 
> PosixSignals::clear_signal_handlers() is where the actual fix is done and it simply resets all signal handlers to their system defaults.
> 
> A similar problem occurs on Windows, with the only difference being that before a process times out (takes 2 minutes) it runs out of stack space in about 250 loops, so that's the only reason it doesn't linger for that long. Windows issue is tracked separately by https://bugs.openjdk.java.net/browse/JDK-8250782
> 
> Note: The expectation for "UseOSErrorReporting" is for the OS to catch the crashed process and to produce its own crash log (in addition to Hotspot creating hs_err log file) - see https://bugs.openjdk.java.net/browse/JDK-8237727 for relevant discussion. It does not affect whether core dump is written or not (that is controlled by CreateCoredumpOnCrash)

Hi Gerard,

I think we have a fundamental problem here that UseOSErrorReporting was only ever intended for use on Windows. It simply allows VMError::report_and_die to return instead of actually making the VM "die". For Windows this means we can continue to propagate the windows exception and thus allow Windows Error Reporting (WER) to take over. Whether this actually works correctly or not is a different matter.

For non-Windows there is no pre-established alternative code path for report_and_die() returning.

In the bug report you write:

> On Mac/Linux it would look more like this:
> 
> #1 catch signal in our handler
> #2 generate hs_err log
> #3 turn off our signal handler
> #4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated 
> 

To me you are now inventing what UseOSErrorReporting should mean on non-Windows, and I don't agree with it. I don't think it should mean that we re-crash using the "default" signal response and consider that as using "OS error reporting". To me that is just not valid, especially when we cannot return from a signal handling context in many cases without incurring undefined behaviour. To me #4 is not a valid expectation as we have no way to know what will happen next if the signal handler returns. It would also be wrong to just continue execution after an assertion or guarantee fails.

I'm assuming that the motivation here is that on macOS if we use the default signal handling modes then macOS will do its own error reporting? If so I would suggest that the right response may be to return from report_and_die (on macOS only) and then deliberately crash after restoring the default handler. Obviously that will change which "crash" the OS reports but that is likely to happen anyway as you cannot guarantee how you will crash after trying to continue (and this goes beyond our general "best effort" approaches in signal handling.)

Beyond that I share Thomas's concerns about making sweeping changes to installed signal handlers.

So my preferred approaches here would be:

1. Make UseOSErrorReporting Windows only; or
2.  Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.

Thanks,
David

-------------

PR: https://git.openjdk.java.net/jdk/pull/813


More information about the hotspot-dev mailing list