RFR: JDK-8303861: Error handling step timeouts should never be blocked by OnError and others [v2]
David Holmes
dholmes at openjdk.org
Fri Mar 10 06:40:05 UTC 2023
On Thu, 9 Mar 2023 10:09:28 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> Fatal error handling is subject to several timeouts:
>> - a global timeout (controlled via ErrorLogTimeout)
>> - local error reporting step timeouts.
>>
>> The latter aims to "give the JVM a kick" if it gets stuck in one particular place during error reporting. This prevents one error reporting step from hogging all the time allotted to error reporting under ErrorLogTimeout.
>>
>> There are three situations where atm we suppress the global error timeout:
>> - if the JVM is embedded and the launcher has its abort hook installed. Obviously, that must be allowed to run.
>> - if the user specified one or more OnError commands to run, and these did not yet run. These must have a chance to run unmolested.
>> - if the user (typically developer) specified ShowMessageBoxOnError, and the error box has not yet been shown
>>
>> There is a bug though, that also prevents the step timeout from firing if either condition is true. That is plain wrong.
>>
>> In addition to that, the test interval WatcherThread uses to check for timeouts should be decreased. It sits at 1 second, which is too coarse-grained.
>>
>> --------
>>
>> Patch:
>> - reworks `VMError::check_timeout()` to never block step timeouts
>> - adds clarifying comments
>> - quadruples timeout check frequency by watcher thread
>> - adds regression test for timeout handling with OnError
>> - additionally limits timeout per individual error reporting step to 5 seconds. 5 seconds is usually enough to distinguish a slow error reporting step from one that is endlessly hanging.
>>
>> Tested locally on Linux x64.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
>
> limit step timeout to 5 seconds max
Changes seem fine. Thanks for the clear explanation.
src/hotspot/share/runtime/nonJavaThread.cpp line 274:
> 272:
> 273: // Wait a second, then recheck for timeout.
> 274: os::naked_short_sleep(999);
Harmless change but I don't see why we need sub-second resolution when the ErrorLogTimeout is in seconds. ??
-------------
Marked as reviewed by dholmes (Reviewer).
PR: https://git.openjdk.org/jdk/pull/12936
More information about the hotspot-dev
mailing list