RFR: JDK-8303861: Error handling step timeouts should never be blocked by OnError and others [v2]

David Holmes dholmes at openjdk.org
Fri Mar 10 06:40:05 UTC 2023


On Thu, 9 Mar 2023 10:09:28 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> Fatal error handling is subject to several timeouts:
>> - a global timeout (controlled via ErrorLogTimeout)
>> - local error reporting step timeouts.
>> 
>> The latter aims to "give the JVM a kick" if it gets stuck in one particular place during error reporting. This prevents one error reporting step from hogging all the time allotted to error reporting under ErrorLogTimeout.
>> 
>> There are three situations where we currently suppress the global error timeout:
>> - if the JVM is embedded and the launcher has its abort hook installed. Obviously, that must be allowed to run.
>> - if the user specified one or more OnError commands to run, and these have not yet run. These must have a chance to run unmolested.
>> - if the user (typically a developer) specified ShowMessageBoxOnError, and the error box has not yet been shown.
>> 
>> There is a bug, though: the step timeout is also prevented from firing if any of these conditions is true. That is plain wrong.
>> 
>> In addition, the interval at which the WatcherThread checks for timeouts should be decreased. It currently sits at 1 second, which is too coarse-grained.
>> 
>> --------
>> 
>> Patch:
>> - reworks `VMError::check_timeout()` to never block step timeouts
>> - adds clarifying comments
>> - quadruples the frequency at which the watcher thread checks for timeouts
>> - adds regression test for timeout handling with OnError
>> - additionally limits the timeout for each individual error reporting step to at most 5 seconds. 5 seconds is usually enough to distinguish a slow error reporting step from one that is hanging indefinitely.
>> 
>> Tested locally on Linux x64.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
> 
>   limit step timeout to 5 seconds max

Changes seem fine. Thanks for the clear explanation.
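
For readers of the archive, the behaviour described in the quoted text can
be sketched roughly as follows. This is a simplified, standalone model and
not the actual VMError::check_timeout() code: the struct, the enum and all
field and function names are hypothetical, and how the per-step limit is
derived is not shown. Only the rules stated in the description are modelled:
the three conditions suppress the global timeout only, the step timeout is
never suppressed, and the step limit is capped at 5 seconds.

```cpp
#include <cstdint>

// Hypothetical snapshot of the state a timeout check would consult.
struct ErrorReportState {
    int64_t now_ms;              // current time
    int64_t reporting_start_ms;  // when error reporting began
    int64_t step_start_ms;       // start of the current step, or -1 if none is running
    int64_t global_timeout_ms;   // models ErrorLogTimeout in milliseconds; 0 disables it
    int64_t step_timeout_ms;     // per-step limit, expected to be capped already
    bool abort_hook_installed;   // embedding launcher installed an abort hook
    bool on_error_pending;       // OnError commands specified but not yet run
    bool message_box_pending;    // ShowMessageBoxOnError set, box not yet shown
};

enum class TimeoutAction { None, InterruptStep, AbortReporting };

// Cap taken from the description: at most 5 seconds per reporting step.
constexpr int64_t kMaxStepTimeoutMs = 5000;

inline int64_t capped_step_timeout_ms(int64_t requested_ms) {
    return requested_ms < kMaxStepTimeoutMs ? requested_ms : kMaxStepTimeoutMs;
}

TimeoutAction check_timeout(const ErrorReportState& s) {
    // Only the global timeout is subject to suppression.
    const bool suppress_global =
        s.abort_hook_installed || s.on_error_pending || s.message_box_pending;

    if (!suppress_global && s.global_timeout_ms > 0 &&
        s.now_ms - s.reporting_start_ms >= s.global_timeout_ms) {
        return TimeoutAction::AbortReporting;
    }

    // The step timeout is checked unconditionally, so a hanging step is
    // interrupted even while OnError or the message box is still pending.
    if (s.step_start_ms >= 0 && s.step_timeout_ms > 0 &&
        s.now_ms - s.step_start_ms >= s.step_timeout_ms) {
        return TimeoutAction::InterruptStep;
    }

    return TimeoutAction::None;
}
```

In this model, a pending OnError command keeps the global timeout from
firing but does not stop a stuck reporting step from being interrupted,
which is the distinction the bug had blurred.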

src/hotspot/share/runtime/nonJavaThread.cpp line 274:

> 272: 
> 273:         // Wait a second, then recheck for timeout.
> 274:         os::naked_short_sleep(999);

Harmless change, but I don't see why we need sub-second resolution when ErrorLogTimeout is specified in seconds?

-------------

Marked as reviewed by dholmes (Reviewer).

PR: https://git.openjdk.org/jdk/pull/12936

