RFR: JDK-8303861: Error handling step timeouts should never be blocked by OnError and others [v2]
Roman Kennke
rkennke at openjdk.org
Fri Mar 10 15:01:18 UTC 2023
On Thu, 9 Mar 2023 10:09:28 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> Fatal error handling is subject to several timeouts:
>> - a global timeout (controlled via ErrorLogTimeout)
>> - local error reporting step timeouts.
>>
>> The latter aims to "give the JVM a kick" if it gets stuck in one particular place during error reporting. This prevents one error reporting step from hogging all the time allotted to error reporting under ErrorLogTimeout.
>>
>> There are three situations where atm we suppress the global error timeout:
>> - if the JVM is embedded and the launcher has its abort hook installed. Obviously, that must be allowed to run.
>> - if the user specified one or more OnError commands to run, and these did not yet run. These must have a chance to run unmolested.
>> - if the user (typically developer) specified ShowMessageBoxOnError, and the error box has not yet been shown
>>
>> There is a bug though, that also prevents the step timeout from firing if either condition is true. That is plain wrong.
>>
>> In addition to that, the test interval WatcherThread uses to check for timeouts should be decreased. It sits at 1 second, which is too coarse-grained.
>>
>> --------
>>
>> Patch:
>> - reworks `VMError::check_timeout()` to never block step timeouts
>> - adds clarifying comments
>> - quadruples timeout check frequency by watcher thread
>> - adds regression test for timeout handling with OnError
>> - additionally limits timeout per individual error reporting step to 5 seconds. 5 seconds is usually enough to distinguish a slow error reporting step from one that is endlessly hanging.
>>
>> Tested locally on Linux x64.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
>
> limit step timeout to 5 seconds max
Looks good to me. I only have 2 comments. If you change this, I don't need another review.
Thank you!
Roman
src/hotspot/share/utilities/vmError.cpp line 1787:
> 1785: // Global timeout hit?
> 1786: if (!ignore_global_timeout) {
> 1787: const jlong reporting_start_time_l = get_reporting_start_time();
What is the meaning of _l here? Is it to indicate the type of the variable? If so, I would suggest to remove it. (I know it is pre-existing. I leave this up to you.)
test/hotspot/jtreg/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java line 40:
> 38:
> 39: /*
> 40: * @test TimeoutInErrorHandlingTest-default
Isn't the better way to name this just '@test TimeoutInErrorHandlingTest' and then '@id=default' and '@id=with-on-error' ?
-------------
Marked as reviewed by rkennke (Reviewer).
PR: https://git.openjdk.org/jdk/pull/12936
More information about the hotspot-dev
mailing list