RFR(XS): 8188872: runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails intermittently

Fri May 31 08:04:49 UTC 2019

Hi Dan,

On 30/05/2019 11:40 pm, Daniel D. Daugherty wrote:
> On 5/30/19 1:09 AM, David Holmes wrote:
>> Hi Dan,
>>
>> Reading this email and the bug report really made my head hurt.
> 
> Yes, I can imagine. I've been working this bug off and on for
> many months.
> 
> 
>> I'm just left bemused. Are we really fixing something here or have we 
>> written a test that is just trying to be far too clever?
> 
> Yes, we are fixing something things in the error reporting timeout
> detection code. The fixes make that code more robust. In this case,
> the feature is clever so the test itself had to be clever in order
> to test the feature.
> 
> 
>> That said your changes seem relatively innocuous.
> 
> Thanks! I'll take that as a 'thumbs up'.

Yes :)

> 
>> One aspect that I can't work through is what is supposed to happen 
>> when the error reporting thread is sent the SIGILL. What does the 
>> signal handler do in response to that? If it returns then we will stay 
>> in the infinite_sleep. If it jumps back into error reporting then 
>> we're now potentially in nested signal-handling contexts - in which 
>> case if the WatcherThread tries to interrupt us again (because we're 
>> still not responsive enough) won't the SIGILL be blocked because the 
>> handler is still active (unless we explicitly unblock it in the 
>> handler) ?
> 
> The SIGILL causes an error handler invocation on top of where the error
> reporting thread is "stuck". That error handler invocation causes another
> report() call which picks up at the next "STEP". In the baseline code,
> that SIGILL is sent once per second which means if your system is
> overloaded and it takes the current report() call longer than 1 second
> to finish, it gets interrupted. You end up with report() call on top
> of report() call on top of report() call. So no the SIGILL doesn't seem
> to be blocked.

That is strange because as far as I can see it should be blocked unless 
we explicitly unblock it, or arrange for it not to be blocked in the 
first place.

If I get some spare time (ROTFL) I'll try to investigate further.

Thanks,
David
-----

> Have you run into an hs_err_pid file where the calling stack was just
> a bunch of report() calls? In the test, we limit ERROR_LOG_TIMEOUT to
> 16 seconds so there would be at most 16 report() calls on the stack
> before the total timeout is reached. The default for the ErrorLogTimeout
> option is 120 seconds. So if we hit an ErrorLogTimeout, there can be
> up to 120 report() calls on the stack.
> 
> Thanks for the review.
> 
> Dan
> 
>>
>> Thanks,
>> David
>>
>> On 29/05/2019 10:12 am, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I have a fix for the following longstanding bug:
>>>
>>>      JDK-8188872 
>>> runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails 
>>> intermittently
>>>      https://bugs.openjdk.java.net/browse/JDK-8188872
>>>
>>> I've include Thomas Stüfe directly since I'm modifying his code...
>>>
>>> This fix include changes to the error handling code, the VM parts
>>> of the test (-XX:+TestUnresponsiveErrorHandler) and the test itself.
>>> The changes themselves are small, but the reasons are complicated
>>> so a detailed explanation is required.
>>>
>>> Summary of the changes:
>>>
>>> - src/hotspot/share/utilities/vmError.cpp
>>>    - add VMError::clear_step_start_time() and call it from the
>>>      error reporting END macro.
>>>      - VMError::report() is called twice: first, to generate a summary
>>>        for stdout and second, to generate hs_err_pid output.
>>>      - Adding clear_step_start_time() prevents 
>>> interrupt_reporting_thread()
>>>        from interrupting the error reporting thread between the two 
>>> calls to
>>>        VMError::report().
>>>      - This solves the problem where hs_err_pid file creation gets 
>>> interrupted
>>>        and the hs_err_pid file ends up being created in 
>>> /tmp/hs_err_pid...
>>>    - add a STEP in VMError::report() for setting up the 'start time' for
>>>      the TestUnresponsiveErrorHandler test
>>>      - There is a corresponding change in VMError::report_and_die() that
>>>        skips the call to record_reporting_start_time() when we are
>>>        executing TestUnresponsiveErrorHandler.
>>>      - This solves the problem where the error reporting thread is 
>>> exposed
>>>        to interrupt_reporting_thread() calls before it has reached the
>>>        first STEP in VMError::report().
>>>    - change VMError::check_timeout() to only call 
>>> interrupt_reporting_thread()
>>>      once per timeout detection for either a total reporting timeout 
>>> or a
>>>      step timeout:
>>>      - check_timeout() is called by the WatcherThread once per second 
>>> once
>>>        it determines that errror reporting has started. This change 
>>> solves
>>>        the problem where a timeout is detected, the error reporting 
>>> thread
>>>        takes longer than a second to do its work so the WatcherThread 
>>> calls
>>>        check_timeout() (and interrupt_reporting_thread()) again which
>>>        restarts the STEP we were on from the beginning.
>>> - src/hotspot/share/utilities/vmError.hpp
>>>    - add clear_step_start_time()
>>> - 
>>> test/hotspot/jtreg/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>    - add support for '-Dverbose=true' to get more verbose test output
>>>    - Default ERROR_LOG_TIMEOUT is 16 seconds; Solaris sets it to 3X.
>>>    - dump the cmd output if we can't find the 'hs_err_pid' file
>>>    - dump the cmd output if we can't open the 'hs_err_pid' file
>>>    - dump the hs_err_pid file if we fail to match the patterns
>>>
>>> Webrev URL: 
>>> http://cr.openjdk.java.net/~dcubed/8188872-webrev/0-for-jdk-jdk13/
>>>
>>> Testing: Mach5 Tier[1-5]
>>>           Included the fix in my latest round of 8153224 testing
>>>           on Solaris-X64 where this bug reproduces quite a bit.
>>>
>>> Thanks, in advance, for any comments, suggestions, or questions.
>>>
>>> Dan
>