RFR(XS): 8188872: runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails intermittently

Thu May 30 13:40:41 UTC 2019

On 5/30/19 1:09 AM, David Holmes wrote:
> Hi Dan,
>
> Reading this email and the bug report really made my head hurt.

Yes, I can imagine. I've been working this bug off and on for
many months.

> I'm just left bemused. Are we really fixing something here or have we 
> written a test that is just trying to be far too clever?

Yes, we are fixing something things in the error reporting timeout
detection code. The fixes make that code more robust. In this case,
the feature is clever so the test itself had to be clever in order
to test the feature.

> That said your changes seem relatively innocuous.

Thanks! I'll take that as a 'thumbs up'.

> One aspect that I can't work through is what is supposed to happen 
> when the error reporting thread is sent the SIGILL. What does the 
> signal handler do in response to that? If it returns then we will stay 
> in the infinite_sleep. If it jumps back into error reporting then 
> we're now potentially in nested signal-handling contexts - in which 
> case if the WatcherThread tries to interrupt us again (because we're 
> still not responsive enough) won't the SIGILL be blocked because the 
> handler is still active (unless we explicitly unblock it in the 
> handler) ?

The SIGILL causes an error handler invocation on top of where the error
reporting thread is "stuck". That error handler invocation causes another
report() call which picks up at the next "STEP". In the baseline code,
that SIGILL is sent once per second which means if your system is
overloaded and it takes the current report() call longer than 1 second
to finish, it gets interrupted. You end up with report() call on top
of report() call on top of report() call. So no the SIGILL doesn't seem
to be blocked.

Have you run into an hs_err_pid file where the calling stack was just
a bunch of report() calls? In the test, we limit ERROR_LOG_TIMEOUT to
16 seconds so there would be at most 16 report() calls on the stack
before the total timeout is reached. The default for the ErrorLogTimeout
option is 120 seconds. So if we hit an ErrorLogTimeout, there can be
up to 120 report() calls on the stack.

Thanks for the review.

Dan

>
> Thanks,
> David
>
> On 29/05/2019 10:12 am, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I have a fix for the following longstanding bug:
>>
>>      JDK-8188872 
>> runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails 
>> intermittently
>>      https://bugs.openjdk.java.net/browse/JDK-8188872
>>
>> I've include Thomas Stüfe directly since I'm modifying his code...
>>
>> This fix include changes to the error handling code, the VM parts
>> of the test (-XX:+TestUnresponsiveErrorHandler) and the test itself.
>> The changes themselves are small, but the reasons are complicated
>> so a detailed explanation is required.
>>
>> Summary of the changes:
>>
>> - src/hotspot/share/utilities/vmError.cpp
>>    - add VMError::clear_step_start_time() and call it from the
>>      error reporting END macro.
>>      - VMError::report() is called twice: first, to generate a summary
>>        for stdout and second, to generate hs_err_pid output.
>>      - Adding clear_step_start_time() prevents 
>> interrupt_reporting_thread()
>>        from interrupting the error reporting thread between the two 
>> calls to
>>        VMError::report().
>>      - This solves the problem where hs_err_pid file creation gets 
>> interrupted
>>        and the hs_err_pid file ends up being created in 
>> /tmp/hs_err_pid...
>>    - add a STEP in VMError::report() for setting up the 'start time' for
>>      the TestUnresponsiveErrorHandler test
>>      - There is a corresponding change in VMError::report_and_die() that
>>        skips the call to record_reporting_start_time() when we are
>>        executing TestUnresponsiveErrorHandler.
>>      - This solves the problem where the error reporting thread is 
>> exposed
>>        to interrupt_reporting_thread() calls before it has reached the
>>        first STEP in VMError::report().
>>    - change VMError::check_timeout() to only call 
>> interrupt_reporting_thread()
>>      once per timeout detection for either a total reporting timeout 
>> or a
>>      step timeout:
>>      - check_timeout() is called by the WatcherThread once per second 
>> once
>>        it determines that errror reporting has started. This change 
>> solves
>>        the problem where a timeout is detected, the error reporting 
>> thread
>>        takes longer than a second to do its work so the WatcherThread 
>> calls
>>        check_timeout() (and interrupt_reporting_thread()) again which
>>        restarts the STEP we were on from the beginning.
>> - src/hotspot/share/utilities/vmError.hpp
>>    - add clear_step_start_time()
>> - 
>> test/hotspot/jtreg/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>    - add support for '-Dverbose=true' to get more verbose test output
>>    - Default ERROR_LOG_TIMEOUT is 16 seconds; Solaris sets it to 3X.
>>    - dump the cmd output if we can't find the 'hs_err_pid' file
>>    - dump the cmd output if we can't open the 'hs_err_pid' file
>>    - dump the hs_err_pid file if we fail to match the patterns
>>
>> Webrev URL: 
>> http://cr.openjdk.java.net/~dcubed/8188872-webrev/0-for-jdk-jdk13/
>>
>> Testing: Mach5 Tier[1-5]
>>           Included the fix in my latest round of 8153224 testing
>>           on Solaris-X64 where this bug reproduces quite a bit.
>>
>> Thanks, in advance, for any comments, suggestions, or questions.
>>
>> Dan