RFR(XS): 8188872: runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails intermittently
Daniel D. Daugherty
daniel.daugherty at oracle.com
Fri May 31 13:38:08 UTC 2019
On 5/31/19 4:04 AM, David Holmes wrote:
> Hi Dan,
>
> On 30/05/2019 11:40 pm, Daniel D. Daugherty wrote:
>> On 5/30/19 1:09 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> Reading this email and the bug report really made my head hurt.
>>
>> Yes, I can imagine. I've been working this bug off and on for
>> many months.
>>
>>
>>> I'm just left bemused. Are we really fixing something here or have
>>> we written a test that is just trying to be far too clever?
>>
>> Yes, we are fixing something things in the error reporting timeout
>> detection code. The fixes make that code more robust. In this case,
>> the feature is clever so the test itself had to be clever in order
>> to test the feature.
>>
>>
>>> That said your changes seem relatively innocuous.
>>
>> Thanks! I'll take that as a 'thumbs up'.
>
> Yes :)
Thanks.
>
>>
>>> One aspect that I can't work through is what is supposed to happen
>>> when the error reporting thread is sent the SIGILL. What does the
>>> signal handler do in response to that? If it returns then we will
>>> stay in the infinite_sleep. If it jumps back into error reporting
>>> then we're now potentially in nested signal-handling contexts - in
>>> which case if the WatcherThread tries to interrupt us again (because
>>> we're still not responsive enough) won't the SIGILL be blocked
>>> because the handler is still active (unless we explicitly unblock it
>>> in the handler) ?
>>
>> The SIGILL causes an error handler invocation on top of where the error
>> reporting thread is "stuck". That error handler invocation causes
>> another
>> report() call which picks up at the next "STEP". In the baseline code,
>> that SIGILL is sent once per second which means if your system is
>> overloaded and it takes the current report() call longer than 1 second
>> to finish, it gets interrupted. You end up with report() call on top
>> of report() call on top of report() call. So no the SIGILL doesn't seem
>> to be blocked.
>
> That is strange because as far as I can see it should be blocked
> unless we explicitly unblock it, or arrange for it not to be blocked
> in the first place.
I tested/debugged this on Solaris-X64 and my technique was to put tracing
messages in the error handling code. In particular I put a tracing mesg
before SIGINT was sent and another in the error handler where SIGINT was
caught. Like clock work, I got one ping per second when the Solaris-X64
box was loaded down with a full build...
> If I get some spare time (ROTFL) I'll try to investigate further.
I have my originally debugging/tracing code squirreled away if you
happen to find time and want it...
Again, thanks for the review.
Dan
>
> Thanks,
> David
> -----
>
>> Have you run into an hs_err_pid file where the calling stack was just
>> a bunch of report() calls? In the test, we limit ERROR_LOG_TIMEOUT to
>> 16 seconds so there would be at most 16 report() calls on the stack
>> before the total timeout is reached. The default for the ErrorLogTimeout
>> option is 120 seconds. So if we hit an ErrorLogTimeout, there can be
>> up to 120 report() calls on the stack.
>>
>> Thanks for the review.
>>
>> Dan
>>
>>>
>>> Thanks,
>>> David
>>>
>>> On 29/05/2019 10:12 am, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have a fix for the following longstanding bug:
>>>>
>>>> JDK-8188872
>>>> runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails
>>>> intermittently
>>>> https://bugs.openjdk.java.net/browse/JDK-8188872
>>>>
>>>> I've include Thomas Stüfe directly since I'm modifying his code...
>>>>
>>>> This fix include changes to the error handling code, the VM parts
>>>> of the test (-XX:+TestUnresponsiveErrorHandler) and the test itself.
>>>> The changes themselves are small, but the reasons are complicated
>>>> so a detailed explanation is required.
>>>>
>>>> Summary of the changes:
>>>>
>>>> - src/hotspot/share/utilities/vmError.cpp
>>>> - add VMError::clear_step_start_time() and call it from the
>>>> error reporting END macro.
>>>> - VMError::report() is called twice: first, to generate a summary
>>>> for stdout and second, to generate hs_err_pid output.
>>>> - Adding clear_step_start_time() prevents
>>>> interrupt_reporting_thread()
>>>> from interrupting the error reporting thread between the two
>>>> calls to
>>>> VMError::report().
>>>> - This solves the problem where hs_err_pid file creation gets
>>>> interrupted
>>>> and the hs_err_pid file ends up being created in
>>>> /tmp/hs_err_pid...
>>>> - add a STEP in VMError::report() for setting up the 'start
>>>> time' for
>>>> the TestUnresponsiveErrorHandler test
>>>> - There is a corresponding change in VMError::report_and_die()
>>>> that
>>>> skips the call to record_reporting_start_time() when we are
>>>> executing TestUnresponsiveErrorHandler.
>>>> - This solves the problem where the error reporting thread is
>>>> exposed
>>>> to interrupt_reporting_thread() calls before it has reached the
>>>> first STEP in VMError::report().
>>>> - change VMError::check_timeout() to only call
>>>> interrupt_reporting_thread()
>>>> once per timeout detection for either a total reporting
>>>> timeout or a
>>>> step timeout:
>>>> - check_timeout() is called by the WatcherThread once per
>>>> second once
>>>> it determines that errror reporting has started. This change
>>>> solves
>>>> the problem where a timeout is detected, the error reporting
>>>> thread
>>>> takes longer than a second to do its work so the
>>>> WatcherThread calls
>>>> check_timeout() (and interrupt_reporting_thread()) again which
>>>> restarts the STEP we were on from the beginning.
>>>> - src/hotspot/share/utilities/vmError.hpp
>>>> - add clear_step_start_time()
>>>> -
>>>> test/hotspot/jtreg/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>>
>>>> - add support for '-Dverbose=true' to get more verbose test output
>>>> - Default ERROR_LOG_TIMEOUT is 16 seconds; Solaris sets it to 3X.
>>>> - dump the cmd output if we can't find the 'hs_err_pid' file
>>>> - dump the cmd output if we can't open the 'hs_err_pid' file
>>>> - dump the hs_err_pid file if we fail to match the patterns
>>>>
>>>> Webrev URL:
>>>> http://cr.openjdk.java.net/~dcubed/8188872-webrev/0-for-jdk-jdk13/
>>>>
>>>> Testing: Mach5 Tier[1-5]
>>>> Included the fix in my latest round of 8153224 testing
>>>> on Solaris-X64 where this bug reproduces quite a bit.
>>>>
>>>> Thanks, in advance, for any comments, suggestions, or questions.
>>>>
>>>> Dan
>>
More information about the hotspot-runtime-dev
mailing list