RFR(XS): 8188872: runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails intermittently

Mon Jun 3 22:09:28 UTC 2019

Hi Thomas,

On 3/06/2019 10:31 pm, Thomas Stüfe wrote:
> Hi David,
> 
> On Fri, May 31, 2019 at 10:04 AM David Holmes <david.holmes at oracle.com 
> <mailto:david.holmes at oracle.com>> wrote:
> 
>     Hi Dan,
> 
>     On 30/05/2019 11:40 pm, Daniel D. Daugherty wrote:
>      > On 5/30/19 1:09 AM, David Holmes wrote:
>      >> Hi Dan,
>      >>
>      >> Reading this email and the bug report really made my head hurt.
>      >
>      > Yes, I can imagine. I've been working this bug off and on for
>      > many months.
>      >
>      >
>      >> I'm just left bemused. Are we really fixing something here or
>     have we
>      >> written a test that is just trying to be far too clever?
>      >
>      > Yes, we are fixing something things in the error reporting timeout
>      > detection code. The fixes make that code more robust. In this case,
>      > the feature is clever so the test itself had to be clever in order
>      > to test the feature.
>      >
>      >
>      >> That said your changes seem relatively innocuous.
>      >
>      > Thanks! I'll take that as a 'thumbs up'.
> 
>     Yes :)
> 
>      >
>      >> One aspect that I can't work through is what is supposed to happen
>      >> when the error reporting thread is sent the SIGILL. What does the
>      >> signal handler do in response to that? If it returns then we
>     will stay
>      >> in the infinite_sleep. If it jumps back into error reporting then
>      >> we're now potentially in nested signal-handling contexts - in which
>      >> case if the WatcherThread tries to interrupt us again (because
>     we're
>      >> still not responsive enough) won't the SIGILL be blocked because
>     the
>      >> handler is still active (unless we explicitly unblock it in the
>      >> handler) ?
>      >
>      > The SIGILL causes an error handler invocation on top of where the
>     error
>      > reporting thread is "stuck". That error handler invocation causes
>     another
>      > report() call which picks up at the next "STEP". In the baseline
>     code,
>      > that SIGILL is sent once per second which means if your system is
>      > overloaded and it takes the current report() call longer than 1
>     second
>      > to finish, it gets interrupted. You end up with report() call on top
>      > of report() call on top of report() call. So no the SIGILL
>     doesn't seem
>      > to be blocked.
> 
>     That is strange because as far as I can see it should be blocked unless
>     we explicitly unblock it, or arrange for it not to be blocked in the
>     first place.
> 
> 
> We explicitly unblock synchronous error signals both in the primary and 
> secondary signal handler (see e.g. vmError_posix.cpp, lines 112ff). This 
> allows us recursive handling of error signals in error reporting code.

Thanks! I had no recollection that we did this. That saved me a lot of 
wasted time investigating :)

Cheers,
David

> Would we not do this, synchronous signals in error reporting code would 
> just tear down or infinitely hang (depending on the OS) the process.
> 
> Cheers, Thomas
> 
> 
>     If I get some spare time (ROTFL) I'll try to investigate further.
> 
>     Thanks,
>     David
>     -----
> 
>      > Have you run into an hs_err_pid file where the calling stack was just
>      > a bunch of report() calls? In the test, we limit ERROR_LOG_TIMEOUT to
>      > 16 seconds so there would be at most 16 report() calls on the stack
>      > before the total timeout is reached. The default for the
>     ErrorLogTimeout
>      > option is 120 seconds. So if we hit an ErrorLogTimeout, there can be
>      > up to 120 report() calls on the stack.
>      >
>      > Thanks for the review.
>      >
>      > Dan
>      >
>      >>
>      >> Thanks,
>      >> David
>      >>
>      >> On 29/05/2019 10:12 am, Daniel D. Daugherty wrote:
>      >>> Greetings,
>      >>>
>      >>> I have a fix for the following longstanding bug:
>      >>>
>      >>>      JDK-8188872
>      >>> runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails
>      >>> intermittently
>      >>> https://bugs.openjdk.java.net/browse/JDK-8188872
>      >>>
>      >>> I've include Thomas Stüfe directly since I'm modifying his code...
>      >>>
>      >>> This fix include changes to the error handling code, the VM parts
>      >>> of the test (-XX:+TestUnresponsiveErrorHandler) and the test
>     itself.
>      >>> The changes themselves are small, but the reasons are complicated
>      >>> so a detailed explanation is required.
>      >>>
>      >>> Summary of the changes:
>      >>>
>      >>> - src/hotspot/share/utilities/vmError.cpp
>      >>>    - add VMError::clear_step_start_time() and call it from the
>      >>>      error reporting END macro.
>      >>>      - VMError::report() is called twice: first, to generate a
>     summary
>      >>>        for stdout and second, to generate hs_err_pid output.
>      >>>      - Adding clear_step_start_time() prevents
>      >>> interrupt_reporting_thread()
>      >>>        from interrupting the error reporting thread between the
>     two
>      >>> calls to
>      >>>        VMError::report().
>      >>>      - This solves the problem where hs_err_pid file creation gets
>      >>> interrupted
>      >>>        and the hs_err_pid file ends up being created in
>      >>> /tmp/hs_err_pid...
>      >>>    - add a STEP in VMError::report() for setting up the 'start
>     time' for
>      >>>      the TestUnresponsiveErrorHandler test
>      >>>      - There is a corresponding change in
>     VMError::report_and_die() that
>      >>>        skips the call to record_reporting_start_time() when we are
>      >>>        executing TestUnresponsiveErrorHandler.
>      >>>      - This solves the problem where the error reporting thread is
>      >>> exposed
>      >>>        to interrupt_reporting_thread() calls before it has
>     reached the
>      >>>        first STEP in VMError::report().
>      >>>    - change VMError::check_timeout() to only call
>      >>> interrupt_reporting_thread()
>      >>>      once per timeout detection for either a total reporting
>     timeout
>      >>> or a
>      >>>      step timeout:
>      >>>      - check_timeout() is called by the WatcherThread once per
>     second
>      >>> once
>      >>>        it determines that errror reporting has started. This
>     change
>      >>> solves
>      >>>        the problem where a timeout is detected, the error
>     reporting
>      >>> thread
>      >>>        takes longer than a second to do its work so the
>     WatcherThread
>      >>> calls
>      >>>        check_timeout() (and interrupt_reporting_thread()) again
>     which
>      >>>        restarts the STEP we were on from the beginning.
>      >>> - src/hotspot/share/utilities/vmError.hpp
>      >>>    - add clear_step_start_time()
>      >>> -
>      >>>
>     test/hotspot/jtreg/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>      >>>    - add support for '-Dverbose=true' to get more verbose test
>     output
>      >>>    - Default ERROR_LOG_TIMEOUT is 16 seconds; Solaris sets it
>     to 3X.
>      >>>    - dump the cmd output if we can't find the 'hs_err_pid' file
>      >>>    - dump the cmd output if we can't open the 'hs_err_pid' file
>      >>>    - dump the hs_err_pid file if we fail to match the patterns
>      >>>
>      >>> Webrev URL:
>      >>> http://cr.openjdk.java.net/~dcubed/8188872-webrev/0-for-jdk-jdk13/
>      >>>
>      >>> Testing: Mach5 Tier[1-5]
>      >>>           Included the fix in my latest round of 8153224 testing
>      >>>           on Solaris-X64 where this bug reproduces quite a bit.
>      >>>
>      >>> Thanks, in advance, for any comments, suggestions, or questions.
>      >>>
>      >>> Dan
>      >
>