RFR(XS): 8188872: runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails intermittently
David Holmes
david.holmes at oracle.com
Mon Jun 3 22:09:28 UTC 2019
Hi Thomas,
On 3/06/2019 10:31 pm, Thomas Stüfe wrote:
> Hi David,
>
> On Fri, May 31, 2019 at 10:04 AM David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
> Hi Dan,
>
> On 30/05/2019 11:40 pm, Daniel D. Daugherty wrote:
> > On 5/30/19 1:09 AM, David Holmes wrote:
> >> Hi Dan,
> >>
> >> Reading this email and the bug report really made my head hurt.
> >
> > Yes, I can imagine. I've been working this bug off and on for
> > many months.
> >
> >
> >> I'm just left bemused. Are we really fixing something here or
> have we
> >> written a test that is just trying to be far too clever?
> >
> > Yes, we are fixing something things in the error reporting timeout
> > detection code. The fixes make that code more robust. In this case,
> > the feature is clever so the test itself had to be clever in order
> > to test the feature.
> >
> >
> >> That said your changes seem relatively innocuous.
> >
> > Thanks! I'll take that as a 'thumbs up'.
>
> Yes :)
>
> >
> >> One aspect that I can't work through is what is supposed to happen
> >> when the error reporting thread is sent the SIGILL. What does the
> >> signal handler do in response to that? If it returns then we
> will stay
> >> in the infinite_sleep. If it jumps back into error reporting then
> >> we're now potentially in nested signal-handling contexts - in which
> >> case if the WatcherThread tries to interrupt us again (because
> we're
> >> still not responsive enough) won't the SIGILL be blocked because
> the
> >> handler is still active (unless we explicitly unblock it in the
> >> handler) ?
> >
> > The SIGILL causes an error handler invocation on top of where the
> error
> > reporting thread is "stuck". That error handler invocation causes
> another
> > report() call which picks up at the next "STEP". In the baseline
> code,
> > that SIGILL is sent once per second which means if your system is
> > overloaded and it takes the current report() call longer than 1
> second
> > to finish, it gets interrupted. You end up with report() call on top
> > of report() call on top of report() call. So no the SIGILL
> doesn't seem
> > to be blocked.
>
> That is strange because as far as I can see it should be blocked unless
> we explicitly unblock it, or arrange for it not to be blocked in the
> first place.
>
>
> We explicitly unblock synchronous error signals both in the primary and
> secondary signal handler (see e.g. vmError_posix.cpp, lines 112ff). This
> allows us recursive handling of error signals in error reporting code.
Thanks! I had no recollection that we did this. That saved me a lot of
wasted time investigating :)
Cheers,
David
> Would we not do this, synchronous signals in error reporting code would
> just tear down or infinitely hang (depending on the OS) the process.
>
> Cheers, Thomas
>
>
> If I get some spare time (ROTFL) I'll try to investigate further.
>
> Thanks,
> David
> -----
>
> > Have you run into an hs_err_pid file where the calling stack was just
> > a bunch of report() calls? In the test, we limit ERROR_LOG_TIMEOUT to
> > 16 seconds so there would be at most 16 report() calls on the stack
> > before the total timeout is reached. The default for the
> ErrorLogTimeout
> > option is 120 seconds. So if we hit an ErrorLogTimeout, there can be
> > up to 120 report() calls on the stack.
> >
> > Thanks for the review.
> >
> > Dan
> >
> >>
> >> Thanks,
> >> David
> >>
> >> On 29/05/2019 10:12 am, Daniel D. Daugherty wrote:
> >>> Greetings,
> >>>
> >>> I have a fix for the following longstanding bug:
> >>>
> >>> JDK-8188872
> >>> runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails
> >>> intermittently
> >>> https://bugs.openjdk.java.net/browse/JDK-8188872
> >>>
> >>> I've include Thomas Stüfe directly since I'm modifying his code...
> >>>
> >>> This fix include changes to the error handling code, the VM parts
> >>> of the test (-XX:+TestUnresponsiveErrorHandler) and the test
> itself.
> >>> The changes themselves are small, but the reasons are complicated
> >>> so a detailed explanation is required.
> >>>
> >>> Summary of the changes:
> >>>
> >>> - src/hotspot/share/utilities/vmError.cpp
> >>> - add VMError::clear_step_start_time() and call it from the
> >>> error reporting END macro.
> >>> - VMError::report() is called twice: first, to generate a
> summary
> >>> for stdout and second, to generate hs_err_pid output.
> >>> - Adding clear_step_start_time() prevents
> >>> interrupt_reporting_thread()
> >>> from interrupting the error reporting thread between the
> two
> >>> calls to
> >>> VMError::report().
> >>> - This solves the problem where hs_err_pid file creation gets
> >>> interrupted
> >>> and the hs_err_pid file ends up being created in
> >>> /tmp/hs_err_pid...
> >>> - add a STEP in VMError::report() for setting up the 'start
> time' for
> >>> the TestUnresponsiveErrorHandler test
> >>> - There is a corresponding change in
> VMError::report_and_die() that
> >>> skips the call to record_reporting_start_time() when we are
> >>> executing TestUnresponsiveErrorHandler.
> >>> - This solves the problem where the error reporting thread is
> >>> exposed
> >>> to interrupt_reporting_thread() calls before it has
> reached the
> >>> first STEP in VMError::report().
> >>> - change VMError::check_timeout() to only call
> >>> interrupt_reporting_thread()
> >>> once per timeout detection for either a total reporting
> timeout
> >>> or a
> >>> step timeout:
> >>> - check_timeout() is called by the WatcherThread once per
> second
> >>> once
> >>> it determines that errror reporting has started. This
> change
> >>> solves
> >>> the problem where a timeout is detected, the error
> reporting
> >>> thread
> >>> takes longer than a second to do its work so the
> WatcherThread
> >>> calls
> >>> check_timeout() (and interrupt_reporting_thread()) again
> which
> >>> restarts the STEP we were on from the beginning.
> >>> - src/hotspot/share/utilities/vmError.hpp
> >>> - add clear_step_start_time()
> >>> -
> >>>
> test/hotspot/jtreg/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
> >>> - add support for '-Dverbose=true' to get more verbose test
> output
> >>> - Default ERROR_LOG_TIMEOUT is 16 seconds; Solaris sets it
> to 3X.
> >>> - dump the cmd output if we can't find the 'hs_err_pid' file
> >>> - dump the cmd output if we can't open the 'hs_err_pid' file
> >>> - dump the hs_err_pid file if we fail to match the patterns
> >>>
> >>> Webrev URL:
> >>> http://cr.openjdk.java.net/~dcubed/8188872-webrev/0-for-jdk-jdk13/
> >>>
> >>> Testing: Mach5 Tier[1-5]
> >>> Included the fix in my latest round of 8153224 testing
> >>> on Solaris-X64 where this bug reproduces quite a bit.
> >>>
> >>> Thanks, in advance, for any comments, suggestions, or questions.
> >>>
> >>> Dan
> >
>
More information about the hotspot-runtime-dev
mailing list