RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe thomas.stuefe at gmail.com
Mon Feb 6 11:18:27 UTC 2017


Hi Chris, David,

Thanks for your reviews. I hope I addressed all of your issues. See further
details below.

new webrev:

http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/

delta to the last one:

http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/


@Chris:

> Most files need 2017 copyright updates.

Done.

>set_to_now() and get_timestamp() should have names that tie them together
better. Maybe set_step_start_time() and get_step_start_time(). Also, they
>should probably be methods of VMError and the statics should all be fields.

Done. Added an accessor for every timeout value and a setter too (named
"record" to indicate that instead of passing a value it uses the current
timestamp).

I'm usually not a big fan of putting internals into a header, so I tend to
hide internals in a file-local static function. But if this is the way you
prefer it, this is fine too.

>In the first review I had mentioned that you only are allowing for one
step to hang, and still be able to continue with subsequent steps:

>1509     // A step times out after half of the total timeout. Steps are
mostly fast unless they
>1510     // hang for some reason, so this simple rule allows for one
hanging step and still
>1511     // leaves time enough for the rest of the steps to finish.
>1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>
>You mentioned that for your SAP version you divide by 4, allowing for up
to 3 steps to hang. Is there a reason not to do this for the open version?
In fact it >looks like TestUnresponsiveErrorHandler is relying on that
being the case, since it will hang 3 times.

I changed this to the value we use at SAP (make the little timeouts a
quarter of the final timeout value). I now need to have five
timeout-test-steps in the error handler instead of the three earlier. I hid
them in a nice macro, so I hope they are not too much of an eye sore.

>Related to this, TimeoutInErrorHandlingTest.java is >only checking for one
instance of step timeout pattern. Probably it should check for 3 instances,
and >also one instance of the overall timeout pattern:
>
>1272         st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",

I intentionally left this test to be very basic - see the comment in the
test. Basically, I am afraid of giving you guys a test which sporadically
fails if the timing is bad.

 As David pointed out, the final error message cannot be relied on - it
should appear in 99% of all cases but may not because error reporting
thread and WatcherThread race. I explicitly avoided adding synchronization
there, because it is better to sometimes miss output in the hs-err file
than for the VM to hang in error shutdown.

For the same reason I do not test for all the little timeout messages to
appear in the error log. I now just test for two messages of the
theoretical three or four max, to see that step timeout works repeatedly.

@David:

>src/os/windows/vm/vmError_windows.cpp

>Typo: canceleation

Done.

>As previously discussed Atomic::load/store for jlong do not depend on
supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
read->modify-write operations on jlongs' - as every platform must support
atomic 64-bit load/store to support Java volatile long/double. So
set_to_now and >get_timestamp can be simplified, or even removed completely.

Sure! I removed the cx8 dependency, which makes the code simpler, but
decided to keep the accessor functions (renamed as get_.._start_time and
record_.._start_time)

>Does the timestamp really need to be the system time in millis or do you
just want to record relative elapsed time? If the latter then I suggest
using >javaTimeNanos()/100000 to avoid potential glitches with the system
time changing.

Did this. This exchanges the possibility of time glitches with possible
initialization issues (see non-apple bsd path and linux, e.g.) where the hr
timer api is dynamically loaded. But I agree, this is still better. The
latter is less probable to hurt, only if we crash at exactly the right time
during VM initialization.

>382   // TestUnresponsiveErrorHandler: three times to trigger first a step
timeout, then
>383   // the global error reporting timeout.

>I'm a little confused as to why three steps are needed to test two things??

See my earlier explanations. Now we have the 1:4 ratio between little/final
timeout values, so we need five steps.

>1272         st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",

>Typo: timout

Done.

>Also I gather we are potentially racing with the WatcherThread here, so no
guarantee these final print statements will be seen?

Yes. I decided against synchronization here. The only point of this timeout
is to write a final message into the error log. If this is lost, it is not
a big deal. Note that we also have the final error message on stderr done
by the WatcherThread.

>1502       VMError::interrupt_reporting_thread();

>Nit: you're inside a VMError function so shouldn't need the VMError prefix.

Done.

>src/share/vm/utilities/vmError.hpp

>Typo: dependend

Done (comments changed).

> This doesn't quite read right to me. How about:..

Thanks for your suggestion, I took your version.

>test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java

>I think using @requires and runtime Platform checks is redundant. I prefer
@requires for both debug build testing and not-windows.

Done. Thanks for your suggestion.

---

Kind Regards, Thomas


On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
wrote:

> Dear all,
>
> please review (or re-review) this enhancement. It was already reviewed
> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
> repo is open, I would like to close this issue.
>
> In short, this enhancement lets the hotspot cancel hanging error reporting
> steps (when writing the hs-err file), in order to allow subsequent error
> reporting steps to run. In our JVM variant, this has been active for a long
> time and has proven to make error reporting more stable in the face of
> deadlocks or file system stalls.
>
> Issue:
> https://bugs.openjdk.java.net/browse/JDK-8166944
>
> Webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
> Error-Reporting/jdk10-webrev.02/webrev/
>
> Old Diskussion for JDK9:
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
> 016-October/021582.html
>
> The webrev is unchanged from version 2 discussed in october, just got
> rebased atop of JDK10/hs.
>
> Thank you,
>
> Kind Regards, Thomas
>


More information about the hotspot-runtime-dev mailing list