RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.
Thomas Stüfe
thomas.stuefe at gmail.com
Mon Feb 6 11:18:27 UTC 2017
Hi Chris, David,
Thanks for your reviews. I hope I addressed all of your issues. See further
details below.
new webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
delta to the last one:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
@Chris:
> Most files need 2017 copyright updates.
Done.
>set_to_now() and get_timestamp() should have names that tie them together
better. Maybe set_step_start_time() and get_step_start_time(). Also, they
>should probably be methods of VMError and the statics should all be fields.
Done. Added an accessor for every timeout value and a setter too (named
"record" to indicate that instead of passing a value it uses the current
timestamp).
I'm usually not a big fan of putting internals into a header, so I tend to
hide internals in a file-local static function. But if this is the way you
prefer it, this is fine too.
>In the first review I had mentioned that you only are allowing for one
step to hang, and still be able to continue with subsequent steps:
>1509 // A step times out after half of the total timeout. Steps are
mostly fast unless they
>1510 // hang for some reason, so this simple rule allows for one
hanging step and still
>1511 // leaves time enough for the rest of the steps to finish.
>1512 const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>
>You mentioned that for your SAP version you divide by 4, allowing for up
to 3 steps to hang. Is there a reason not to do this for the open version?
In fact it >looks like TestUnresponsiveErrorHandler is relying on that
being the case, since it will hang 3 times.
I changed this to the value we use at SAP (make the little timeouts a
quarter of the final timeout value). I now need to have five
timeout-test-steps in the error handler instead of the three earlier. I hid
them in a nice macro, so I hope they are not too much of an eye sore.
>Related to this, TimeoutInErrorHandlingTest.java is >only checking for one
instance of step timeout pattern. Probably it should check for 3 instances,
and >also one instance of the overall timeout pattern:
>
>1272 st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",
I intentionally left this test to be very basic - see the comment in the
test. Basically, I am afraid of giving you guys a test which sporadically
fails if the timing is bad.
As David pointed out, the final error message cannot be relied on - it
should appear in 99% of all cases but may not because error reporting
thread and WatcherThread race. I explicitly avoided adding synchronization
there, because it is better to sometimes miss output in the hs-err file
than for the VM to hang in error shutdown.
For the same reason I do not test for all the little timeout messages to
appear in the error log. I now just test for two messages of the
theoretical three or four max, to see that step timeout works repeatedly.
@David:
>src/os/windows/vm/vmError_windows.cpp
>Typo: canceleation
Done.
>As previously discussed Atomic::load/store for jlong do not depend on
supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
read->modify-write operations on jlongs' - as every platform must support
atomic 64-bit load/store to support Java volatile long/double. So
set_to_now and >get_timestamp can be simplified, or even removed completely.
Sure! I removed the cx8 dependency, which makes the code simpler, but
decided to keep the accessor functions (renamed as get_.._start_time and
record_.._start_time)
>Does the timestamp really need to be the system time in millis or do you
just want to record relative elapsed time? If the latter then I suggest
using >javaTimeNanos()/100000 to avoid potential glitches with the system
time changing.
Did this. This exchanges the possibility of time glitches with possible
initialization issues (see non-apple bsd path and linux, e.g.) where the hr
timer api is dynamically loaded. But I agree, this is still better. The
latter is less probable to hurt, only if we crash at exactly the right time
during VM initialization.
>382 // TestUnresponsiveErrorHandler: three times to trigger first a step
timeout, then
>383 // the global error reporting timeout.
>I'm a little confused as to why three steps are needed to test two things??
See my earlier explanations. Now we have the 1:4 ratio between little/final
timeout values, so we need five steps.
>1272 st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",
>Typo: timout
Done.
>Also I gather we are potentially racing with the WatcherThread here, so no
guarantee these final print statements will be seen?
Yes. I decided against synchronization here. The only point of this timeout
is to write a final message into the error log. If this is lost, it is not
a big deal. Note that we also have the final error message on stderr done
by the WatcherThread.
>1502 VMError::interrupt_reporting_thread();
>Nit: you're inside a VMError function so shouldn't need the VMError prefix.
Done.
>src/share/vm/utilities/vmError.hpp
>Typo: dependend
Done (comments changed).
> This doesn't quite read right to me. How about:..
Thanks for your suggestion, I took your version.
>test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>I think using @requires and runtime Platform checks is redundant. I prefer
@requires for both debug build testing and not-windows.
Done. Thanks for your suggestion.
---
Kind Regards, Thomas
On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
wrote:
> Dear all,
>
> please review (or re-review) this enhancement. It was already reviewed
> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
> repo is open, I would like to close this issue.
>
> In short, this enhancement lets the hotspot cancel hanging error reporting
> steps (when writing the hs-err file), in order to allow subsequent error
> reporting steps to run. In our JVM variant, this has been active for a long
> time and has proven to make error reporting more stable in the face of
> deadlocks or file system stalls.
>
> Issue:
> https://bugs.openjdk.java.net/browse/JDK-8166944
>
> Webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
> Error-Reporting/jdk10-webrev.02/webrev/
>
> Old Diskussion for JDK9:
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
> 016-October/021582.html
>
> The webrev is unchanged from version 2 discussed in october, just got
> rebased atop of JDK10/hs.
>
> Thank you,
>
> Kind Regards, Thomas
>
More information about the hotspot-runtime-dev
mailing list