RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe thomas.stuefe at gmail.com
Wed Feb 22 14:35:48 UTC 2017


Hi Chris, David,

thanks for your patience. Let's wrap this up.

New webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/

Delta to last webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/

Almost identical to the last version, only the typos Chris found were fixed.

@David:

>The test is quite a slow one and so we would like to exclude it from the
>hotspot_fast_runtime group in TEST.groups (there are other ErrorHandling
tests >excluded there already).

This is fine by me. Do you need me to do something?

Kind Regards, Thomas



On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer <chris.plummer at oracle.com>
wrote:

> Hi Tomas,
>
> vmError_windows.cpp: Getting closer. Should be "cancellation".
>
> vmError.cpp: Should be "let's"
>
>  389   // global timeout, lets execute the timeout step five times.
>
> And in response to your comment:
>
> I'm usually not a big fan of putting internals into a header, so I tend to
> hide internals in a file-local static function. But if this is the way you
> prefer it, this is fine too.
>
>
> I'm no hotspot coding conventions expert, but I have frequently seen
> statics declared in the class scope (private or protected), and don't see
> much in the way of statics declared in .cpp files (although I see plenty of
> static helper functions in .cpp files). It seems more consistent to declare
> them in the class scope, but I could be wrong here. I'm not personally
> attached to one way or another.
>
> Other than that it looks good. No need for another review.
>
> thanks,
>
> Chris
>
>
> On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>
> Hi Chris, David,
>
> Thanks for your reviews. I hope I addressed all of your issues. See
> further details below.
>
> new webrev:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-
> Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>
> delta to the last one:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-
> Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>
>
> @Chris:
>
> > Most files need 2017 copyright updates.
>
> Done.
>
> >set_to_now() and get_timestamp() should have names that tie them together
> better. Maybe set_step_start_time() and get_step_start_time(). Also, they
> >should probably be methods of VMError and the statics should all be fields.
>
> Done. Added an accessor for every timeout value and a setter too (named
> "record" to indicate that instead of passing a value it uses the current
> timestamp).
>
> I'm usually not a big fan of putting internals into a header, so I tend to
> hide internals in a file-local static function. But if this is the way you
> prefer it, this is fine too.
>
> >In the first review I had mentioned that you only are allowing for one
> step to hang, and still be able to continue with subsequent steps:
>
> >1509     // A step times out after half of the total timeout. Steps are
> mostly fast unless they
> >1510     // hang for some reason, so this simple rule allows for one
> hanging step and still
> >1511     // leaves time enough for the rest of the steps to finish.
> >1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
> >
> >You mentioned that for your SAP version you divide by 4, allowing for up
> to 3 steps to hang. Is there a reason not to do this for the open version?
> In fact it >looks like TestUnresponsiveErrorHandler is relying on that
> being the case, since it will hang 3 times.
>
> I changed this to the value we use at SAP (make the little timeouts a
> quarter of the final timeout value). I now need to have five
> timeout-test-steps in the error handler instead of the three earlier. I hid
> them in a nice macro, so I hope they are not too much of an eye sore.
>
> >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
> one instance of step timeout pattern. Probably it should check for 3
> instances, and >also one instance of the overall timeout pattern:
> >
> >1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> I intentionally left this test to be very basic - see the comment in the
> test. Basically, I am afraid of giving you guys a test which sporadically
> fails if the timing is bad.
>
>  As David pointed out, the final error message cannot be relied on - it
> should appear in 99% of all cases but may not because error reporting
> thread and WatcherThread race. I explicitly avoided adding synchronization
> there, because it is better to sometimes miss output in the hs-err file
> than for the VM to hang in error shutdown.
>
> For the same reason I do not test for all the little timeout messages to
> appear in the error log. I now just test for two messages of the
> theoretical three or four max, to see that step timeout works repeatedly.
>
> @David:
>
> >src/os/windows/vm/vmError_windows.cpp
>
> >Typo: canceleation
>
> Done.
>
> >As previously discussed Atomic::load/store for jlong do not depend on
> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
> read->modify-write operations on jlongs' - as every platform must support
> atomic 64-bit load/store to support Java volatile long/double. So
> set_to_now and >get_timestamp can be simplified, or even removed completely.
>
> Sure! I removed the cx8 dependency, which makes the code simpler, but
> decided to keep the accessor functions (renamed as get_.._start_time and
> record_.._start_time)
>
> >Does the timestamp really need to be the system time in millis or do you
> just want to record relative elapsed time? If the latter then I suggest
> using >javaTimeNanos()/100000 to avoid potential glitches with the system
> time changing.
>
> Did this. This exchanges the possibility of time glitches with possible
> initialization issues (see non-apple bsd path and linux, e.g.) where the hr
> timer api is dynamically loaded. But I agree, this is still better. The
> latter is less probable to hurt, only if we crash at exactly the right time
> during VM initialization.
>
> >382   // TestUnresponsiveErrorHandler: three times to trigger first a
> step timeout, then
> >383   // the global error reporting timeout.
>
> >I'm a little confused as to why three steps are needed to test two
> things??
>
> See my earlier explanations. Now we have the 1:4 ratio between
> little/final timeout values, so we need five steps.
>
> >1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> >Typo: timout
>
> Done.
>
> >Also I gather we are potentially racing with the WatcherThread here, so
> no guarantee these final print statements will be seen?
>
> Yes. I decided against synchronization here. The only point of this
> timeout is to write a final message into the error log. If this is lost, it
> is not a big deal. Note that we also have the final error message on stderr
> done by the WatcherThread.
>
> >1502       VMError::interrupt_reporting_thread();
>
> >Nit: you're inside a VMError function so shouldn't need the VMError
> prefix.
>
> Done.
>
> >src/share/vm/utilities/vmError.hpp
>
> >Typo: dependend
>
> Done (comments changed).
>
> > This doesn't quite read right to me. How about:..
>
> Thanks for your suggestion, I took your version.
>
> >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>
> >I think using @requires and runtime Platform checks is redundant. I
> prefer @requires for both debug build testing and not-windows.
>
> Done. Thanks for your suggestion.
>
> ---
>
> Kind Regards, Thomas
>
>
> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error
>> reporting steps (when writing the hs-err file), in order to allow
>> subsequent error reporting steps to run. In our JVM variant, this has been
>> active for a long time and has proven to make error reporting more stable
>> in the face of deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>> rror-Reporting/jdk10-webrev.02/webrev/
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
>
>
>


More information about the hotspot-runtime-dev mailing list