RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Tue Feb 7 05:58:51 UTC 2017

Hi Thomas,

Other than the remaining typos Chris pointed out this seems fine.

Thanks,
David

PS. We should be getting rid of many of the runtime loading cases eg 
clock_gettime, in JDK 10.

PPS. Running this through JPRT just in case we overlooked something. 
Also need to see whether we need to exclude the test from 
hotspot_fast_runtime in TEST.groups.

On 6/02/2017 9:18 PM, Thomas Stüfe wrote:
> Hi Chris, David,
>
> Thanks for your reviews. I hope I addressed all of your issues. See
> further details below.
>
> new webrev:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>
> delta to the last one:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>
>
> @Chris:
>
>> Most files need 2017 copyright updates.
>
> Done.
>
>>set_to_now() and get_timestamp() should have names that tie them together better. Maybe set_step_start_time() and get_step_start_time(). Also, they >should probably be methods of VMError and the statics should all be fields.
>
> Done. Added an accessor for every timeout value and a setter too (named
> "record" to indicate that instead of passing a value it uses the current
> timestamp).
>
> I'm usually not a big fan of putting internals into a header, so I tend
> to hide internals in a file-local static function. But if this is the
> way you prefer it, this is fine too.
>
>>In the first review I had mentioned that you only are allowing for one step to hang, and still be able to continue with subsequent steps:
>
>>1509     // A step times out after half of the total timeout. Steps are mostly fast unless they
>>1510     // hang for some reason, so this simple rule allows for one hanging step and still
>>1511     // leaves time enough for the rest of the steps to finish.
>>1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>>
>>You mentioned that for your SAP version you divide by 4, allowing for up to 3 steps to hang. Is there a reason not to do this for the open version? In fact it >looks like TestUnresponsiveErrorHandler is relying on that being the case, since it will hang 3 times.
>
> I changed this to the value we use at SAP (make the little timeouts a
> quarter of the final timeout value). I now need to have five
> timeout-test-steps in the error handler instead of the three earlier. I
> hid them in a nice macro, so I hope they are not too much of an eye sore.
>
>>Related to this, TimeoutInErrorHandlingTest.java is >only checking for one instance of step timeout pattern. Probably
> it should check for 3 instances, and >also one instance of the overall
> timeout pattern:
>>
>>1272         st->print_cr("------ Timout during error reporting after " INT64_FORMAT "ms. ------",
>
> I intentionally left this test to be very basic - see the comment in the
> test. Basically, I am afraid of giving you guys a test which
> sporadically fails if the timing is bad.
>
>  As David pointed out, the final error message cannot be relied on - it
> should appear in 99% of all cases but may not because error reporting
> thread and WatcherThread race. I explicitly avoided adding
> synchronization there, because it is better to sometimes miss output in
> the hs-err file than for the VM to hang in error shutdown.
>
> For the same reason I do not test for all the little timeout messages to
> appear in the error log. I now just test for two messages of the
> theoretical three or four max, to see that step timeout works repeatedly.
>
> @David:
>
>>src/os/windows/vm/vmError_windows.cpp
>
>>Typo: canceleation
>
> Done.
>
>>As previously discussed Atomic::load/store for jlong do not depend on supports_cx8. The commentary in atomic.hpp should really refer to 'atomic read->modify-write operations on jlongs' - as every platform must support atomic 64-bit load/store to support Java volatile long/double. So set_to_now and >get_timestamp can be simplified, or even removed completely.
>
> Sure! I removed the cx8 dependency, which makes the code simpler, but
> decided to keep the accessor functions (renamed as get_.._start_time and
> record_.._start_time)
>
>>Does the timestamp really need to be the system time in millis or do you just want to record relative elapsed time? If the latter then I suggest using >javaTimeNanos()/100000 to avoid potential glitches with the system time changing.
>
> Did this. This exchanges the possibility of time glitches with possible
> initialization issues (see non-apple bsd path and linux, e.g.) where the
> hr timer api is dynamically loaded. But I agree, this is still better.
> The latter is less probable to hurt, only if we crash at exactly the
> right time during VM initialization.
>
>>382   // TestUnresponsiveErrorHandler: three times to trigger first a step timeout, then
>>383   // the global error reporting timeout.
>
>>I'm a little confused as to why three steps are needed to test two things??
>
> See my earlier explanations. Now we have the 1:4 ratio between
> little/final timeout values, so we need five steps.
>
>>1272         st->print_cr("------ Timout during error reporting after " INT64_FORMAT "ms. ------",
>
>>Typo: timout
>
> Done.
>
>>Also I gather we are potentially racing with the WatcherThread here, so
> no guarantee these final print statements will be seen?
>
> Yes. I decided against synchronization here. The only point of this
> timeout is to write a final message into the error log. If this is lost,
> it is not a big deal. Note that we also have the final error message on
> stderr done by the WatcherThread.
>
>>1502       VMError::interrupt_reporting_thread();
>
>>Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>
> Done.
>
>>src/share/vm/utilities/vmError.hpp
>
>>Typo: dependend
>
> Done (comments changed).
>
>> This doesn't quite read right to me. How about:..
>
> Thanks for your suggestion, I took your version.
>
>>test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>
>>I think using @requires and runtime Platform checks is redundant. I prefer @requires for both debug build testing and not-windows.
>
> Done. Thanks for your suggestion.
>
> ---
>
> Kind Regards, Thomas
>
>
> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com
> <mailto:thomas.stuefe at gmail.com>> wrote:
>
>     Dear all,
>
>     please review (or re-review) this enhancement. It was already
>     reviewed extensively for JDK9 in Oct 16, but pushed back to JDK10.
>     Now that JDK10 repo is open, I would like to close this issue.
>
>     In short, this enhancement lets the hotspot cancel hanging error
>     reporting steps (when writing the hs-err file), in order to allow
>     subsequent error reporting steps to run. In our JVM variant, this
>     has been active for a long time and has proven to make error
>     reporting more stable in the face of deadlocks or file system stalls.
>
>     Issue:
>     https://bugs.openjdk.java.net/browse/JDK-8166944
>     <https://bugs.openjdk.java.net/browse/JDK-8166944>
>
>     Webrev:
>     http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>     <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>
>     Old Diskussion for JDK9:
>     http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>     <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>
>     The webrev is unchanged from version 2 discussed in october, just
>     got rebased atop of JDK10/hs.
>
>     Thank you,
>
>     Kind Regards, Thomas
>
>