RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Sun Feb 5 02:37:49 UTC 2017

On 4/02/2017 6:50 PM, Thomas Stüfe wrote:
> Hi Chris,
>
> On Fri, Feb 3, 2017 at 6:56 PM, Chris Plummer <chris.plummer at oracle.com
> <mailto:chris.plummer at oracle.com>> wrote:
>
>     On 2/3/17 4:52 AM, Thomas Stüfe wrote:
>>     Hi David,
>>
>>     thanks for your review! I will prepare a new webrev. Please find
>>     some more remarks inline.
>>
>>     On Fri, Feb 3, 2017 at 3:01 AM, David Holmes
>>     <david.holmes at oracle.com <mailto:david.holmes at oracle.com>> wrote:
>>
>>         Hi Thomas,
>>
>>         Thanks for picking this back up. Overall this looks okay. I
>>         have a few specific comments/suggestions below to add to Chris's.
>>
>>         David
>>         -----
>>
>>         src/os/windows/vm/vmError_windows.cpp
>>
>>         Typo: canceleation
>>
>>         ---
>>
>>         src/share/vm/utilities/vmError.cpp
>>
>>         As previously discussed Atomic::load/store for jlong do not
>>         depend on supports_cx8. The commentary in atomic.hpp should
>>         really refer to 'atomic read-modify-write operations on
>>         jlongs' - as every platform must support atomic 64-bit
>>         load/store to support Java volatile long/double. So set_to_now
>>         and get_timestamp can be simplified, or even removed completely.
>>
>>         Does the timestamp really need to be the system time in millis
>>         or do you just want to record relative elapsed time? If the
>>         latter then I suggest using javaTimeNanos()/100000 to avoid
>>         potential glitches with the system time changing.
>>
>>
>>     Okay, I will do this. I would have liked to avoid using high
>>     resolution timers because of potential initialization problems (on
>>     some platforms the underlying system APIs need to be dlsymed
>>     first) but after looking at the implementations, I think this will
>>     be probably okay.
>>
>>
>>
>>         382   // TestUnresponsiveErrorHandler: three times to trigger
>>         first a step timeout, then
>>         383   // the global error reporting timeout.
>>
>>         I'm a little confused as to why three steps are needed to test
>>         two things??
>>
>>
>>     TestUnresponsiveErrorHandler should trigger the step timeouts,
>>     then the final global error reporting timeout. In that version, a
>>     step timeout was 1/2 of the global timeout, so in order to be safe
>>     I need to hang three times, two times to trigger the step
>>     timeouts, a third time to trigger the global timeout. If we lower
>>     the step timeout to 1/4 (like we did at SAP and which Chris found
>>     reasonable), we will even need to execute the hanging step 5
>>     times. But I will wrap this into a small macro to make it easier
>>     on the eye.
>     I think with the step timeout being 1/2 the global timeout, you
>     would only need 2 steps to timeout in order to hit the global
>     timeout because presumably you'll have spent at least a tiny amount
>     of time in other steps already.
>
>
> Yes, you are of course right.

Yes but depending on how time is measured that "tiny amount" may or may 
not be observable. It has to be at least a millisecond if using 
currentTimeMillis.

David
-----

> Thomas
>
>
>     Chris
>
>>
>>
>>         1217     reporting_started();
>>         1218     set_to_now(&reporting_start_time);
>>
>>         I think the setting of reporting_start_time should be
>>         internalized in reporting_started.
>>
>>         Nit: Lets -> Let's  (it is a contraction of 'let us')
>>
>>         1272         st->print_cr("------ Timout during error
>>         reporting after " INT64_FORMAT "ms. ------",
>>
>>         Typo: timout
>>
>>         Also I gather we are potentially racing with the WatcherThread
>>         here, so no guarantee these final print statements will be seen?
>>
>>         1502       VMError::interrupt_reporting_thread();
>>
>>         Nit: you're inside a VMError function so shouldn't need the
>>         VMError prefix.
>>
>>         ---
>>
>>         src/share/vm/utilities/vmError.hpp
>>
>>         Typo: dependend
>>
>>         +   // Called by WatcherThread to check if the currently
>>         running error reporting did timeout.
>>         +   //  Returns true if reporting did hit the global
>>         ErrorLogTimeout.
>>
>>         This doesn't quite read right to me. How about:
>>
>>         // Called by the WatcherThread to check if error reporting has
>>         timed-out.
>>         // Returns true if error reporting has not completed within
>>         the ErrorLogTimeout limit
>>
>>         ---
>>
>>         test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>>         I think using @requires and runtime Platform checks is
>>         redundant. I prefer @requires for both debug build testing and
>>         not-windows.
>>
>>
>>     All suggestions make sense - as do Chris' suggestions - and I will
>>     incorporate them into the new webrev. May take some days though.
>>
>>     Kind Regards, Thomas
>>
>>
>>
>>         On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>>
>>             Dear all,
>>
>>             please review (or re-review) this enhancement. It was
>>             already reviewed
>>             extensively for JDK9 in Oct 16, but pushed back to JDK10.
>>             Now that JDK10
>>             repo is open, I would like to close this issue.
>>
>>             In short, this enhancement lets the hotspot cancel hanging
>>             error
>>             reporting steps (when writing the hs-err file), in order
>>             to allow
>>             subsequent error reporting steps to run. In our JVM
>>             variant, this has
>>             been active for a long time and has proven to make error
>>             reporting more
>>             stable in the face of deadlocks or file system stalls.
>>
>>             Issue:
>>             https://bugs.openjdk.java.net/browse/JDK-8166944
>>             <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>
>>             Webrev:
>>             http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>
>>             Old Diskussion for JDK9:
>>             http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>             <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>
>>             The webrev is unchanged from version 2 discussed in
>>             october, just got
>>             rebased atop of JDK10/hs.
>>
>>             Thank you,
>>
>>             Kind Regards, Thomas
>>
>>
>
>