RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe thomas.stuefe at gmail.com
Fri Feb 3 12:52:35 UTC 2017


Hi David,

thanks for your review! I will prepare a new webrev. Please find some more
remarks inline.

On Fri, Feb 3, 2017 at 3:01 AM, David Holmes <david.holmes at oracle.com>
wrote:

> Hi Thomas,
>
> Thanks for picking this back up. Overall this looks okay. I have a few
> specific comments/suggestions below to add to Chris's.
>
> David
> -----
>
> src/os/windows/vm/vmError_windows.cpp
>
> Typo: canceleation
>
> ---
>
> src/share/vm/utilities/vmError.cpp
>
> As previously discussed Atomic::load/store for jlong do not depend on
> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
> read-modify-write operations on jlongs' - as every platform must support
> atomic 64-bit load/store to support Java volatile long/double. So
> set_to_now and get_timestamp can be simplified, or even removed completely.
>
> Does the timestamp really need to be the system time in millis or do you
> just want to record relative elapsed time? If the latter then I suggest
> using javaTimeNanos()/100000 to avoid potential glitches with the system
> time changing.
>

Okay, I will do this. I would have liked to avoid using high resolution
timers because of potential initialization problems (on some platforms the
underlying system APIs need to be dlsymed first) but after looking at the
implementations, I think this will be probably okay.


>
> 382   // TestUnresponsiveErrorHandler: three times to trigger first a step
> timeout, then
> 383   // the global error reporting timeout.
>
> I'm a little confused as to why three steps are needed to test two things??
>
>
TestUnresponsiveErrorHandler should trigger the step timeouts, then the
final global error reporting timeout. In that version, a step timeout was
1/2 of the global timeout, so in order to be safe I need to hang three
times, two times to trigger the step timeouts, a third time to trigger the
global timeout. If we lower the step timeout to 1/4 (like we did at SAP and
which Chris found reasonable), we will even need to execute the hanging
step 5 times. But I will wrap this into a small macro to make it easier on
the eye.


> 1217     reporting_started();
> 1218     set_to_now(&reporting_start_time);
>
> I think the setting of reporting_start_time should be internalized in
> reporting_started.
>
> Nit: Lets -> Let's  (it is a contraction of 'let us')
>
> 1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> Typo: timout
>
> Also I gather we are potentially racing with the WatcherThread here, so no
> guarantee these final print statements will be seen?
>
> 1502       VMError::interrupt_reporting_thread();
>
> Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>
> ---
>
> src/share/vm/utilities/vmError.hpp
>
> Typo: dependend
>
> +   // Called by WatcherThread to check if the currently running error
> reporting did timeout.
> +   //  Returns true if reporting did hit the global ErrorLogTimeout.
>
> This doesn't quite read right to me. How about:
>
> // Called by the WatcherThread to check if error reporting has timed-out.
> // Returns true if error reporting has not completed within the
> ErrorLogTimeout limit
>
> ---
>
> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>
> I think using @requires and runtime Platform checks is redundant. I prefer
> @requires for both debug build testing and not-windows.
>
>
All suggestions make sense - as do Chris' suggestions - and I will
incorporate them into the new webrev. May take some days though.

Kind Regards, Thomas


>
> On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error
>> reporting steps (when writing the hs-err file), in order to allow
>> subsequent error reporting steps to run. In our JVM variant, this has
>> been active for a long time and has proven to make error reporting more
>> stable in the face of deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.02/webrev/
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
>


More information about the hotspot-runtime-dev mailing list