RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.
David Holmes
david.holmes at oracle.com
Sun Feb 5 02:37:49 UTC 2017
On 4/02/2017 6:50 PM, Thomas Stüfe wrote:
> Hi Chris,
>
> On Fri, Feb 3, 2017 at 6:56 PM, Chris Plummer <chris.plummer at oracle.com
> <mailto:chris.plummer at oracle.com>> wrote:
>
> On 2/3/17 4:52 AM, Thomas Stüfe wrote:
>> Hi David,
>>
>> thanks for your review! I will prepare a new webrev. Please find
>> some more remarks inline.
>>
>> On Fri, Feb 3, 2017 at 3:01 AM, David Holmes
>> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>> wrote:
>>
>> Hi Thomas,
>>
>> Thanks for picking this back up. Overall this looks okay. I
>> have a few specific comments/suggestions below to add to Chris's.
>>
>> David
>> -----
>>
>> src/os/windows/vm/vmError_windows.cpp
>>
>> Typo: canceleation
>>
>> ---
>>
>> src/share/vm/utilities/vmError.cpp
>>
>> As previously discussed Atomic::load/store for jlong do not
>> depend on supports_cx8. The commentary in atomic.hpp should
>> really refer to 'atomic read-modify-write operations on
>> jlongs' - as every platform must support atomic 64-bit
>> load/store to support Java volatile long/double. So set_to_now
>> and get_timestamp can be simplified, or even removed completely.
>>
>> Does the timestamp really need to be the system time in millis
>> or do you just want to record relative elapsed time? If the
>> latter then I suggest using javaTimeNanos()/100000 to avoid
>> potential glitches with the system time changing.
>>
>>
>> Okay, I will do this. I would have liked to avoid using high
>> resolution timers because of potential initialization problems (on
>> some platforms the underlying system APIs need to be dlsymed
>> first) but after looking at the implementations, I think this will
>> be probably okay.
>>
>>
>>
>> 382 // TestUnresponsiveErrorHandler: three times to trigger
>> first a step timeout, then
>> 383 // the global error reporting timeout.
>>
>> I'm a little confused as to why three steps are needed to test
>> two things??
>>
>>
>> TestUnresponsiveErrorHandler should trigger the step timeouts,
>> then the final global error reporting timeout. In that version, a
>> step timeout was 1/2 of the global timeout, so in order to be safe
>> I need to hang three times, two times to trigger the step
>> timeouts, a third time to trigger the global timeout. If we lower
>> the step timeout to 1/4 (like we did at SAP and which Chris found
>> reasonable), we will even need to execute the hanging step 5
>> times. But I will wrap this into a small macro to make it easier
>> on the eye.
> I think with the step timeout being 1/2 the global timeout, you
> would only need 2 steps to timeout in order to hit the global
> timeout because presumably you'll have spent at least a tiny amount
> of time in other steps already.
>
>
> Yes, you are of course right.
Yes but depending on how time is measured that "tiny amount" may or may
not be observable. It has to be at least a millisecond if using
currentTimeMillis.
David
-----
> Thomas
>
>
> Chris
>
>>
>>
>> 1217 reporting_started();
>> 1218 set_to_now(&reporting_start_time);
>>
>> I think the setting of reporting_start_time should be
>> internalized in reporting_started.
>>
>> Nit: Lets -> Let's (it is a contraction of 'let us')
>>
>> 1272 st->print_cr("------ Timout during error
>> reporting after " INT64_FORMAT "ms. ------",
>>
>> Typo: timout
>>
>> Also I gather we are potentially racing with the WatcherThread
>> here, so no guarantee these final print statements will be seen?
>>
>> 1502 VMError::interrupt_reporting_thread();
>>
>> Nit: you're inside a VMError function so shouldn't need the
>> VMError prefix.
>>
>> ---
>>
>> src/share/vm/utilities/vmError.hpp
>>
>> Typo: dependend
>>
>> + // Called by WatcherThread to check if the currently
>> running error reporting did timeout.
>> + // Returns true if reporting did hit the global
>> ErrorLogTimeout.
>>
>> This doesn't quite read right to me. How about:
>>
>> // Called by the WatcherThread to check if error reporting has
>> timed-out.
>> // Returns true if error reporting has not completed within
>> the ErrorLogTimeout limit
>>
>> ---
>>
>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>> I think using @requires and runtime Platform checks is
>> redundant. I prefer @requires for both debug build testing and
>> not-windows.
>>
>>
>> All suggestions make sense - as do Chris' suggestions - and I will
>> incorporate them into the new webrev. May take some days though.
>>
>> Kind Regards, Thomas
>>
>>
>>
>> On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was
>> already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10.
>> Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging
>> error
>> reporting steps (when writing the hs-err file), in order
>> to allow
>> subsequent error reporting steps to run. In our JVM
>> variant, this has
>> been active for a long time and has proven to make error
>> reporting more
>> stable in the face of deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>> <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>> <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>
>> The webrev is unchanged from version 2 discussed in
>> october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
>>
>
>
More information about the hotspot-runtime-dev
mailing list