RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Sat Feb 4 08:50:58 UTC 2017

Hi Chris,

On Fri, Feb 3, 2017 at 6:56 PM, Chris Plummer <chris.plummer at oracle.com>
wrote:

> On 2/3/17 4:52 AM, Thomas Stüfe wrote:
>
> Hi David,
>
> thanks for your review! I will prepare a new webrev. Please find some more
> remarks inline.
>
> On Fri, Feb 3, 2017 at 3:01 AM, David Holmes <david.holmes at oracle.com>
> wrote:
>
>> Hi Thomas,
>>
>> Thanks for picking this back up. Overall this looks okay. I have a few
>> specific comments/suggestions below to add to Chris's.
>>
>> David
>> -----
>>
>> src/os/windows/vm/vmError_windows.cpp
>>
>> Typo: canceleation
>>
>> ---
>>
>> src/share/vm/utilities/vmError.cpp
>>
>> As previously discussed Atomic::load/store for jlong do not depend on
>> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
>> read-modify-write operations on jlongs' - as every platform must support
>> atomic 64-bit load/store to support Java volatile long/double. So
>> set_to_now and get_timestamp can be simplified, or even removed completely.
>>
>> Does the timestamp really need to be the system time in millis or do you
>> just want to record relative elapsed time? If the latter then I suggest
>> using javaTimeNanos()/100000 to avoid potential glitches with the system
>> time changing.
>>
>
> Okay, I will do this. I would have liked to avoid using high resolution
> timers because of potential initialization problems (on some platforms the
> underlying system APIs need to be dlsymed first) but after looking at the
> implementations, I think this will be probably okay.
>
>
>>
>> 382   // TestUnresponsiveErrorHandler: three times to trigger first a
>> step timeout, then
>> 383   // the global error reporting timeout.
>>
>> I'm a little confused as to why three steps are needed to test two
>> things??
>>
>>
> TestUnresponsiveErrorHandler should trigger the step timeouts, then the
> final global error reporting timeout. In that version, a step timeout was
> 1/2 of the global timeout, so in order to be safe I need to hang three
> times, two times to trigger the step timeouts, a third time to trigger the
> global timeout. If we lower the step timeout to 1/4 (like we did at SAP and
> which Chris found reasonable), we will even need to execute the hanging
> step 5 times. But I will wrap this into a small macro to make it easier on
> the eye.
>
> I think with the step timeout being 1/2 the global timeout, you would only
> need 2 steps to timeout in order to hit the global timeout because
> presumably you'll have spent at least a tiny amount of time in other steps
> already.
>
>
Yes, you are of course right.

Thomas

> Chris
>
>
>
>> 1217     reporting_started();
>> 1218     set_to_now(&reporting_start_time);
>>
>> I think the setting of reporting_start_time should be internalized in
>> reporting_started.
>>
>> Nit: Lets -> Let's  (it is a contraction of 'let us')
>>
>> 1272         st->print_cr("------ Timout during error reporting after "
>> INT64_FORMAT "ms. ------",
>>
>> Typo: timout
>>
>> Also I gather we are potentially racing with the WatcherThread here, so
>> no guarantee these final print statements will be seen?
>>
>> 1502       VMError::interrupt_reporting_thread();
>>
>> Nit: you're inside a VMError function so shouldn't need the VMError
>> prefix.
>>
>> ---
>>
>> src/share/vm/utilities/vmError.hpp
>>
>> Typo: dependend
>>
>> +   // Called by WatcherThread to check if the currently running error
>> reporting did timeout.
>> +   //  Returns true if reporting did hit the global ErrorLogTimeout.
>>
>> This doesn't quite read right to me. How about:
>>
>> // Called by the WatcherThread to check if error reporting has timed-out.
>> // Returns true if error reporting has not completed within the
>> ErrorLogTimeout limit
>>
>> ---
>>
>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>> I think using @requires and runtime Platform checks is redundant. I
>> prefer @requires for both debug build testing and not-windows.
>>
>>
> All suggestions make sense - as do Chris' suggestions - and I will
> incorporate them into the new webrev. May take some days though.
>
> Kind Regards, Thomas
>
>
>>
>> On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>>
>>> Dear all,
>>>
>>> please review (or re-review) this enhancement. It was already reviewed
>>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>>> repo is open, I would like to close this issue.
>>>
>>> In short, this enhancement lets the hotspot cancel hanging error
>>> reporting steps (when writing the hs-err file), in order to allow
>>> subsequent error reporting steps to run. In our JVM variant, this has
>>> been active for a long time and has proven to make error reporting more
>>> stable in the face of deadlocks or file system stalls.
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>>> rror-Reporting/jdk10-webrev.02/webrev/
>>>
>>> Old Diskussion for JDK9:
>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>>> 016-October/021582.html
>>>
>>> The webrev is unchanged from version 2 discussed in october, just got
>>> rebased atop of JDK10/hs.
>>>
>>> Thank you,
>>>
>>> Kind Regards, Thomas
>>>
>>
>
>