RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes david.holmes at oracle.com
Tue Feb 7 20:09:38 UTC 2017


Hi Thomas,

The test is quite a slow one and so we would like to exclude it from the 
hotspot_fast_runtime group in TEST.groups (there are other ErrorHandling 
tests excluded there already).

Thanks,
David

On 7/02/2017 3:58 PM, David Holmes wrote:
> Hi Thomas,
>
> Other than the remaining typos Chris pointed out this seems fine.
>
> Thanks,
> David
>
> PS. We should be getting rid of many of the runtime loading cases eg
> clock_gettime, in JDK 10.
>
> PPS. Running this through JPRT just in case we overlooked something.
> Also need to see whether we need to exclude the test from
> hotspot_fast_runtime in TEST.groups.
>
> On 6/02/2017 9:18 PM, Thomas Stüfe wrote:
>> Hi Chris, David,
>>
>> Thanks for your reviews. I hope I addressed all of your issues. See
>> further details below.
>>
>> new webrev:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>>
>>
>> delta to the last one:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>>
>>
>>
>> @Chris:
>>
>>> Most files need 2017 copyright updates.
>>
>> Done.
>>
>>> set_to_now() and get_timestamp() should have names that tie them
>>> together better. Maybe set_step_start_time() and
>>> get_step_start_time(). Also, they >should probably be methods of
>>> VMError and the statics should all be fields.
>>
>> Done. Added an accessor for every timeout value and a setter too (named
>> "record" to indicate that instead of passing a value it uses the current
>> timestamp).
>>
>> I'm usually not a big fan of putting internals into a header, so I tend
>> to hide internals in a file-local static function. But if this is the
>> way you prefer it, this is fine too.
>>
>>> In the first review I had mentioned that you only are allowing for
>>> one step to hang, and still be able to continue with subsequent steps:
>>
>>> 1509     // A step times out after half of the total timeout. Steps
>>> are mostly fast unless they
>>> 1510     // hang for some reason, so this simple rule allows for one
>>> hanging step and still
>>> 1511     // leaves time enough for the rest of the steps to finish.
>>> 1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000
>>> / 2;
>>>
>>> You mentioned that for your SAP version you divide by 4, allowing for
>>> up to 3 steps to hang. Is there a reason not to do this for the open
>>> version? In fact it >looks like TestUnresponsiveErrorHandler is
>>> relying on that being the case, since it will hang 3 times.
>>
>> I changed this to the value we use at SAP (make the little timeouts a
>> quarter of the final timeout value). I now need to have five
>> timeout-test-steps in the error handler instead of the three earlier. I
>> hid them in a nice macro, so I hope they are not too much of an eye sore.
>>
>>> Related to this, TimeoutInErrorHandlingTest.java is >only checking
>>> for one instance of step timeout pattern. Probably
>> it should check for 3 instances, and >also one instance of the overall
>> timeout pattern:
>>>
>>> 1272         st->print_cr("------ Timout during error reporting after
>>> " INT64_FORMAT "ms. ------",
>>
>> I intentionally left this test to be very basic - see the comment in the
>> test. Basically, I am afraid of giving you guys a test which
>> sporadically fails if the timing is bad.
>>
>>  As David pointed out, the final error message cannot be relied on - it
>> should appear in 99% of all cases but may not because error reporting
>> thread and WatcherThread race. I explicitly avoided adding
>> synchronization there, because it is better to sometimes miss output in
>> the hs-err file than for the VM to hang in error shutdown.
>>
>> For the same reason I do not test for all the little timeout messages to
>> appear in the error log. I now just test for two messages of the
>> theoretical three or four max, to see that step timeout works repeatedly.
>>
>> @David:
>>
>>> src/os/windows/vm/vmError_windows.cpp
>>
>>> Typo: canceleation
>>
>> Done.
>>
>>> As previously discussed Atomic::load/store for jlong do not depend on
>>> supports_cx8. The commentary in atomic.hpp should really refer to
>>> 'atomic read->modify-write operations on jlongs' - as every platform
>>> must support atomic 64-bit load/store to support Java volatile
>>> long/double. So set_to_now and >get_timestamp can be simplified, or
>>> even removed completely.
>>
>> Sure! I removed the cx8 dependency, which makes the code simpler, but
>> decided to keep the accessor functions (renamed as get_.._start_time and
>> record_.._start_time)
>>
>>> Does the timestamp really need to be the system time in millis or do
>>> you just want to record relative elapsed time? If the latter then I
>>> suggest using >javaTimeNanos()/100000 to avoid potential glitches
>>> with the system time changing.
>>
>> Did this. This exchanges the possibility of time glitches with possible
>> initialization issues (see non-apple bsd path and linux, e.g.) where the
>> hr timer api is dynamically loaded. But I agree, this is still better.
>> The latter is less probable to hurt, only if we crash at exactly the
>> right time during VM initialization.
>>
>>> 382   // TestUnresponsiveErrorHandler: three times to trigger first a
>>> step timeout, then
>>> 383   // the global error reporting timeout.
>>
>>> I'm a little confused as to why three steps are needed to test two
>>> things??
>>
>> See my earlier explanations. Now we have the 1:4 ratio between
>> little/final timeout values, so we need five steps.
>>
>>> 1272         st->print_cr("------ Timout during error reporting after
>>> " INT64_FORMAT "ms. ------",
>>
>>> Typo: timout
>>
>> Done.
>>
>>> Also I gather we are potentially racing with the WatcherThread here, so
>> no guarantee these final print statements will be seen?
>>
>> Yes. I decided against synchronization here. The only point of this
>> timeout is to write a final message into the error log. If this is lost,
>> it is not a big deal. Note that we also have the final error message on
>> stderr done by the WatcherThread.
>>
>>> 1502       VMError::interrupt_reporting_thread();
>>
>>> Nit: you're inside a VMError function so shouldn't need the VMError
>>> prefix.
>>
>> Done.
>>
>>> src/share/vm/utilities/vmError.hpp
>>
>>> Typo: dependend
>>
>> Done (comments changed).
>>
>>> This doesn't quite read right to me. How about:..
>>
>> Thanks for your suggestion, I took your version.
>>
>>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>>> I think using @requires and runtime Platform checks is redundant. I
>>> prefer @requires for both debug build testing and not-windows.
>>
>> Done. Thanks for your suggestion.
>>
>> ---
>>
>> Kind Regards, Thomas
>>
>>
>> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com
>> <mailto:thomas.stuefe at gmail.com>> wrote:
>>
>>     Dear all,
>>
>>     please review (or re-review) this enhancement. It was already
>>     reviewed extensively for JDK9 in Oct 16, but pushed back to JDK10.
>>     Now that JDK10 repo is open, I would like to close this issue.
>>
>>     In short, this enhancement lets the hotspot cancel hanging error
>>     reporting steps (when writing the hs-err file), in order to allow
>>     subsequent error reporting steps to run. In our JVM variant, this
>>     has been active for a long time and has proven to make error
>>     reporting more stable in the face of deadlocks or file system stalls.
>>
>>     Issue:
>>     https://bugs.openjdk.java.net/browse/JDK-8166944
>>     <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>
>>     Webrev:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>
>>
>> <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>
>>
>>     Old Diskussion for JDK9:
>>
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>
>>
>> <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>
>>
>>     The webrev is unchanged from version 2 discussed in october, just
>>     got rebased atop of JDK10/hs.
>>
>>     Thank you,
>>
>>     Kind Regards, Thomas
>>
>>


More information about the hotspot-runtime-dev mailing list