RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.
David Holmes
david.holmes at oracle.com
Thu Feb 23 07:38:47 UTC 2017
Looks good Thomas!
Thanks,
David
On 23/02/2017 5:32 PM, Thomas Stüfe wrote:
> Hi Chris,
>
> thank you for the detailed explanation, I modified TEST.groups in
> hotspot accordingly. I modified the webrev.04 in place.
>
> Kind Regards, Thomas
>
> On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <chris.plummer at oracle.com
> <mailto:chris.plummer at oracle.com>> wrote:
>
> On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>> Hi Chris, David,
>>
>> thanks for your patience. Let's wrap this up.
>>
>> New
>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/
>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/>
>>
>> Delta to last
>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/
>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/>
>>
>> Almost identical to the last version, only the typos Chris found
>> were fixed.
> Looks good!
>>
>> @David:
>>
>> >The test is quite a slow one and so we would like to exclude it from the
>> >hotspot_fast_runtime group in TEST.groups (there are other
>> ErrorHandling tests >excluded there already).
>>
>> This is fine by me. Do you need me to do something?
> What David said above. Look in hotspot/test/TEST.groups. Find
> hotspot_fast_runtime. Add the test to the list of tests to not run
> by putting a '-' in front of it. Currently there is:
>
> hotspot_fast_runtime = \
> runtime/ \
> -runtime/6626217/Test6626217.sh \
> -runtime/7100935 \
> -runtime/7158988/FieldMonitor.java \
> -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
> -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
> -runtime/ConstantPool/IntfMethod.java \
> -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
> -runtime/ErrorHandling/ErrorHandler.java \
> -runtime/logging/MonitorMismatchTest.java \
> -runtime/memory/ReserveMemory.java \
> ...
>
> So that means all tests in the runtime directory are included in the
> group, except the tests that follow with the '-' prefix. You can see
> there are already a couple of ErrorHandling tests being excluded, so
> just add yours after them.
>
> cheers,
>
> Chris
>
>>
>> Kind Regards, Thomas
>>
>>
>>
>> On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer
>> <chris.plummer at oracle.com <mailto:chris.plummer at oracle.com>> wrote:
>>
>> Hi Tomas,
>>
>> vmError_windows.cpp: Getting closer. Should be "cancellation".
>>
>> vmError.cpp: Should be "let's"
>>
>> 389 // global timeout, lets execute the timeout step five
>> times.
>>
>> And in response to your comment:
>>
>>> I'm usually not a big fan of putting internals into a header,
>>> so I tend to hide internals in a file-local static function.
>>> But if this is the way you prefer it, this is fine too.
>>
>> I'm no hotspot coding conventions expert, but I have
>> frequently seen statics declared in the class scope (private
>> or protected), and don't see much in the way of statics
>> declared in .cpp files (although I see plenty of static helper
>> functions in .cpp files). It seems more consistent to declare
>> them in the class scope, but I could be wrong here. I'm not
>> personally attached to one way or another.
>>
>> Other than that it looks good. No need for another review.
>>
>> thanks,
>>
>> Chris
>>
>>
>> On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>> Hi Chris, David,
>>>
>>> Thanks for your reviews. I hope I addressed all of your
>>> issues. See further details below.
>>>
>>> new webrev:
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/>
>>>
>>> delta to the last one:
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/>
>>>
>>>
>>> @Chris:
>>>
>>> > Most files need 2017 copyright updates.
>>>
>>> Done.
>>>
>>> >set_to_now() and get_timestamp() should have names that tie them together
>>> better. Maybe set_step_start_time() and
>>> get_step_start_time(). Also, they >should probably be methods
>>> of VMError and the statics should all be fields.
>>>
>>> Done. Added an accessor for every timeout value and a setter
>>> too (named "record" to indicate that instead of passing a
>>> value it uses the current timestamp).
>>>
>>> I'm usually not a big fan of putting internals into a header,
>>> so I tend to hide internals in a file-local static function.
>>> But if this is the way you prefer it, this is fine too.
>>>
>>> >In the first review I had mentioned that you only are allowing for one
>>> step to hang, and still be able to continue with subsequent
>>> steps:
>>>
>>> >1509 // A step times out after half of the total timeout. Steps are
>>> mostly fast unless they
>>> >1510 // hang for some reason, so this simple rule allows for one
>>> hanging step and still
>>> >1511 // leaves time enough for the rest of the steps to finish.
>>> >1512 const jlong end = step_start_time_l + ErrorLogTimeout *
>>> 1000 / 2;
>>> >
>>> >You mentioned that for your SAP version you divide by 4, allowing for up
>>> to 3 steps to hang. Is there a reason not to do this for the
>>> open version? In fact it >looks like
>>> TestUnresponsiveErrorHandler is relying on that being the
>>> case, since it will hang 3 times.
>>>
>>> I changed this to the value we use at SAP (make the little
>>> timeouts a quarter of the final timeout value). I now need to
>>> have five timeout-test-steps in the error handler instead of
>>> the three earlier. I hid them in a nice macro, so I hope they
>>> are not too much of an eye sore.
>>>
>>> >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
>>> one instance of step timeout pattern. Probably it should
>>> check for 3 instances, and >also one instance of the overall
>>> timeout pattern:
>>> >
>>> >1272 st->print_cr("------ Timout during error reporting after "
>>> INT64_FORMAT "ms. ------",
>>>
>>> I intentionally left this test to be very basic - see the
>>> comment in the test. Basically, I am afraid of giving you
>>> guys a test which sporadically fails if the timing is bad.
>>>
>>> As David pointed out, the final error message cannot be
>>> relied on - it should appear in 99% of all cases but may not
>>> because error reporting thread and WatcherThread race. I
>>> explicitly avoided adding synchronization there, because it
>>> is better to sometimes miss output in the hs-err file than
>>> for the VM to hang in error shutdown.
>>>
>>> For the same reason I do not test for all the little timeout
>>> messages to appear in the error log. I now just test for two
>>> messages of the theoretical three or four max, to see that
>>> step timeout works repeatedly.
>>>
>>> @David:
>>>
>>> >src/os/windows/vm/vmError_windows.cpp
>>>
>>> >Typo: canceleation
>>>
>>> Done.
>>>
>>> >As previously discussed Atomic::load/store for jlong do not depend on
>>> supports_cx8. The commentary in atomic.hpp should really
>>> refer to 'atomic read->modify-write operations on jlongs' -
>>> as every platform must support atomic 64-bit load/store to
>>> support Java volatile long/double. So set_to_now and
>>> >get_timestamp can be simplified, or even removed completely.
>>>
>>> Sure! I removed the cx8 dependency, which makes the code
>>> simpler, but decided to keep the accessor functions (renamed
>>> as get_.._start_time and record_.._start_time)
>>>
>>> >Does the timestamp really need to be the system time in millis or do
>>> you just want to record relative elapsed time? If the latter
>>> then I suggest using >javaTimeNanos()/100000 to avoid
>>> potential glitches with the system time changing.
>>>
>>> Did this. This exchanges the possibility of time glitches
>>> with possible initialization issues (see non-apple bsd path
>>> and linux, e.g.) where the hr timer api is dynamically
>>> loaded. But I agree, this is still better. The latter is less
>>> probable to hurt, only if we crash at exactly the right time
>>> during VM initialization.
>>>
>>> >382 // TestUnresponsiveErrorHandler: three times to trigger first a
>>> step timeout, then
>>> >383 // the global error reporting timeout.
>>>
>>> >I'm a little confused as to why three steps are needed to test two things??
>>>
>>> See my earlier explanations. Now we have the 1:4 ratio
>>> between little/final timeout values, so we need five steps.
>>>
>>> >1272 st->print_cr("------ Timout during error reporting after "
>>> INT64_FORMAT "ms. ------",
>>>
>>> >Typo: timout
>>>
>>> Done.
>>>
>>> >Also I gather we are potentially racing with the
>>> WatcherThread here, so no guarantee these final print
>>> statements will be seen?
>>>
>>> Yes. I decided against synchronization here. The only point
>>> of this timeout is to write a final message into the error
>>> log. If this is lost, it is not a big deal. Note that we also
>>> have the final error message on stderr done by the WatcherThread.
>>>
>>> >1502 VMError::interrupt_reporting_thread();
>>>
>>> >Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>>>
>>> Done.
>>>
>>> >src/share/vm/utilities/vmError.hpp
>>>
>>> >Typo: dependend
>>>
>>> Done (comments changed).
>>>
>>> > This doesn't quite read right to me. How about:..
>>>
>>> Thanks for your suggestion, I took your version.
>>>
>>> >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>
>>> >I think using @requires and runtime Platform checks is redundant. I prefer
>>> @requires for both debug build testing and not-windows.
>>>
>>> Done. Thanks for your suggestion.
>>>
>>> ---
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe
>>> <thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com>> wrote:
>>>
>>> Dear all,
>>>
>>> please review (or re-review) this enhancement. It was
>>> already reviewed extensively for JDK9 in Oct 16, but
>>> pushed back to JDK10. Now that JDK10 repo is open, I
>>> would like to close this issue.
>>>
>>> In short, this enhancement lets the hotspot cancel
>>> hanging error reporting steps (when writing the hs-err
>>> file), in order to allow subsequent error reporting steps
>>> to run. In our JVM variant, this has been active for a
>>> long time and has proven to make error reporting more
>>> stable in the face of deadlocks or file system stalls.
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>> <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>>
>>> Old Diskussion for JDK9:
>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>> <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>>
>>> The webrev is unchanged from version 2 discussed in
>>> october, just got rebased atop of JDK10/hs.
>>>
>>> Thank you,
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>
>>
>
>
More information about the hotspot-runtime-dev
mailing list