RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes david.holmes at oracle.com
Thu Feb 23 07:38:47 UTC 2017


Looks good Thomas!

Thanks,
David

On 23/02/2017 5:32 PM, Thomas Stüfe wrote:
> Hi Chris,
>
> thank you for the detailed explanation, I modified TEST.groups in
> hotspot accordingly. I modified the webrev.04 in place.
>
> Kind Regards, Thomas
>
> On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <chris.plummer at oracle.com
> <mailto:chris.plummer at oracle.com>> wrote:
>
>     On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>>     Hi Chris, David,
>>
>>     thanks for your patience. Let's wrap this up.
>>
>>     New
>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/
>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/>
>>
>>     Delta to last
>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/
>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/>
>>
>>     Almost identical to the last version, only the typos Chris found
>>     were fixed.
>     Looks good!
>>
>>     @David:
>>
>>     >The test is quite a slow one and so we would like to exclude it from the
>>     >hotspot_fast_runtime group in TEST.groups (there are other
>>     ErrorHandling tests >excluded there already).
>>
>>     This is fine by me. Do you need me to do something?
>     What David said above. Look in hotspot/test/TEST.groups. Find
>     hotspot_fast_runtime. Add the test to the list of tests to not run
>     by putting a '-' in front of it. Currently there is:
>
>     hotspot_fast_runtime = \
>       runtime/ \
>      -runtime/6626217/Test6626217.sh \
>      -runtime/7100935 \
>      -runtime/7158988/FieldMonitor.java \
>      -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>      -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>      -runtime/ConstantPool/IntfMethod.java \
>      -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>      -runtime/ErrorHandling/ErrorHandler.java \
>      -runtime/logging/MonitorMismatchTest.java \
>      -runtime/memory/ReserveMemory.java \
>     ...
>
>     So that means all tests in the runtime directory are included in the
>     group, except the tests that follow with the '-' prefix. You can see
>     there are already a couple of ErrorHandling tests being excluded, so
>     just add yours after them.
>
>     cheers,
>
>     Chris
>
>>
>>     Kind Regards, Thomas
>>
>>
>>
>>     On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer
>>     <chris.plummer at oracle.com <mailto:chris.plummer at oracle.com>> wrote:
>>
>>         Hi Tomas,
>>
>>         vmError_windows.cpp: Getting closer. Should be "cancellation".
>>
>>         vmError.cpp: Should be "let's"
>>
>>          389   // global timeout, lets execute the timeout step five
>>         times.
>>
>>         And in response to your comment:
>>
>>>         I'm usually not a big fan of putting internals into a header,
>>>         so I tend to hide internals in a file-local static function.
>>>         But if this is the way you prefer it, this is fine too.
>>
>>         I'm no hotspot coding conventions expert, but I have
>>         frequently seen statics declared in the class scope (private
>>         or protected), and don't see much in the way of statics
>>         declared in .cpp files (although I see plenty of static helper
>>         functions in .cpp files). It seems more consistent to declare
>>         them in the class scope, but I could be wrong here. I'm not
>>         personally attached to one way or another.
>>
>>         Other than that it looks good. No need for another review.
>>
>>         thanks,
>>
>>         Chris
>>
>>
>>         On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>>         Hi Chris, David,
>>>
>>>         Thanks for your reviews. I hope I addressed all of your
>>>         issues. See further details below.
>>>
>>>         new webrev:
>>>
>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/>
>>>
>>>         delta to the last one:
>>>
>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/>
>>>
>>>
>>>         @Chris:
>>>
>>>         > Most files need 2017 copyright updates.
>>>
>>>         Done.
>>>
>>>         >set_to_now() and get_timestamp() should have names that tie them together
>>>         better. Maybe set_step_start_time() and
>>>         get_step_start_time(). Also, they >should probably be methods
>>>         of VMError and the statics should all be fields.
>>>
>>>         Done. Added an accessor for every timeout value and a setter
>>>         too (named "record" to indicate that instead of passing a
>>>         value it uses the current timestamp).
>>>
>>>         I'm usually not a big fan of putting internals into a header,
>>>         so I tend to hide internals in a file-local static function.
>>>         But if this is the way you prefer it, this is fine too.
>>>
>>>         >In the first review I had mentioned that you only are allowing for one
>>>         step to hang, and still be able to continue with subsequent
>>>         steps:
>>>
>>>         >1509     // A step times out after half of the total timeout. Steps are
>>>         mostly fast unless they
>>>         >1510     // hang for some reason, so this simple rule allows for one
>>>         hanging step and still
>>>         >1511     // leaves time enough for the rest of the steps to finish.
>>>         >1512     const jlong end = step_start_time_l + ErrorLogTimeout *
>>>         1000 / 2;
>>>         >
>>>         >You mentioned that for your SAP version you divide by 4, allowing for up
>>>         to 3 steps to hang. Is there a reason not to do this for the
>>>         open version? In fact it >looks like
>>>         TestUnresponsiveErrorHandler is relying on that being the
>>>         case, since it will hang 3 times.
>>>
>>>         I changed this to the value we use at SAP (make the little
>>>         timeouts a quarter of the final timeout value). I now need to
>>>         have five timeout-test-steps in the error handler instead of
>>>         the three earlier. I hid them in a nice macro, so I hope they
>>>         are not too much of an eye sore.
>>>
>>>         >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
>>>         one instance of step timeout pattern. Probably it should
>>>         check for 3 instances, and >also one instance of the overall
>>>         timeout pattern:
>>>         >
>>>         >1272         st->print_cr("------ Timout during error reporting after "
>>>         INT64_FORMAT "ms. ------",
>>>
>>>         I intentionally left this test to be very basic - see the
>>>         comment in the test. Basically, I am afraid of giving you
>>>         guys a test which sporadically fails if the timing is bad.
>>>
>>>          As David pointed out, the final error message cannot be
>>>         relied on - it should appear in 99% of all cases but may not
>>>         because error reporting thread and WatcherThread race. I
>>>         explicitly avoided adding synchronization there, because it
>>>         is better to sometimes miss output in the hs-err file than
>>>         for the VM to hang in error shutdown.
>>>
>>>         For the same reason I do not test for all the little timeout
>>>         messages to appear in the error log. I now just test for two
>>>         messages of the theoretical three or four max, to see that
>>>         step timeout works repeatedly.
>>>
>>>         @David:
>>>
>>>         >src/os/windows/vm/vmError_windows.cpp
>>>
>>>         >Typo: canceleation
>>>
>>>         Done.
>>>
>>>         >As previously discussed Atomic::load/store for jlong do not depend on
>>>         supports_cx8. The commentary in atomic.hpp should really
>>>         refer to 'atomic read->modify-write operations on jlongs' -
>>>         as every platform must support atomic 64-bit load/store to
>>>         support Java volatile long/double. So set_to_now and
>>>         >get_timestamp can be simplified, or even removed completely.
>>>
>>>         Sure! I removed the cx8 dependency, which makes the code
>>>         simpler, but decided to keep the accessor functions (renamed
>>>         as get_.._start_time and record_.._start_time)
>>>
>>>         >Does the timestamp really need to be the system time in millis or do
>>>         you just want to record relative elapsed time? If the latter
>>>         then I suggest using >javaTimeNanos()/100000 to avoid
>>>         potential glitches with the system time changing.
>>>
>>>         Did this. This exchanges the possibility of time glitches
>>>         with possible initialization issues (see non-apple bsd path
>>>         and linux, e.g.) where the hr timer api is dynamically
>>>         loaded. But I agree, this is still better. The latter is less
>>>         probable to hurt, only if we crash at exactly the right time
>>>         during VM initialization.
>>>
>>>         >382   // TestUnresponsiveErrorHandler: three times to trigger first a
>>>         step timeout, then
>>>         >383   // the global error reporting timeout.
>>>
>>>         >I'm a little confused as to why three steps are needed to test two things??
>>>
>>>         See my earlier explanations. Now we have the 1:4 ratio
>>>         between little/final timeout values, so we need five steps.
>>>
>>>         >1272         st->print_cr("------ Timout during error reporting after "
>>>         INT64_FORMAT "ms. ------",
>>>
>>>         >Typo: timout
>>>
>>>         Done.
>>>
>>>         >Also I gather we are potentially racing with the
>>>         WatcherThread here, so no guarantee these final print
>>>         statements will be seen?
>>>
>>>         Yes. I decided against synchronization here. The only point
>>>         of this timeout is to write a final message into the error
>>>         log. If this is lost, it is not a big deal. Note that we also
>>>         have the final error message on stderr done by the WatcherThread.
>>>
>>>         >1502       VMError::interrupt_reporting_thread();
>>>
>>>         >Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>>>
>>>         Done.
>>>
>>>         >src/share/vm/utilities/vmError.hpp
>>>
>>>         >Typo: dependend
>>>
>>>         Done (comments changed).
>>>
>>>         > This doesn't quite read right to me. How about:..
>>>
>>>         Thanks for your suggestion, I took your version.
>>>
>>>         >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>
>>>         >I think using @requires and runtime Platform checks is redundant. I prefer
>>>         @requires for both debug build testing and not-windows.
>>>
>>>         Done. Thanks for your suggestion.
>>>
>>>         ---
>>>
>>>         Kind Regards, Thomas
>>>
>>>
>>>         On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe
>>>         <thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com>> wrote:
>>>
>>>             Dear all,
>>>
>>>             please review (or re-review) this enhancement. It was
>>>             already reviewed extensively for JDK9 in Oct 16, but
>>>             pushed back to JDK10. Now that JDK10 repo is open, I
>>>             would like to close this issue.
>>>
>>>             In short, this enhancement lets the hotspot cancel
>>>             hanging error reporting steps (when writing the hs-err
>>>             file), in order to allow subsequent error reporting steps
>>>             to run. In our JVM variant, this has been active for a
>>>             long time and has proven to make error reporting more
>>>             stable in the face of deadlocks or file system stalls.
>>>
>>>             Issue:
>>>             https://bugs.openjdk.java.net/browse/JDK-8166944
>>>             <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>>
>>>             Webrev:
>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>>
>>>             Old Diskussion for JDK9:
>>>             http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>>             <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>>
>>>             The webrev is unchanged from version 2 discussed in
>>>             october, just got rebased atop of JDK10/hs.
>>>
>>>             Thank you,
>>>
>>>             Kind Regards, Thomas
>>>
>>>
>>
>>
>
>


More information about the hotspot-runtime-dev mailing list