RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thu Feb 23 07:39:35 UTC 2017

Thank you both! As always, I need a sponsor :)

..Thomas

On Thu, Feb 23, 2017 at 8:38 AM, David Holmes <david.holmes at oracle.com>
wrote:

> Looks good Thomas!
>
> Thanks,
> David
>
> On 23/02/2017 5:32 PM, Thomas Stüfe wrote:
>
>> Hi Chris,
>>
>> thank you for the detailed explanation, I modified TEST.groups in
>> hotspot accordingly. I modified the webrev.04 in place.
>>
>> Kind Regards, Thomas
>>
>> On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <chris.plummer at oracle.com
>> <mailto:chris.plummer at oracle.com>> wrote:
>>
>>     On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>>
>>>     Hi Chris, David,
>>>
>>>     thanks for your patience. Let's wrap this up.
>>>
>>>     New
>>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>> Error-Reporting/jdk10-webrev.04/webrev/
>>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>> g-Error-Reporting/jdk10-webrev.04/webrev/>
>>>
>>>     Delta to last
>>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>> Error-Reporting/jdk10-webrev.04-delta/webrev/
>>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>> g-Error-Reporting/jdk10-webrev.04-delta/webrev/>
>>>
>>>
>>>     Almost identical to the last version, only the typos Chris found
>>>     were fixed.
>>>
>>     Looks good!
>>
>>>
>>>     @David:
>>>
>>>     >The test is quite a slow one and so we would like to exclude it
>>> from the
>>>     >hotspot_fast_runtime group in TEST.groups (there are other
>>>     ErrorHandling tests >excluded there already).
>>>
>>>     This is fine by me. Do you need me to do something?
>>>
>>     What David said above. Look in hotspot/test/TEST.groups. Find
>>     hotspot_fast_runtime. Add the test to the list of tests to not run
>>     by putting a '-' in front of it. Currently there is:
>>
>>     hotspot_fast_runtime = \
>>       runtime/ \
>>      -runtime/6626217/Test6626217.sh \
>>      -runtime/7100935 \
>>      -runtime/7158988/FieldMonitor.java \
>>      -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>>      -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>>      -runtime/ConstantPool/IntfMethod.java \
>>      -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>>      -runtime/ErrorHandling/ErrorHandler.java \
>>      -runtime/logging/MonitorMismatchTest.java \
>>      -runtime/memory/ReserveMemory.java \
>>     ...
>>
>>     So that means all tests in the runtime directory are included in the
>>     group, except the tests that follow with the '-' prefix. You can see
>>     there are already a couple of ErrorHandling tests being excluded, so
>>     just add yours after them.
>>
>>     cheers,
>>
>>     Chris
>>
>>
>>>     Kind Regards, Thomas
>>>
>>>
>>>
>>>     On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer
>>>     <chris.plummer at oracle.com <mailto:chris.plummer at oracle.com>> wrote:
>>>
>>>         Hi Tomas,
>>>
>>>         vmError_windows.cpp: Getting closer. Should be "cancellation".
>>>
>>>         vmError.cpp: Should be "let's"
>>>
>>>          389   // global timeout, lets execute the timeout step five
>>>         times.
>>>
>>>         And in response to your comment:
>>>
>>>         I'm usually not a big fan of putting internals into a header,
>>>>         so I tend to hide internals in a file-local static function.
>>>>         But if this is the way you prefer it, this is fine too.
>>>>
>>>
>>>         I'm no hotspot coding conventions expert, but I have
>>>         frequently seen statics declared in the class scope (private
>>>         or protected), and don't see much in the way of statics
>>>         declared in .cpp files (although I see plenty of static helper
>>>         functions in .cpp files). It seems more consistent to declare
>>>         them in the class scope, but I could be wrong here. I'm not
>>>         personally attached to one way or another.
>>>
>>>         Other than that it looks good. No need for another review.
>>>
>>>         thanks,
>>>
>>>         Chris
>>>
>>>
>>>         On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>>
>>>>         Hi Chris, David,
>>>>
>>>>         Thanks for your reviews. I hope I addressed all of your
>>>>         issues. See further details below.
>>>>
>>>>         new webrev:
>>>>
>>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.03/webrev/
>>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>>> g-Error-Reporting/jdk10-webrev.03/webrev/>
>>>>
>>>>         delta to the last one:
>>>>
>>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.03-delta/webrev/
>>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>>> g-Error-Reporting/jdk10-webrev.03-delta/webrev/>
>>>>
>>>>
>>>>
>>>>         @Chris:
>>>>
>>>>         > Most files need 2017 copyright updates.
>>>>
>>>>         Done.
>>>>
>>>>         >set_to_now() and get_timestamp() should have names that tie
>>>> them together
>>>>         better. Maybe set_step_start_time() and
>>>>         get_step_start_time(). Also, they >should probably be methods
>>>>         of VMError and the statics should all be fields.
>>>>
>>>>         Done. Added an accessor for every timeout value and a setter
>>>>         too (named "record" to indicate that instead of passing a
>>>>         value it uses the current timestamp).
>>>>
>>>>         I'm usually not a big fan of putting internals into a header,
>>>>         so I tend to hide internals in a file-local static function.
>>>>         But if this is the way you prefer it, this is fine too.
>>>>
>>>>         >In the first review I had mentioned that you only are allowing
>>>> for one
>>>>         step to hang, and still be able to continue with subsequent
>>>>         steps:
>>>>
>>>>         >1509     // A step times out after half of the total timeout.
>>>> Steps are
>>>>         mostly fast unless they
>>>>         >1510     // hang for some reason, so this simple rule allows
>>>> for one
>>>>         hanging step and still
>>>>         >1511     // leaves time enough for the rest of the steps to
>>>> finish.
>>>>         >1512     const jlong end = step_start_time_l + ErrorLogTimeout
>>>> *
>>>>         1000 / 2;
>>>>         >
>>>>         >You mentioned that for your SAP version you divide by 4,
>>>> allowing for up
>>>>         to 3 steps to hang. Is there a reason not to do this for the
>>>>         open version? In fact it >looks like
>>>>         TestUnresponsiveErrorHandler is relying on that being the
>>>>         case, since it will hang 3 times.
>>>>
>>>>         I changed this to the value we use at SAP (make the little
>>>>         timeouts a quarter of the final timeout value). I now need to
>>>>         have five timeout-test-steps in the error handler instead of
>>>>         the three earlier. I hid them in a nice macro, so I hope they
>>>>         are not too much of an eye sore.
>>>>
>>>>         >Related to this, TimeoutInErrorHandlingTest.java is >only
>>>> checking for
>>>>         one instance of step timeout pattern. Probably it should
>>>>         check for 3 instances, and >also one instance of the overall
>>>>         timeout pattern:
>>>>         >
>>>>         >1272         st->print_cr("------ Timout during error
>>>> reporting after "
>>>>         INT64_FORMAT "ms. ------",
>>>>
>>>>         I intentionally left this test to be very basic - see the
>>>>         comment in the test. Basically, I am afraid of giving you
>>>>         guys a test which sporadically fails if the timing is bad.
>>>>
>>>>          As David pointed out, the final error message cannot be
>>>>         relied on - it should appear in 99% of all cases but may not
>>>>         because error reporting thread and WatcherThread race. I
>>>>         explicitly avoided adding synchronization there, because it
>>>>         is better to sometimes miss output in the hs-err file than
>>>>         for the VM to hang in error shutdown.
>>>>
>>>>         For the same reason I do not test for all the little timeout
>>>>         messages to appear in the error log. I now just test for two
>>>>         messages of the theoretical three or four max, to see that
>>>>         step timeout works repeatedly.
>>>>
>>>>         @David:
>>>>
>>>>         >src/os/windows/vm/vmError_windows.cpp
>>>>
>>>>         >Typo: canceleation
>>>>
>>>>         Done.
>>>>
>>>>         >As previously discussed Atomic::load/store for jlong do not
>>>> depend on
>>>>         supports_cx8. The commentary in atomic.hpp should really
>>>>         refer to 'atomic read->modify-write operations on jlongs' -
>>>>         as every platform must support atomic 64-bit load/store to
>>>>         support Java volatile long/double. So set_to_now and
>>>>         >get_timestamp can be simplified, or even removed completely.
>>>>
>>>>         Sure! I removed the cx8 dependency, which makes the code
>>>>         simpler, but decided to keep the accessor functions (renamed
>>>>         as get_.._start_time and record_.._start_time)
>>>>
>>>>         >Does the timestamp really need to be the system time in millis
>>>> or do
>>>>         you just want to record relative elapsed time? If the latter
>>>>         then I suggest using >javaTimeNanos()/100000 to avoid
>>>>         potential glitches with the system time changing.
>>>>
>>>>         Did this. This exchanges the possibility of time glitches
>>>>         with possible initialization issues (see non-apple bsd path
>>>>         and linux, e.g.) where the hr timer api is dynamically
>>>>         loaded. But I agree, this is still better. The latter is less
>>>>         probable to hurt, only if we crash at exactly the right time
>>>>         during VM initialization.
>>>>
>>>>         >382   // TestUnresponsiveErrorHandler: three times to trigger
>>>> first a
>>>>         step timeout, then
>>>>         >383   // the global error reporting timeout.
>>>>
>>>>         >I'm a little confused as to why three steps are needed to test
>>>> two things??
>>>>
>>>>         See my earlier explanations. Now we have the 1:4 ratio
>>>>         between little/final timeout values, so we need five steps.
>>>>
>>>>         >1272         st->print_cr("------ Timout during error
>>>> reporting after "
>>>>         INT64_FORMAT "ms. ------",
>>>>
>>>>         >Typo: timout
>>>>
>>>>         Done.
>>>>
>>>>         >Also I gather we are potentially racing with the
>>>>         WatcherThread here, so no guarantee these final print
>>>>         statements will be seen?
>>>>
>>>>         Yes. I decided against synchronization here. The only point
>>>>         of this timeout is to write a final message into the error
>>>>         log. If this is lost, it is not a big deal. Note that we also
>>>>         have the final error message on stderr done by the
>>>> WatcherThread.
>>>>
>>>>         >1502       VMError::interrupt_reporting_thread();
>>>>
>>>>         >Nit: you're inside a VMError function so shouldn't need the
>>>> VMError prefix.
>>>>
>>>>         Done.
>>>>
>>>>         >src/share/vm/utilities/vmError.hpp
>>>>
>>>>         >Typo: dependend
>>>>
>>>>         Done (comments changed).
>>>>
>>>>         > This doesn't quite read right to me. How about:..
>>>>
>>>>         Thanks for your suggestion, I took your version.
>>>>
>>>>         >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>>
>>>>         >I think using @requires and runtime Platform checks is
>>>> redundant. I prefer
>>>>         @requires for both debug build testing and not-windows.
>>>>
>>>>         Done. Thanks for your suggestion.
>>>>
>>>>         ---
>>>>
>>>>         Kind Regards, Thomas
>>>>
>>>>
>>>>         On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe
>>>>         <thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com>>
>>>> wrote:
>>>>
>>>>             Dear all,
>>>>
>>>>             please review (or re-review) this enhancement. It was
>>>>             already reviewed extensively for JDK9 in Oct 16, but
>>>>             pushed back to JDK10. Now that JDK10 repo is open, I
>>>>             would like to close this issue.
>>>>
>>>>             In short, this enhancement lets the hotspot cancel
>>>>             hanging error reporting steps (when writing the hs-err
>>>>             file), in order to allow subsequent error reporting steps
>>>>             to run. In our JVM variant, this has been active for a
>>>>             long time and has proven to make error reporting more
>>>>             stable in the face of deadlocks or file system stalls.
>>>>
>>>>             Issue:
>>>>             https://bugs.openjdk.java.net/browse/JDK-8166944
>>>>             <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>>>
>>>>             Webrev:
>>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.02/webrev/
>>>>             <http://cr.openjdk.java.net/%7
>>>> Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>>> >
>>>>
>>>>             Old Diskussion for JDK9:
>>>>             http://mail.openjdk.java.net/p
>>>> ipermail/hotspot-runtime-dev/2016-October/021582.html
>>>>             <http://mail.openjdk.java.net/
>>>> pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>>>
>>>>             The webrev is unchanged from version 2 discussed in
>>>>             october, just got rebased atop of JDK10/hs.
>>>>
>>>>             Thank you,
>>>>
>>>>             Kind Regards, Thomas
>>>>
>>>>
>>>>
>>>
>>>
>>
>>