RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Feb 23 07:39:35 UTC 2017
Thank you both! As always, I need a sponsor :)
..Thomas
On Thu, Feb 23, 2017 at 8:38 AM, David Holmes <david.holmes at oracle.com>
wrote:
> Looks good Thomas!
>
> Thanks,
> David
>
> On 23/02/2017 5:32 PM, Thomas Stüfe wrote:
>
>> Hi Chris,
>>
>> thank you for the detailed explanation, I modified TEST.groups in
>> hotspot accordingly. I modified the webrev.04 in place.
>>
>> Kind Regards, Thomas
>>
>> On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <chris.plummer at oracle.com
>> <mailto:chris.plummer at oracle.com>> wrote:
>>
>> On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>>
>>> Hi Chris, David,
>>>
>>> thanks for your patience. Let's wrap this up.
>>>
>>> New
>>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>> Error-Reporting/jdk10-webrev.04/webrev/
>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>> g-Error-Reporting/jdk10-webrev.04/webrev/>
>>>
>>> Delta to last
>>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>> Error-Reporting/jdk10-webrev.04-delta/webrev/
>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>> g-Error-Reporting/jdk10-webrev.04-delta/webrev/>
>>>
>>>
>>> Almost identical to the last version, only the typos Chris found
>>> were fixed.
>>>
>> Looks good!
>>
>>>
>>> @David:
>>>
>>> >The test is quite a slow one and so we would like to exclude it
>>> from the
>>> >hotspot_fast_runtime group in TEST.groups (there are other
>>> ErrorHandling tests >excluded there already).
>>>
>>> This is fine by me. Do you need me to do something?
>>>
>> What David said above. Look in hotspot/test/TEST.groups. Find
>> hotspot_fast_runtime. Add the test to the list of tests to not run
>> by putting a '-' in front of it. Currently there is:
>>
>> hotspot_fast_runtime = \
>> runtime/ \
>> -runtime/6626217/Test6626217.sh \
>> -runtime/7100935 \
>> -runtime/7158988/FieldMonitor.java \
>> -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>> -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>> -runtime/ConstantPool/IntfMethod.java \
>> -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>> -runtime/ErrorHandling/ErrorHandler.java \
>> -runtime/logging/MonitorMismatchTest.java \
>> -runtime/memory/ReserveMemory.java \
>> ...
>>
>> So that means all tests in the runtime directory are included in the
>> group, except the tests that follow with the '-' prefix. You can see
>> there are already a couple of ErrorHandling tests being excluded, so
>> just add yours after them.
>>
>> cheers,
>>
>> Chris
>>
>>
>>> Kind Regards, Thomas
>>>
>>>
>>>
>>> On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer
>>> <chris.plummer at oracle.com <mailto:chris.plummer at oracle.com>> wrote:
>>>
>>> Hi Tomas,
>>>
>>> vmError_windows.cpp: Getting closer. Should be "cancellation".
>>>
>>> vmError.cpp: Should be "let's"
>>>
>>> 389 // global timeout, lets execute the timeout step five
>>> times.
>>>
>>> And in response to your comment:
>>>
>>> I'm usually not a big fan of putting internals into a header,
>>>> so I tend to hide internals in a file-local static function.
>>>> But if this is the way you prefer it, this is fine too.
>>>>
>>>
>>> I'm no hotspot coding conventions expert, but I have
>>> frequently seen statics declared in the class scope (private
>>> or protected), and don't see much in the way of statics
>>> declared in .cpp files (although I see plenty of static helper
>>> functions in .cpp files). It seems more consistent to declare
>>> them in the class scope, but I could be wrong here. I'm not
>>> personally attached to one way or another.
>>>
>>> Other than that it looks good. No need for another review.
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
>>> On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>>
>>>> Hi Chris, David,
>>>>
>>>> Thanks for your reviews. I hope I addressed all of your
>>>> issues. See further details below.
>>>>
>>>> new webrev:
>>>>
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.03/webrev/
>>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>>> g-Error-Reporting/jdk10-webrev.03/webrev/>
>>>>
>>>> delta to the last one:
>>>>
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.03-delta/webrev/
>>>> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>>> g-Error-Reporting/jdk10-webrev.03-delta/webrev/>
>>>>
>>>>
>>>>
>>>> @Chris:
>>>>
>>>> > Most files need 2017 copyright updates.
>>>>
>>>> Done.
>>>>
>>>> >set_to_now() and get_timestamp() should have names that tie
>>>> them together
>>>> better. Maybe set_step_start_time() and
>>>> get_step_start_time(). Also, they >should probably be methods
>>>> of VMError and the statics should all be fields.
>>>>
>>>> Done. Added an accessor for every timeout value and a setter
>>>> too (named "record" to indicate that instead of passing a
>>>> value it uses the current timestamp).
>>>>
>>>> I'm usually not a big fan of putting internals into a header,
>>>> so I tend to hide internals in a file-local static function.
>>>> But if this is the way you prefer it, this is fine too.
>>>>
>>>> >In the first review I had mentioned that you only are allowing
>>>> for one
>>>> step to hang, and still be able to continue with subsequent
>>>> steps:
>>>>
>>>> >1509 // A step times out after half of the total timeout.
>>>> Steps are
>>>> mostly fast unless they
>>>> >1510 // hang for some reason, so this simple rule allows
>>>> for one
>>>> hanging step and still
>>>> >1511 // leaves time enough for the rest of the steps to
>>>> finish.
>>>> >1512 const jlong end = step_start_time_l + ErrorLogTimeout
>>>> *
>>>> 1000 / 2;
>>>> >
>>>> >You mentioned that for your SAP version you divide by 4,
>>>> allowing for up
>>>> to 3 steps to hang. Is there a reason not to do this for the
>>>> open version? In fact it >looks like
>>>> TestUnresponsiveErrorHandler is relying on that being the
>>>> case, since it will hang 3 times.
>>>>
>>>> I changed this to the value we use at SAP (make the little
>>>> timeouts a quarter of the final timeout value). I now need to
>>>> have five timeout-test-steps in the error handler instead of
>>>> the three earlier. I hid them in a nice macro, so I hope they
>>>> are not too much of an eye sore.
>>>>
>>>> >Related to this, TimeoutInErrorHandlingTest.java is >only
>>>> checking for
>>>> one instance of step timeout pattern. Probably it should
>>>> check for 3 instances, and >also one instance of the overall
>>>> timeout pattern:
>>>> >
>>>> >1272 st->print_cr("------ Timout during error
>>>> reporting after "
>>>> INT64_FORMAT "ms. ------",
>>>>
>>>> I intentionally left this test to be very basic - see the
>>>> comment in the test. Basically, I am afraid of giving you
>>>> guys a test which sporadically fails if the timing is bad.
>>>>
>>>> As David pointed out, the final error message cannot be
>>>> relied on - it should appear in 99% of all cases but may not
>>>> because error reporting thread and WatcherThread race. I
>>>> explicitly avoided adding synchronization there, because it
>>>> is better to sometimes miss output in the hs-err file than
>>>> for the VM to hang in error shutdown.
>>>>
>>>> For the same reason I do not test for all the little timeout
>>>> messages to appear in the error log. I now just test for two
>>>> messages of the theoretical three or four max, to see that
>>>> step timeout works repeatedly.
>>>>
>>>> @David:
>>>>
>>>> >src/os/windows/vm/vmError_windows.cpp
>>>>
>>>> >Typo: canceleation
>>>>
>>>> Done.
>>>>
>>>> >As previously discussed Atomic::load/store for jlong do not
>>>> depend on
>>>> supports_cx8. The commentary in atomic.hpp should really
>>>> refer to 'atomic read->modify-write operations on jlongs' -
>>>> as every platform must support atomic 64-bit load/store to
>>>> support Java volatile long/double. So set_to_now and
>>>> >get_timestamp can be simplified, or even removed completely.
>>>>
>>>> Sure! I removed the cx8 dependency, which makes the code
>>>> simpler, but decided to keep the accessor functions (renamed
>>>> as get_.._start_time and record_.._start_time)
>>>>
>>>> >Does the timestamp really need to be the system time in millis
>>>> or do
>>>> you just want to record relative elapsed time? If the latter
>>>> then I suggest using >javaTimeNanos()/100000 to avoid
>>>> potential glitches with the system time changing.
>>>>
>>>> Did this. This exchanges the possibility of time glitches
>>>> with possible initialization issues (see non-apple bsd path
>>>> and linux, e.g.) where the hr timer api is dynamically
>>>> loaded. But I agree, this is still better. The latter is less
>>>> probable to hurt, only if we crash at exactly the right time
>>>> during VM initialization.
>>>>
>>>> >382 // TestUnresponsiveErrorHandler: three times to trigger
>>>> first a
>>>> step timeout, then
>>>> >383 // the global error reporting timeout.
>>>>
>>>> >I'm a little confused as to why three steps are needed to test
>>>> two things??
>>>>
>>>> See my earlier explanations. Now we have the 1:4 ratio
>>>> between little/final timeout values, so we need five steps.
>>>>
>>>> >1272 st->print_cr("------ Timout during error
>>>> reporting after "
>>>> INT64_FORMAT "ms. ------",
>>>>
>>>> >Typo: timout
>>>>
>>>> Done.
>>>>
>>>> >Also I gather we are potentially racing with the
>>>> WatcherThread here, so no guarantee these final print
>>>> statements will be seen?
>>>>
>>>> Yes. I decided against synchronization here. The only point
>>>> of this timeout is to write a final message into the error
>>>> log. If this is lost, it is not a big deal. Note that we also
>>>> have the final error message on stderr done by the
>>>> WatcherThread.
>>>>
>>>> >1502 VMError::interrupt_reporting_thread();
>>>>
>>>> >Nit: you're inside a VMError function so shouldn't need the
>>>> VMError prefix.
>>>>
>>>> Done.
>>>>
>>>> >src/share/vm/utilities/vmError.hpp
>>>>
>>>> >Typo: dependend
>>>>
>>>> Done (comments changed).
>>>>
>>>> > This doesn't quite read right to me. How about:..
>>>>
>>>> Thanks for your suggestion, I took your version.
>>>>
>>>> >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>>
>>>> >I think using @requires and runtime Platform checks is
>>>> redundant. I prefer
>>>> @requires for both debug build testing and not-windows.
>>>>
>>>> Done. Thanks for your suggestion.
>>>>
>>>> ---
>>>>
>>>> Kind Regards, Thomas
>>>>
>>>>
>>>> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe
>>>> <thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com>>
>>>> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> please review (or re-review) this enhancement. It was
>>>> already reviewed extensively for JDK9 in Oct 16, but
>>>> pushed back to JDK10. Now that JDK10 repo is open, I
>>>> would like to close this issue.
>>>>
>>>> In short, this enhancement lets the hotspot cancel
>>>> hanging error reporting steps (when writing the hs-err
>>>> file), in order to allow subsequent error reporting steps
>>>> to run. In our JVM variant, this has been active for a
>>>> long time and has proven to make error reporting more
>>>> stable in the face of deadlocks or file system stalls.
>>>>
>>>> Issue:
>>>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>>> <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.02/webrev/
>>>> <http://cr.openjdk.java.net/%7
>>>> Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>>> >
>>>>
>>>> Old Diskussion for JDK9:
>>>> http://mail.openjdk.java.net/p
>>>> ipermail/hotspot-runtime-dev/2016-October/021582.html
>>>> <http://mail.openjdk.java.net/
>>>> pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>>>
>>>> The webrev is unchanged from version 2 discussed in
>>>> october, just got rebased atop of JDK10/hs.
>>>>
>>>> Thank you,
>>>>
>>>> Kind Regards, Thomas
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
More information about the hotspot-runtime-dev
mailing list