RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe thomas.stuefe at gmail.com
Thu Feb 23 07:32:39 UTC 2017


Hi Chris,

thank you for the detailed explanation, I modified TEST.groups in hotspot
accordingly. I modified the webrev.04 in place.

Kind Regards, Thomas

On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <chris.plummer at oracle.com>
wrote:

> On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>
> Hi Chris, David,
>
> thanks for your patience. Let's wrap this up.
>
> New webrev: http://cr.openjdk.java.net/~stuefe/webrevs/
> 8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/
>
> Delta to last webrev: http://cr.openjdk.java.net/~stuefe/webrevs/
> 8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/
>
> Almost identical to the last version, only the typos Chris found were
> fixed.
>
> Looks good!
>
>
> @David:
>
> >The test is quite a slow one and so we would like to exclude it from the
> >hotspot_fast_runtime group in TEST.groups (there are other ErrorHandling
> tests >excluded there already).
>
> This is fine by me. Do you need me to do something?
>
> What David said above. Look in hotspot/test/TEST.groups. Find
> hotspot_fast_runtime. Add the test to the list of tests to not run by
> putting a '-' in front of it. Currently there is:
>
> hotspot_fast_runtime = \
>   runtime/ \
>  -runtime/6626217/Test6626217.sh \
>  -runtime/7100935 \
>  -runtime/7158988/FieldMonitor.java \
>  -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>  -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>  -runtime/ConstantPool/IntfMethod.java \
>  -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>  -runtime/ErrorHandling/ErrorHandler.java \
>  -runtime/logging/MonitorMismatchTest.java \
>  -runtime/memory/ReserveMemory.java \
> ...
>
> So that means all tests in the runtime directory are included in the
> group, except the tests that follow with the '-' prefix. You can see there
> are already a couple of ErrorHandling tests being excluded, so just add
> yours after them.
>
> cheers,
>
> Chris
>
>
> Kind Regards, Thomas
>
>
>
> On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer <chris.plummer at oracle.com>
> wrote:
>
>> Hi Tomas,
>>
>> vmError_windows.cpp: Getting closer. Should be "cancellation".
>>
>> vmError.cpp: Should be "let's"
>>
>>  389   // global timeout, lets execute the timeout step five times.
>>
>> And in response to your comment:
>>
>> I'm usually not a big fan of putting internals into a header, so I tend
>> to hide internals in a file-local static function. But if this is the way
>> you prefer it, this is fine too.
>>
>>
>> I'm no hotspot coding conventions expert, but I have frequently seen
>> statics declared in the class scope (private or protected), and don't see
>> much in the way of statics declared in .cpp files (although I see plenty of
>> static helper functions in .cpp files). It seems more consistent to declare
>> them in the class scope, but I could be wrong here. I'm not personally
>> attached to one way or another.
>>
>> Other than that it looks good. No need for another review.
>>
>> thanks,
>>
>> Chris
>>
>>
>> On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>
>> Hi Chris, David,
>>
>> Thanks for your reviews. I hope I addressed all of your issues. See
>> further details below.
>>
>> new webrev:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.03/webrev/
>>
>> delta to the last one:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.03-delta/webrev/
>>
>>
>> @Chris:
>>
>> > Most files need 2017 copyright updates.
>>
>> Done.
>>
>> >set_to_now() and get_timestamp() should have names that tie them
>> together better. Maybe set_step_start_time() and get_step_start_time().
>> Also, they >should probably be methods of VMError and the statics should
>> all be fields.
>>
>> Done. Added an accessor for every timeout value and a setter too (named
>> "record" to indicate that instead of passing a value it uses the current
>> timestamp).
>>
>> I'm usually not a big fan of putting internals into a header, so I tend
>> to hide internals in a file-local static function. But if this is the way
>> you prefer it, this is fine too.
>>
>> >In the first review I had mentioned that you only are allowing for one
>> step to hang, and still be able to continue with subsequent steps:
>>
>> >1509     // A step times out after half of the total timeout. Steps are
>> mostly fast unless they
>> >1510     // hang for some reason, so this simple rule allows for one
>> hanging step and still
>> >1511     // leaves time enough for the rest of the steps to finish.
>> >1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 /
>> 2;
>> >
>> >You mentioned that for your SAP version you divide by 4, allowing for up
>> to 3 steps to hang. Is there a reason not to do this for the open version?
>> In fact it >looks like TestUnresponsiveErrorHandler is relying on that
>> being the case, since it will hang 3 times.
>>
>> I changed this to the value we use at SAP (make the little timeouts a
>> quarter of the final timeout value). I now need to have five
>> timeout-test-steps in the error handler instead of the three earlier. I hid
>> them in a nice macro, so I hope they are not too much of an eye sore.
>>
>> >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
>> one instance of step timeout pattern. Probably it should check for 3
>> instances, and >also one instance of the overall timeout pattern:
>> >
>> >1272         st->print_cr("------ Timout during error reporting after "
>> INT64_FORMAT "ms. ------",
>>
>> I intentionally left this test to be very basic - see the comment in the
>> test. Basically, I am afraid of giving you guys a test which sporadically
>> fails if the timing is bad.
>>
>>  As David pointed out, the final error message cannot be relied on - it
>> should appear in 99% of all cases but may not because error reporting
>> thread and WatcherThread race. I explicitly avoided adding synchronization
>> there, because it is better to sometimes miss output in the hs-err file
>> than for the VM to hang in error shutdown.
>>
>> For the same reason I do not test for all the little timeout messages to
>> appear in the error log. I now just test for two messages of the
>> theoretical three or four max, to see that step timeout works repeatedly.
>>
>> @David:
>>
>> >src/os/windows/vm/vmError_windows.cpp
>>
>> >Typo: canceleation
>>
>> Done.
>>
>> >As previously discussed Atomic::load/store for jlong do not depend on
>> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
>> read->modify-write operations on jlongs' - as every platform must support
>> atomic 64-bit load/store to support Java volatile long/double. So
>> set_to_now and >get_timestamp can be simplified, or even removed completely.
>>
>> Sure! I removed the cx8 dependency, which makes the code simpler, but
>> decided to keep the accessor functions (renamed as get_.._start_time and
>> record_.._start_time)
>>
>> >Does the timestamp really need to be the system time in millis or do you
>> just want to record relative elapsed time? If the latter then I suggest
>> using >javaTimeNanos()/100000 to avoid potential glitches with the system
>> time changing.
>>
>> Did this. This exchanges the possibility of time glitches with possible
>> initialization issues (see non-apple bsd path and linux, e.g.) where the hr
>> timer api is dynamically loaded. But I agree, this is still better. The
>> latter is less probable to hurt, only if we crash at exactly the right time
>> during VM initialization.
>>
>> >382   // TestUnresponsiveErrorHandler: three times to trigger first a
>> step timeout, then
>> >383   // the global error reporting timeout.
>>
>> >I'm a little confused as to why three steps are needed to test two
>> things??
>>
>> See my earlier explanations. Now we have the 1:4 ratio between
>> little/final timeout values, so we need five steps.
>>
>> >1272         st->print_cr("------ Timout during error reporting after "
>> INT64_FORMAT "ms. ------",
>>
>> >Typo: timout
>>
>> Done.
>>
>> >Also I gather we are potentially racing with the WatcherThread here, so
>> no guarantee these final print statements will be seen?
>>
>> Yes. I decided against synchronization here. The only point of this
>> timeout is to write a final message into the error log. If this is lost, it
>> is not a big deal. Note that we also have the final error message on stderr
>> done by the WatcherThread.
>>
>> >1502       VMError::interrupt_reporting_thread();
>>
>> >Nit: you're inside a VMError function so shouldn't need the VMError
>> prefix.
>>
>> Done.
>>
>> >src/share/vm/utilities/vmError.hpp
>>
>> >Typo: dependend
>>
>> Done (comments changed).
>>
>> > This doesn't quite read right to me. How about:..
>>
>> Thanks for your suggestion, I took your version.
>>
>> >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>> >I think using @requires and runtime Platform checks is redundant. I
>> prefer @requires for both debug build testing and not-windows.
>>
>> Done. Thanks for your suggestion.
>>
>> ---
>>
>> Kind Regards, Thomas
>>
>>
>> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
>> wrote:
>>
>>> Dear all,
>>>
>>> please review (or re-review) this enhancement. It was already reviewed
>>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>>> repo is open, I would like to close this issue.
>>>
>>> In short, this enhancement lets the hotspot cancel hanging error
>>> reporting steps (when writing the hs-err file), in order to allow
>>> subsequent error reporting steps to run. In our JVM variant, this has been
>>> active for a long time and has proven to make error reporting more stable
>>> in the face of deadlocks or file system stalls.
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>>> rror-Reporting/jdk10-webrev.02/webrev/
>>>
>>> Old Diskussion for JDK9:
>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>>> 016-October/021582.html
>>>
>>> The webrev is unchanged from version 2 discussed in october, just got
>>> rebased atop of JDK10/hs.
>>>
>>> Thank you,
>>>
>>> Kind Regards, Thomas
>>>
>>
>>
>>
>
>


More information about the hotspot-runtime-dev mailing list