RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.
Dmitry Dmitriev
dmitry.dmitriev at oracle.com
Tue Feb 7 11:41:56 UTC 2017
Hello Thomas,
Sorry for the late review. Few small comments about test:
1) Can you put FileInputStream and BufferedReader usage in the
try-with-resources block and remove br.close() on line 124?
2) Replace several empty lines with the one line: lines 45-46, lines 105-107
3) Remove last empty line(135)
Thank you!
Dmitry
On 06.02.2017 14:18, Thomas Stüfe wrote:
> Hi Chris, David,
>
> Thanks for your reviews. I hope I addressed all of your issues. See further
> details below.
>
> new webrev:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>
> delta to the last one:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>
>
> @Chris:
>
>> Most files need 2017 copyright updates.
> Done.
>
>> set_to_now() and get_timestamp() should have names that tie them together
> better. Maybe set_step_start_time() and get_step_start_time(). Also, they
>> should probably be methods of VMError and the statics should all be fields.
> Done. Added an accessor for every timeout value and a setter too (named
> "record" to indicate that instead of passing a value it uses the current
> timestamp).
>
> I'm usually not a big fan of putting internals into a header, so I tend to
> hide internals in a file-local static function. But if this is the way you
> prefer it, this is fine too.
>
>> In the first review I had mentioned that you only are allowing for one
> step to hang, and still be able to continue with subsequent steps:
>
>> 1509 // A step times out after half of the total timeout. Steps are
> mostly fast unless they
>> 1510 // hang for some reason, so this simple rule allows for one
> hanging step and still
>> 1511 // leaves time enough for the rest of the steps to finish.
>> 1512 const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>>
>> You mentioned that for your SAP version you divide by 4, allowing for up
> to 3 steps to hang. Is there a reason not to do this for the open version?
> In fact it >looks like TestUnresponsiveErrorHandler is relying on that
> being the case, since it will hang 3 times.
>
> I changed this to the value we use at SAP (make the little timeouts a
> quarter of the final timeout value). I now need to have five
> timeout-test-steps in the error handler instead of the three earlier. I hid
> them in a nice macro, so I hope they are not too much of an eye sore.
>
>> Related to this, TimeoutInErrorHandlingTest.java is >only checking for one
> instance of step timeout pattern. Probably it should check for 3 instances,
> and >also one instance of the overall timeout pattern:
>> 1272 st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> I intentionally left this test to be very basic - see the comment in the
> test. Basically, I am afraid of giving you guys a test which sporadically
> fails if the timing is bad.
>
> As David pointed out, the final error message cannot be relied on - it
> should appear in 99% of all cases but may not because error reporting
> thread and WatcherThread race. I explicitly avoided adding synchronization
> there, because it is better to sometimes miss output in the hs-err file
> than for the VM to hang in error shutdown.
>
> For the same reason I do not test for all the little timeout messages to
> appear in the error log. I now just test for two messages of the
> theoretical three or four max, to see that step timeout works repeatedly.
>
> @David:
>
>> src/os/windows/vm/vmError_windows.cpp
>> Typo: canceleation
> Done.
>
>> As previously discussed Atomic::load/store for jlong do not depend on
> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
> read->modify-write operations on jlongs' - as every platform must support
> atomic 64-bit load/store to support Java volatile long/double. So
> set_to_now and >get_timestamp can be simplified, or even removed completely.
>
> Sure! I removed the cx8 dependency, which makes the code simpler, but
> decided to keep the accessor functions (renamed as get_.._start_time and
> record_.._start_time)
>
>> Does the timestamp really need to be the system time in millis or do you
> just want to record relative elapsed time? If the latter then I suggest
> using >javaTimeNanos()/100000 to avoid potential glitches with the system
> time changing.
>
> Did this. This exchanges the possibility of time glitches with possible
> initialization issues (see non-apple bsd path and linux, e.g.) where the hr
> timer api is dynamically loaded. But I agree, this is still better. The
> latter is less probable to hurt, only if we crash at exactly the right time
> during VM initialization.
>
>> 382 // TestUnresponsiveErrorHandler: three times to trigger first a step
> timeout, then
>> 383 // the global error reporting timeout.
>> I'm a little confused as to why three steps are needed to test two things??
> See my earlier explanations. Now we have the 1:4 ratio between little/final
> timeout values, so we need five steps.
>
>> 1272 st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
>> Typo: timout
> Done.
>
>> Also I gather we are potentially racing with the WatcherThread here, so no
> guarantee these final print statements will be seen?
>
> Yes. I decided against synchronization here. The only point of this timeout
> is to write a final message into the error log. If this is lost, it is not
> a big deal. Note that we also have the final error message on stderr done
> by the WatcherThread.
>
>> 1502 VMError::interrupt_reporting_thread();
>> Nit: you're inside a VMError function so shouldn't need the VMError prefix.
> Done.
>
>> src/share/vm/utilities/vmError.hpp
>> Typo: dependend
> Done (comments changed).
>
>> This doesn't quite read right to me. How about:..
> Thanks for your suggestion, I took your version.
>
>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>> I think using @requires and runtime Platform checks is redundant. I prefer
> @requires for both debug build testing and not-windows.
>
> Done. Thanks for your suggestion.
>
> ---
>
> Kind Regards, Thomas
>
>
> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error reporting
>> steps (when writing the hs-err file), in order to allow subsequent error
>> reporting steps to run. In our JVM variant, this has been active for a long
>> time and has proven to make error reporting more stable in the face of
>> deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.02/webrev/
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
More information about the hotspot-runtime-dev
mailing list