RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.
David Holmes
david.holmes at oracle.com
Thu Oct 27 17:40:49 UTC 2016
Just picking up some a couple of specific discussion points ...
On 27/10/2016 4:09 PM, Thomas Stüfe wrote:
> On Wed, Oct 26, 2016 at 9:27 PM, Chris Plummer <chris.plummer at oracle.com
> <mailto:chris.plummer at oracle.com>> wrote:
>> But then, I had to check whether 64bit atomic stores/loads are
>> even supported by this platform (I actually did not find a 32bit
>> platform whithout 64bit atomics, but the comment in atomic.hpp is
>> pretty insistent and I did not want to risk regressions for other
>> platforms).
>>
>> Well, if no cx8 support was available, I pretty much just give up
>> and read and write timestamps directly. As I said, I am not sure
>> if this code path gets ever executed.
>>
>> Maybe I was overthinking all this and just reading and writing the
>> (C++ volatile) jlongs would have been enough, but I wanted to
>> prevent sporadic test errors because of incompletely read 64bit
>> values.
> Closed ports may not have cx8 support, although I don't believe any
> are being released with JDK9. Since you just have one writer and one
> reader, I think the only concern is word tearing on the read. For
> this reason you likely need the cx8 support. David would know, so
> hopefully he can comment on this.
>
> Assuming you need cx8 support, theoretically there are platforms
> where your code could fail due to not having cx8 support. You could
> argue that the risk of word tearing is minimal, both in likelihood
> of it happening (race condition on a platform we aren't currently
> officially supporting), and the possible negative behavior if it
> does (premature timeout or possibly no timeout, but only with debug
> builds after a crash).
>
> The other choice here is to just disable the whole timeout mechanism
> if cx8 is not supported. In fact simply making set_to_now() and
> get_timestamp() no-ops when cx8 is not supported would accomplish
> that, although I'd suggest also adding some more explicit disabling
> of the code wherever the timestamps are referenced.
>
>
> Another thing I thought of would be to change the timestamp to 32bit - I
> only need second resolution - and handle somehow the year 2038 overflow.
> But the most easy and pragmatic way is to either ignore the problem
> completely for non-cx8 or to do anything.
PPC32 did not support CX8, which is why this is present in the codebase.
(It is also present at the Java level too.)
All platforms the JVM runs on must supported 64-bit atomic loads and
stores by some means - to implement Java volatile long semantics. Even
platforms that don't support CX8 have some means to do this eg by using
FPU unit. But this isn't necessarily implemented in the Atomic class (it
wasn't for PPC32 because there were no calls to those methods in the VM).
The warnings in the atomic.hpp file were to avoid causally defining
different Atomic ops for jlongs, when they could not be implemented
efficiently on systems without CX8 support. So the onus was put back on
the user of the API to check this and define an alternative - rather
than, for example, forcing use of a global lock on such platforms.
Given you are only using Atomic::load/store I think you can dispense
with the supports_cx8 check, because, as I said, every platform must
have some means to support such atomic loads/stores. And we currently
don't have any ports that don't support CX8.
>> It's not clear to me why you changed a couple of os::sleep()
>> calls to os::naked_short_sleep(), and the rationale for the
>> sleep periods. Can you please explain?
>>
>>
>> Because os::sleep() does a lot of work under the hood and relies
>> on a bit of VM infrastructure. I think that is not a good idea in
>> error situations where potentially everything may be broken
>> already. You want to step lightly and really only do a naked
>> system sleep. About the sleep periods, os::naked_sleep has an
>> inbuilt maximum value of 1000ms, which I have to stay below to not
>> hit the assert. I did use 999ms as the longest interval I am
>> allowed to sleep nakedly. And after the timeout hit and before the
>> WatcherThread calls os::abort, I again sleep 200ms to give the
>> error reporter thread time to write the "error log aborted due to
>> timeout" into the error log and to flush the error log. Those
>> 200ms are just guesswork.
> Ok. I'd like to see someone else comment on the os::naked_sleep()
> use since it's not something I'm familiar enough with.
In this case, because we are in the WatcherThread, os::sleep will not
doing anything interesting that relies on other VM infrastructure
(modification of osThread wait-state only, calls to javaTimeNanos). For
the same reason changing to naked_sleep is also fine. We can/should
relax the assert in naked_sleep so that the sleep time is only limited
for JavaThreads.
Thanks,
David
More information about the hotspot-runtime-dev
mailing list