RFR(L) 8153224 Monitor deflation prolong safepoints (CR11/v2.11/14-for-jdk15)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Thu May 14 19:30:06 UTC 2020
Hi Erik,
I'm gonna trim this down a bit....
On 5/14/20 6:44 AM, Erik Österlund wrote:
> Hi Dan,
>
> On 2020-05-13 19:49, Daniel D. Daugherty wrote:
<snip>
>
>>
>>>
>>>>
>>>>> Maybe one could poke the contended counter,
>>>>> or find a convenient way of poisoning the displaced mark word so
>>>>> that the CAS installing it on JT1 fails.
>>>>> Anyway, there is some thinking to be done here I think.
>>>>
>>>> Agreed. More mulling is/was needed. Here's the revised block:
>>>>
>>>> } else if (mark.has_monitor()) {
>>>>   monitor = mark.monitor();
>>>>   temp = monitor->header();
>>>>   assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
>>>>          temp.value());
>>>>   hash = temp.hash();
>>>>   if (hash != 0) {
>>>>     // It has a hash.
>>>>     if (AsyncDeflateIdleMonitors &&
>>>>         monitor->is_being_async_deflated()) {
>>>>       // But we can't safely use the hash if we detect that async
>>>>       // deflation has occurred. So we attempt to restore the
>>>>       // header/dmw to the object's header so that we only retry
>>>>       // once if the deflater thread happens to be slow.
>>>>       install_displaced_markword_in_object(obj);
>>>>       continue;
>>>>     }
>>>>     return hash;
>>>>   }
>>>
>>> Yeah, something like that should do the trick. Although,
>>> unfortunately I think we have to sprinkle some awkward
>>> fencing in there to make sure the load of the displaced mark word
>>> and the loads in is_being_async_deflated()
>>> are not reordered.
>>
>> So is_being_async_deflated() looks like this right now:
>>
>> inline bool ObjectMonitor::is_being_async_deflated() {
>>   return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() &&
>>          contentions() < 0;
>> }
>>
>> contentions() uses an Atomic::load() as of the v2.11 patch. However,
>> owner_is_DEFLATER_MARKER() does not:
>>
>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>   return _owner == DEFLATER_MARKER;
>> }
>>
>> So I'm going to change it to:
>>
>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>   return Atomic::load(&_owner) == DEFLATER_MARKER;
>> }
>
> Good catch.
Thanks. I do have this unresolved bug:
JDK-8238174 migrate ObjectMonitor::_owner field away from C++
volatile semantics
https://bugs.openjdk.java.net/browse/JDK-8238174
but this fix seems like a good idea now.
>
>> It should have been that way when I created owner_is_DEFLATER_MARKER(),
>> but I was sloppy. Bad Dan!
>>
>> Here's how the dmw/header field is fetched:
>>
>> inline markWord ObjectMonitor::header() const {
>>   return Atomic::load(&_header);
>> }
>>
>>
>> So if all three of the players (_header, _owner, _contentions) are
>> loaded with Atomic::load(), doesn't that keep them from being
>> reordered?
>
> No, it only prevents word tearing and makes us feel a bit better when
> we sleep at night, knowing that we are ensuring it in the semantically
> right way rather than relying on volatile.
So you're saying that these operations in this order:
Atomic::load(&_header);
Atomic::load(&_owner);
Atomic::load(&_contentions);
can be reordered? (I'm not talking about stores here at all...)
> But we need a bigger hammer than that here.
>
>> Thinking more, that should keep the loads() from being reordered
>> relative to each other, but I haven't done anything with respect
>> to the stores... sigh...
>
> The stores that move the monitor towards being async deflated are all
> done with conservatively ordered atomics, which is fine.
Okay... I think...
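For reference, here is a minimal standalone sketch of the distinction being
drawn, written against plain std::atomic rather than HotSpot's Atomic /
OrderAccess so it compiles outside the VM (the names are just stand-ins for
_header, _owner and _contentions):

#include <atomic>
#include <cstdint>

// Stand-ins for ObjectMonitor::_header, _owner and _contentions.
static std::atomic<intptr_t> header{0};
static std::atomic<void*>    owner{nullptr};
static std::atomic<int>      contentions{0};

struct Snapshot { intptr_t h; void* o; int c; };

// Relaxed loads (the std::atomic analogue of Atomic::load()) are free of
// word tearing, but nothing keeps the compiler or the CPU from satisfying
// them in a different order than written.
Snapshot snapshot_unordered() {
  Snapshot s;
  s.h = header.load(std::memory_order_relaxed);
  s.o = owner.load(std::memory_order_relaxed);
  s.c = contentions.load(std::memory_order_relaxed);
  return s;
}

// The analogue of adding an OrderAccess::loadload() after the _header
// load: an acquire fence keeps the later loads from being satisfied
// before the earlier one on this thread.
Snapshot snapshot_ordered() {
  Snapshot s;
  s.h = header.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_acquire);
  s.o = owner.load(std::memory_order_relaxed);
  s.c = contentions.load(std::memory_order_relaxed);
  return s;
}

Even the ordered variant only constrains this thread's own view of the
loads, which is why the nMCA/IRIW discussion below still matters when the
two stores come from different threads.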
>
>>
>>> In other places, I have seen that it is being used after an atomic
>>> operation, hence not needing further fencing. But here there are only
>>> loads involved. They will not have a consistent global ordering, and
>>> the lack thereof can lead us back to the race still happening. The
>>> intuitive solution would be to put in an OrderAccess::loadload()
>>> right before asking if monitor->is_being_async_deflated(). However,
>>> a loadload on a non-multiple copy atomic (nMCA) machine does not
>>> protect against the loads happening in an inconsistent order when
>>> the stores are performed on different threads. In this case, if the
>>> current thread is A, then the displaced mark word could be set by a
>>> thread B, and the state in is_being_async_deflated() could be updated
>>> by a thread C. Let's say there is a fourth thread, D, that ends up in
>>> the same situation as thread A; then on an nMCA machine, even if
>>> there is a loadload() before calling is_being_async_deflated(), the
>>> ordering of the two loads could be different on thread A and thread
>>> D. This is the famous Independent Reads of Independent Writes (IRIW)
>>> scenario that is widely believed to not matter in practice as there
>>> are barely any algorithms where it matters. Here it does, because
>>> thread A might think that the racy hashCode is okay, while thread D
>>> says it is not okay. That means that one of the two reading threads
>>> may get an inconsistent hashCode value, unless there is a full fence
>>> before is_being_async_deflated(), depending on what side of the IRIW
>>> race they end up in. We need them both to always think it is *not*
>>> okay, and hence end up on the same side of the IRIW race.
>>>
>>> In order to synchronize this appropriately on an nMCA machine, you
>>> need a full fence() (instead of a loadload) between the load of the
>>> displaced mark word and the loads in is_being_async_deflated(). The
>>> fence will make sure that if you read the updated (non-zero hashCode)
>>> displaced markWord, then it will globally sync the stale memory view
>>> (caches) such that when the loads in is_being_async_deflated() are
>>> executed, they observe non-stale values. Normally you would need
>>> MO_SEQ_CST loads for all of these loads to really be able to reason
>>> about its correctness in a sane way, but we have the domain knowledge
>>> that it is only the flavour of race where the DMW is not stale but
>>> the is_being_async_deflated values are stale that gets us into
>>> trouble. So while a sane model would have leading fences on all the
>>> loads involved, we only need one. On normal platforms a loadload() is
>>> sufficient. You can check which one you are running on like this:
>>>
>>> // Long incomprehensible comment describing what on earth the point
>>> // of this synchronization voodoo is
>>> if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>   OrderAccess::fence();
>>> } else {
>>>   OrderAccess::loadload();
>>> }
>>>
>>> Or we could go for a different solution that is easier to reason
>>> about, if you prefer that.
>>
>> Based on what you're saying here, the first call to
>> is_being_async_deflated()
>> in FastHashCode() needs to have an OrderAccess::loadload() barrier to
>> ensure
>> proper ordering.
>
> Kind of. But I'm also saying that on e.g. PPC that is not enough and
> will still fail. There it needs a fence(),
> hence the filter.
Right. I was working thru the non-nMCA machine POV first. Sorry, I tend
not to worry about PPC or ARM, at least at first...
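For anyone else following along, here is a hedged standalone sketch of the
IRIW shape Erik describes, again using std::atomic so it can be built and
run outside the VM; x and y are made-up stand-ins for "the dmw got a hash"
and "the monitor is being async deflated":

#include <atomic>
#include <cstdio>
#include <thread>

// Independent Reads of Independent Writes (IRIW) litmus test.
std::atomic<int> x{0};   // written by "thread B"
std::atomic<int> y{0};   // written by "thread C"

int main() {
  int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

  std::thread tB([] { x.store(1, std::memory_order_seq_cst); });
  std::thread tC([] { y.store(1, std::memory_order_seq_cst); });

  // Reader A: loads x then y. With only acquire (loadload-style)
  // ordering, r1 == 1 && r2 == 0 is an allowed outcome.
  std::thread tA([&] {
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
  });

  // Reader D: loads y then x. With only acquire ordering, r3 == 1 &&
  // r4 == 0 is also allowed in the same run on an nMCA machine (e.g.
  // PPC), i.e. A and D disagree about which store happened first.
  // Making all four loads seq_cst (or putting a full fence between the
  // two loads on each reader) forbids that combined outcome.
  std::thread tD([&] {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
  });

  tA.join(); tB.join(); tC.join(); tD.join();
  std::printf("A saw x=%d y=%d; D saw y=%d x=%d\n", r1, r2, r3, r4);
  return 0;
}

A single run of this will not demonstrate anything by itself, of course;
the point is just to make concrete which outcome the fence()/seq_cst
variants rule out and a bare loadload() does not.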
>
>> The is_being_async_deflated() call in ObjectMonitor::enter() is okay
>> because it follows inc_contentions() which has a barrier. The second
>> is_being_async_deflated() call in FastHashCode() should be okay
>> because it is preceded by an Atomic::cmpxchg() call...
>>
>> For now, I'm going with this:
>>
>> temp = monitor->header();
>> assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
>>        temp.value());
>> hash = temp.hash();
>> if (hash != 0) {
>>   // It has a hash.
>>
>>   // Separate load of dmw/header above from the loads in
>>   // is_being_async_deflated(). Has to be re-examined for a
>>   // non-multiple copy atomic (nMCA) machine.
>>   OrderAccess::loadload();
>>   if (monitor->is_being_async_deflated()) {
>>
>> I've seen discussions about nMCA machines and the Independent Reads
>> of Independent
>> Writes (IRIW) scenario, but I don't know enough about it yet to be
>> comfortable
>> with putting an alternate fence() call in there...
>
> It will not work correctly unless we do that.
I wasn't disagreeing with that at all. I just needed another round
to think about it... and this where I'm at right now:
temp = monitor->header();
assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
       temp.value());
hash = temp.hash();
if (hash != 0) {
  // It has a hash.

  // Separate load of dmw/header above from the loads in
  // is_being_async_deflated().
  if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
    // A non-multiple copy atomic (nMCA) machine needs a bigger
    // hammer to make sure that the load above and the loads
    // below all see non-stale memory values.
    OrderAccess::fence();
  } else {
    OrderAccess::loadload();
  }
  if (monitor->is_being_async_deflated()) {
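(FWIW, the fence() arm here is the one that the nMCA case Erik mentions,
i.e. PPC, ends up taking; everything else just pays for the cheaper
loadload().)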
> The other option, if we want to prioritize readability and
> understandability over optimization, is
> to let the loads involved in the race be read with:
> NativeAccess<MO_SEQ_CST>::load(&_value)
>
> All sequentially consistent accesses are intuitively guaranteed to
> have a consistent global ordering,
> as if they happened one after the other, and we can reason about it
> "easily".
> As the conservative accesses that produce the values we are interested
> in are also (at least)
> sequentially consistent, we only need to upgrade the loads to be
> sequentially consistent to solve
> the race here.
>
> The good news with that solution is that it is more intuitive to
> reason about in the way
> our brains want to, without really having to know anything at all
> about nMCA machines; that
> is just an implementation detail of how to get seq_cst semantics on
> such platforms, handled
> by the access bindings.
>
> The bad news is that it will cost two unnecessary fences on PPC in
> this path, which might be
> quite costly. On TSO platforms, it will boil down to Atomic::load,
> which is optimal. On AArch64 it
> will yield a load_acquire, which is a bit suboptimal, but not too bad.
>
> I'll leave the choice up to you. But I insist that just a loadload()
> is not enough for semantic
> correctness of the algorithm, and will indeed cause races at least on
> PPC, causing different
> threads to get different hash codes.
Again, I wasn't disagreeing with that at all...
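For reference, the NativeAccess alternative Erik sketches would presumably
amount to something like this inside ObjectMonitor (a sketch only, not
something that has been prototyped or measured, and the exact spelling
against the access API would need to be double checked):

inline markWord ObjectMonitor::header() const {
  // Upgrade the racy reader-side load to seq_cst, per the
  // NativeAccess<MO_SEQ_CST>::load(&_value) suggestion above.
  return NativeAccess<MO_SEQ_CST>::load(&_header);
}

inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
  return NativeAccess<MO_SEQ_CST>::load(&_owner) == DEFLATER_MARKER;
}

// ... and similarly for contentions(). The deflater side already uses
// conservatively ordered atomics, so only the loads need upgrading:
// a plain load on TSO, a load_acquire on AArch64, and two fences on
// PPC in this path.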
>
>>
>>>
>>>>
>>>>> Thanks,
>>>>> /Erik
>>>>
>>>> Thanks for the thorough review! I always appreciate it when you
>>>> crawl through my code. :-)
>>>
>>> Thanks for moving this mountain closer to jdk-jdk!
>>
>> Hey now! This mountain has become much smaller as we've split things
>> off into sub-task after sub-task... :-)
>
> Indeed. :)
>
> Thanks,
> /Erik
Thanks for the thorough review!
Dan