RFR(L) 8153224 Monitor deflation prolong safepoints (CR11/v2.11/14-for-jdk15)

Erik Österlund erik.osterlund at oracle.com
Fri May 15 11:07:24 UTC 2020


Hi Dan,

Sorry, I stared at the code a bit longer and found one more IRIW bug and 
one normal fencing bug.

The two loads in this IRIW race are _contentions in is_async_deflating() 
and _header in the subsequent install_displaced_markword_in_object().
On an nMCA machine, these two loads *must* be separated by a full fence. 
They observe values written by two independent threads to two
independent addresses. Yet it is very important that all observers see 
those two stores happen in the same total order.

To be concrete, the problem is code like this:

if (is_async_deflating()) {
   install_displaced_markword_in_object();
}

In is_async_deflating() we read _contentions and see that we are 
deflating, and in install_displaced_markword_in_object() we read
the _header that will survive into the markWord. These loads must 
observe a consistent total order, and hence need a full fence() between
them on nMCA machines.
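
For illustration, here is a rough sketch of the FastHashCode() call site 
with both separations in place, reusing the 
support_IRIW_for_not_multiple_copy_atomic_cpu filter discussed further 
down in this thread. It is only meant to show where the fences have to 
sit (names follow the patch under review), not to be the final code:

      temp = monitor->header();
      hash = temp.hash();
      if (hash != 0) {
        // It has a hash.
        // Separation already discussed below: keep the dmw/header load
        // above from reordering with the _owner/_contentions loads below.
        if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
          OrderAccess::fence();
        } else {
          OrderAccess::loadload();
        }
        if (monitor->is_being_async_deflated()) {
          // The new separation (this bug): the _owner/_contentions loads
          // above and the _header load inside
          // install_displaced_markword_in_object() read values written by
          // two independent threads, so all observers must see those
          // stores in the same order.
          if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
            OrderAccess::fence();
          } else {
            OrderAccess::loadload();
          }
          monitor->install_displaced_markword_in_object(obj);
          continue;
        }
        return hash;
      }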

Let's say that one observer thread A reads an updated hashCode. It can 
then read _contentions either right before it gets updated
or right after it gets updated in total order. If it reads it right 
before it gets updated, then that hashCode will be returned.
Therefore, whichever other thread performs a successful 
install_displaced_markword_in_object() operation *has to* catch the 
updated DMW that A observed, containing the new hashCode, and install 
it in the header. If it does not, threads will end up retrieving different
hashCodes. Therefore, it is crucial that the load of the DMW on the 
thread performing install_displaced_markword_in_object() is also
seq_cst or otherwise deals with IRIW somehow (another fancy fence 
construction before reading header() in 
install_displaced_markword_in_object).
Today it is a relaxed load, which is exposed to IRIW craziness. This 
allows the thread calling install_displaced_markword_in_object()
to read a stale version of the DMW, and hence install the not yet 
updated hashCode back into the markWord, despite the updated DMW
having been read by the unfortunate observer A. Another observer D will 
possibly read a different hashCode.
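
Equivalently, the fence could live inside 
install_displaced_markword_in_object() itself, right before the header() 
read; the seq_cst-load option would instead make that load sequentially 
consistent. A rough sketch of the in-function variant (signature as in 
the patch under review, body heavily elided):

      void ObjectMonitor::install_displaced_markword_in_object(const oop obj) {
        // The dmw/header load below must not be stale relative to the
        // _owner/_contentions values already observed; on an nMCA machine
        // that takes a full fence, elsewhere a loadload() barrier.
        if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
          OrderAccess::fence();
        } else {
          OrderAccess::loadload();
        }
        markWord dmw = header();  // Atomic::load(&_header) underneath
        assert(dmw.is_neutral(), "invariant: header=" INTPTR_FORMAT, dmw.value());
        // ... unchanged from here: try to install dmw into obj's mark word,
        // tolerating losing the race to another restoring thread ...
      }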

Oh and we probably need a normal loadload() between reading _owner and 
_contentions in the is_async_deflating function. If those loads reorder,
we could spuriously read an old value for _owner (not deflating) and a 
new value for _contentions, leading us to believe it isn't deflating
even though it is. But that is easier to reason about because the loads 
are reading values written by the same thread.
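
A sketch of what that could look like, based on the v2.11 accessors 
quoted below (one possible shape, not necessarily the final one):

      inline bool ObjectMonitor::is_being_async_deflated() {
        if (!AsyncDeflateIdleMonitors) {
          return false;
        }
        bool owner_is_deflater = owner_is_DEFLATER_MARKER();  // Atomic::load(&_owner)
        // Keep the _owner load from reordering with the _contentions load;
        // the deflater's stores to these fields are conservatively ordered,
        // so ordering the loads here closes the race described above.
        OrderAccess::loadload();
        return owner_is_deflater && contentions() < 0;       // Atomic::load(&_contentions)
      }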

I will hold off on formulating fat comments explaining what on earth this 
is all about until we have decided whether we want the
solution to be making the loads seq_cst, inserting fancy 
incomprehensible IRIW fences, or something completely different.

Sorry for the trouble.

/Erik

On 2020-05-14 23:01, Daniel D. Daugherty wrote:
> On 5/14/20 4:58 PM, Erik Österlund wrote:
>> Hi Dan,
>>
>>> On 14 May 2020, at 21:30, Daniel D. Daugherty 
>>> <daniel.daugherty at oracle.com> wrote:
>>>
>>> Hi Erik,
>>>
>>> I'm gonna trim this down a bit....
>>>
>>>> On 5/14/20 6:44 AM, Erik Österlund wrote:
>>>> Hi Dan,
>>>>
>>>> On 2020-05-13 19:49, Daniel D. Daugherty wrote:
>>> <snip>
>>>>>>>> Maybe one could poke the contended counter,
>>>>>>>> or find a convenient way of poisoning the displaced mark word 
>>>>>>>> so that the CAS installing it on JT1 fails.
>>>>>>>> Anyway, there is some thinking to be done here I think.
>>>>>>> Agreed. More mulling is/was needed. Here's the revised block:
>>>>>>>
>>>>>>>      } else if (mark.has_monitor()) {
>>>>>>>        monitor = mark.monitor();
>>>>>>>        temp = monitor->header();
>>>>>>>        assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
>>>>>>>        hash = temp.hash();
>>>>>>>        if (hash != 0) {
>>>>>>>          // It has a hash.
>>>>>>>          if (AsyncDeflateIdleMonitors && monitor->is_being_async_deflated()) {
>>>>>>>            // But we can't safely use the hash if we detect that async
>>>>>>>            // deflation has occurred. So we attempt to restore the
>>>>>>>            // header/dmw to the object's header so that we only retry
>>>>>>>            // once if the deflater thread happens to be slow.
>>>>>>>            install_displaced_markword_in_object(obj);
>>>>>>>            continue;
>>>>>>>          }
>>>>>>>          return hash;
>>>>>>>        }
>>>>>> Yeah, something like that should do the trick. Although, 
>>>>>> unfortunately I think we have to sprinkle some awkward
>>>>>> fencing in there to make sure the load of the displaced mark word 
>>>>>> and the loads in is_being_async_deflated()
>>>>>> are not reordered.
>>>>> So is_being_async_deflated() looks like this right now:
>>>>>
>>>>> inline bool ObjectMonitor::is_being_async_deflated() {
>>>>>    return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() && contentions() < 0;
>>>>> }
>>>>>
>>>>> contentions() uses an Atomic::load() as of the v2.11 patch. However,
>>>>> owner_is_DEFLATER_MARKER() does not:
>>>>>
>>>>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>>>>    return _owner == DEFLATER_MARKER;
>>>>> }
>>>>>
>>>>> So I'm going to change it to:
>>>>>
>>>>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>>>>    return Atomic::load(&_owner) == DEFLATER_MARKER;
>>>>> }
>>>> Good catch.
>>> Thanks. I do have this unresolved bug:
>>>
>>>      JDK-8238174 migrate ObjectMonitor::_owner field away from C++ 
>>> volatile semantics
>>>      https://bugs.openjdk.java.net/browse/JDK-8238174
>>>
>>> but this fix seems like a good idea now.
>> Yup.
>>
>>>>> It should have been that way when I created 
>>>>> owner_is_DEFLATER_MARKER(),
>>>>> but I was sloppy. Bad Dan!
>>>>>
>>>>> Here's how the dmw/header field is fetched:
>>>>>
>>>>> inline markWord ObjectMonitor::header() const {
>>>>>    return Atomic::load(&_header);
>>>>> }
>>>>>
>>>>>
>>>>> So if all three of the players (_header, _owner, _contentions) are
>>>>> loaded with Atomic::load(), doesn't that keep them from being
>>>>> reordered?
>>>> No, it only prevents word tearing and makes us feel a bit better 
>>>> when we sleep at night,
>>>> knowing that we are ensuring it in the semantically right way, 
>>>> rather than relying on volatile.
>>> So you're saying that these operations in this order:
>>>
>>>      Atomic::load(&_header);
>>>      Atomic::load(&_owner);
>>>      Atomic::load(&_contentions);
>>>
>>> can be reordered? (I'm not talking about stores here at all...)
>> Yes, they can and will reorder, except for TSO machines like SPARC 
>> and x86 where they will not reorder.
>>
>>>> But we need a bigger hammer than that here.
>>>>
>>>>> Thinking more, that should keep the loads() from being reordered
>>>>> relative to each other, but I haven't done anything with respect
>>>>> to the stores... sigh...
>>>> The stores that move the monitor towards being async deflated are 
>>>> all updated with conservatively
>>>> ordered atomics, which is fine.
>>> Okay... I think...
>> My point is that contention decrements and the owner flip to the 
>> deflating marker are conservatively ordered atomic operations. There 
>> are other stores going on to these fields, but they are not relevant 
>> for the race.
>>
>>>>>> In other places, I have seen that it is being used after an 
>>>>>> atomic operation, hence not needing
>>>>>> further fencing. But here there are only loads involved. They 
>>>>>> will not have a consistent global ordering, and the lack
>>>>>> thereof can lead us back to the race still happening. The 
>>>>>> intuitive solution would be to put in an OrderAccess::loadload()
>>>>>> right before asking if monitor->is_being_async_deflated(). 
>>>>>> However, a loadload on non-multiple copy atomic (nMCA)
>>>>>> machine, does not protect against the loads happening in an 
>>>>>> inconsistent order, when the stores are performed
>>>>>> on different threads. In this case, if the current thread is A, 
>>>>>> then the displaced mark word could be set by a thread
>>>>>> B, and the state in is_being_async_deflated() could be updated by 
>>>>>> a thread C. Let's say there is a fourth thread,
>>>>>> D, that ends up in the same situation as thread A, then on an nMCA 
>>>>>> machine, even if there is a loadload() before calling
>>>>>> is_being_async_deflated(), the ordering of the two loads could be 
>>>>>> different on thread A and thread D. This is the famous
>>>>>> Independent Reads of Independent Writes (IRIW) scenario that is 
>>>>>> widely believed to not matter in practice as there are
>>>>>> barely any algorithms where it matters. Here it does, because 
>>>>>> thread A might think that the racy hashCode is okay,
>>>>>> while thread C says it is not okay. That means that one of the 
>>>>>> two reading threads may get an inconsistent hashCode
>>>>>> value, unless there is a full fence before 
>>>>>> is_being_async_deflated(), depending on what side of the IRIW 
>>>>>> race they
>>>>>> end up in. We need them both to always think it is *not* okay, 
>>>>>> and hence end up on the same side of the IRIW race.
>>>>>>
>>>>>> In order to synchronize this appropriately on an nMCA machine, you 
>>>>>> need a full fence() (instead of a loadload) between
>>>>>> the load of the displaced mark word and the loads in 
>>>>>> is_being_async_deflated(). The fence will make sure that if you
>>>>>> read the updated (non-zero hashCode) displaced markWord, then it 
>>>>>> will globally sync the stale memory view (caches) such that when
>>>>>> the loads in is_being_async_deflated() are being executed, they 
>>>>>> observe non-stale values. Normally you would
>>>>>> need MO_SEQ_CST loads for all of these loads to really be able to 
>>>>>> reason about its correctness in a sane way. But we have the domain 
>>>>>> knowledge that it is only the flavour of the race where the DMW is 
>>>>>> not stale but the is_being_async_deflated() values are that gets us 
>>>>>> into trouble. So while a sane model 
>>>>>> would have leading fences on all the loads involved, we
>>>>>> only need one. On normal platforms a loadload() is sufficient. 
>>>>>> You can check which one you are running on like this:
>>>>>>
>>>>>> // Long incomprehensible comment describing what on earth the point
>>>>>> // of this synchronization voodoo is
>>>>>> if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>>>>    OrderAccess::fence();
>>>>>> } else {
>>>>>>    OrderAccess::loadload();
>>>>>> }
>>>>>>
>>>>>> Or we could go for a different solution that is easier to reason 
>>>>>> about, if you prefer that.
>>>>> Based on what you're saying here, the first call to 
>>>>> is_being_async_deflated()
>>>>> in FastHashCode() needs to have an OrderAccess::loadload() barrier 
>>>>> to ensure
>>>>> proper ordering.
>>>> Kind of. But I'm also saying that on e.g. PPC that is not enough 
>>>> and will still fail. There it needs a fence(),
>>>> hence the filter.
>>> Right. I was working thru the non-nMCA machine POV first. Sorry, I tend
>>> not to worry about PPC or ARM, at least at first...
>> Makes sense!
>>
>>>>> The is_being_async_deflated() call in ObjectMonitor::enter() is 
>>>>> okay because
>>>>> it follows inc_contentions() which has a barrier. The second call to
>>>>> is_being_async_deflated() in FastHashCode should be okay 
>>>>> because it
>>>>> is preceded by an Atomic::cmpxchg() call...
>>>>>
>>>>> For now, I'm going with this:
>>>>>
>>>>>        temp = monitor->header();
>>>>>        assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
>>>>>        hash = temp.hash();
>>>>>        if (hash != 0) {
>>>>>          // It has a hash.
>>>>>
>>>>>          // Separate load of dmw/header above from the loads in
>>>>>          // is_being_async_deflated(). Has to be re-examined for a
>>>>>          // non-multiple copy atomic (nMCA) machine.
>>>>>          OrderAccess::loadload();
>>>>>          if (monitor->is_being_async_deflated()) {
>>>>>
>>>>> I've seen discussions about nMCA machines and the Independent 
>>>>> Reads of Independent
>>>>> Writes (IRIW) scenario, but I don't know enough about it yet to be 
>>>>> comfortable
>>>>> with putting an alternate fence() call in there...
>>>> It will not work correctly unless we do that.
>>> I wasn't disagreeing with that at all. I just needed another round
>>> to think about it... and this where I'm at right now:
>>>
>>>        temp = monitor->header();
>>>        assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
>>>        hash = temp.hash();
>>>        if (hash != 0) {
>>>          // It has a hash.
>>>
>>>          // Separate load of dmw/header above from the loads in
>>>          // is_being_async_deflated().
>>>          if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>            // A non-multiple copy atomic (nMCA) machine needs a bigger
>>>            // hammer to make sure that the load above and the loads
>>>            // below all see non-stale memory values.
>>>            OrderAccess::fence();
>>>          } else {
>>>            OrderAccess::loadload();
>>>          }
>>>          if (monitor->is_being_async_deflated()) {
>> That looks good. Note though that the comment is not quite right; we 
>> assume that the first load may be stale, but the subsequent ones we 
>> separate out may not. The reason is that a stale hash code is 0, 
>> which will simply trigger a retry, which is fine. So the proposed 
>> fencing does not guarantee the first load isn’t stale, but it is okay 
>> if it is algorithmically. Yet the subsequent loads must not be stale 
>> w.r.t. the first load, because then bad things happen.
>
> I'm prepping the next review round now. Please comment on the wording
> there and provide a clear suggested change.
>
> Dan
>
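
(For the next round: a possible wording for that comment, just as a 
starting point, could be along these lines.)

        // Separate the load of the dmw/header above from the loads in
        // is_being_async_deflated() below. The dmw/header load itself may
        // be stale; a stale hash code is 0, which simply triggers a retry,
        // and that is fine. But the loads below must not be stale relative
        // to the dmw/header load above, or different threads can end up
        // observing different hash codes. On nMCA machines that requires a
        // full fence; elsewhere a loadload() barrier is sufficient.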
>>
>>>> The other option, if we want to prioritize readability and 
>>>> understandability over optimization, is
>>>> to let the loads involved in the race be read with: 
>>>> NativeAccess<MO_SEQ_CST>::load(&_value)
>>>>
>>>> All sequentially consistent accesses are intuitively guaranteed to 
>>>> have a consistent global ordering,
>>>> as if they happened one after the other, and we can reason about it 
>>>> "easily".
>>>> As the conservative accesses that produce the values we are 
>>>> interested in are also (at least)
>>>> sequentially consistent, we only need to upgrade the loads to be 
>>>> sequentially consistent to solve
>>>> the race here.
>>>>
>>>> The good news with that solution, is that it is more intuitive to 
>>>> reason about in the way
>>>> our brains want to, without really having to know anything at all 
>>>> about nMCA machines; that
>>>> is just an implementation detail of how to get seq_cst semantics on 
>>>> such platforms, handled
>>>> by the access bindings.
>>>>
>>>> The bad news is that it will cost two unnecessary fences on PPC in 
>>>> this path, which might be
>>>> quite costly. On TSO platforms, it will boil down to Atomic::load, 
>>>> which is optimal. On AArch64 it
>>>> will yield a load_acquire, which is a bit suboptimal, but not too bad.
>>>>
>>>> I'll leave the choice up to you. But I insist that just a 
>>>> loadload() is not enough for semantic
>>>> correctness of the algorithm, and will indeed cause races at least 
>>>> on PPC, causing different
>>>> threads to get different hash codes.
>>> Again, I wasn't disagreeing with that at all...
>> Okay, great. Sounds like we landed in the optimized incomprehensible 
>> version. I’m okay with that!
>>
>>>>>>>> Thanks,
>>>>>>>> /Erik
>>>>>>> Thanks for the thorough review! I always appreciate it when you
>>>>>>> crawl through my code. :-)
>>>>>> Thanks for moving this mountain closer to jdk-jdk!
>>>>> Hey now! This mountain has become much smaller as we've split things
>>>>> off into sub-task after sub-task... :-)
>>>> Indeed. :)
>>>>
>>>> Thanks,
>>>> /Erik
>>> Thanks for the thorough review!
>> No problem! Done from my part now. Looks good and let’s ship it!
>>
>> /Erik
>>
>>> Dan
>>>
>


