RFR(L) 8153224 Monitor deflation prolong safepoints (CR11/v2.11/14-for-jdk15)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Thu May 14 21:01:56 UTC 2020
On 5/14/20 4:58 PM, Erik Österlund wrote:
> Hi Dan,
>
>> On 14 May 2020, at 21:30, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
>>
>> Hi Erik,
>>
>> I'm gonna trim this down a bit....
>>
>>> On 5/14/20 6:44 AM, Erik Österlund wrote:
>>> Hi Dan,
>>>
>>> On 2020-05-13 19:49, Daniel D. Daugherty wrote:
>> <snip>
>>>>>>> Maybe one could poke the contended counter,
>>>>>>> or find a convenient way of poisoning the displaced mark word so that the CAS installing it on JT1 fails.
>>>>>>> Anyway, there is some thinking to be done here I think.
>>>>>> Agreed. More mulling is/was needed. Here's the revised block:
>>>>>>
>>>>>> } else if (mark.has_monitor()) {
>>>>>>   monitor = mark.monitor();
>>>>>>   temp = monitor->header();
>>>>>>   assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
>>>>>>   hash = temp.hash();
>>>>>>   if (hash != 0) {
>>>>>>     // It has a hash.
>>>>>>     if (AsyncDeflateIdleMonitors && monitor->is_being_async_deflated()) {
>>>>>>       // But we can't safely use the hash if we detect that async
>>>>>>       // deflation has occurred. So we attempt to restore the
>>>>>>       // header/dmw to the object's header so that we only retry
>>>>>>       // once if the deflater thread happens to be slow.
>>>>>>       install_displaced_markword_in_object(obj);
>>>>>>       continue;
>>>>>>     }
>>>>>>     return hash;
>>>>>>   }
>>>>> Yeah, something like that should do the trick. Although, unfortunately I think we have to sprinkle some awkward
>>>>> fencing in there to make sure the load of the displaced mark word and the loads in is_being_async_deflated()
>>>>> are not reordered.
>>>> So is_being_async_deflated() looks like this right now:
>>>>
>>>> inline bool ObjectMonitor::is_being_async_deflated() {
>>>>   return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() && contentions() < 0;
>>>> }
>>>>
>>>> contentions() uses an Atomic::load() as of the v2.11 patch. However,
>>>> owner_is_DEFLATER_MARKER() does not:
>>>>
>>>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>>>   return _owner == DEFLATER_MARKER;
>>>> }
>>>>
>>>> So I'm going to change it to:
>>>>
>>>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>>>   return Atomic::load(&_owner) == DEFLATER_MARKER;
>>>> }
>>> Good catch.
>> Thanks. I do have this unresolved bug:
>>
>> JDK-8238174 migrate ObjectMonitor::_owner field away from C++ volatile semantics
>> https://bugs.openjdk.java.net/browse/JDK-8238174
>>
>> but this fix seems like a good idea now.
> Yup.
>
>>>> It should have been that way when I created owner_is_DEFLATER_MARKER(),
>>>> but I was sloppy. Bad Dan!
>>>>
>>>> Here's how the dmw/header field is fetched:
>>>>
>>>> inline markWord ObjectMonitor::header() const {
>>>>   return Atomic::load(&_header);
>>>> }
>>>>
>>>>
>>>> So if all three of the players (_header, _owner, _contentions) are
>>>> loaded with Atomic::load(), doesn't that keep them from being
>>>> reordered?
>>> No, it only prevents word tearing and makes us feel a bit better when we sleep at night,
>>> knowing that we are ensuring that in the semantically right way, rather than relying on volatile.
>> So you're saying that these operations in this order:
>>
>> Atomic::load(&_header);
>> Atomic::load(&_owner);
>> Atomic::load(&_contentions);
>>
>> can be reordered? (I'm not talking about stores here at all...)
> Yes, they can and will reorder, except for TSO machines like SPARC and x86 where they will not reorder.
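> If it helps, here is a standalone sketch of that point in plain C++ std::atomic terms (not HotSpot's Atomic API; the names are made up): relaxed atomic loads prevent word tearing but impose no ordering on each other, which is roughly the guarantee a plain Atomic::load gives you.
>
> #include <atomic>
>
> std::atomic<int> x{0}, y{0};
>
> int read_both() {
>   // Each load is atomic (no tearing), but nothing orders them with
>   // respect to each other, so a non-TSO CPU may observe them out of
>   // program order.
>   int a = x.load(std::memory_order_relaxed);
>   int b = y.load(std::memory_order_relaxed);
>   return a + b;
> }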
>
>>> But we need a bigger hammer than that here.
>>>
>>>> Thinking more, that should keep the loads() from being reordered
>>>> relative to each other, but I haven't done anything with respect
>>>> to the stores... sigh...
>>> The stores that move the monitor towards being async deflated are all done with conservatively
>>> ordered atomics, which is fine.
>> Okay... I think...
> My point is that contention decrements and the owner flip to the deflating marker are conservatively ordered atomic operations. There are other stores going on to these fields, but they are not relevant for the race.
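> As a standalone sketch of that shape (plain C++ std::atomic rather than the real ObjectMonitor fields or the real deflation protocol; the names and the constant below are illustrative only): the deflater claims an unowned monitor with one read-modify-write and then drives the contention count negative with another, and the seq_cst RMW operations here stand in for HotSpot's conservatively ordered atomics.
>
> #include <atomic>
>
> struct MonitorSketch {
>   std::atomic<void*> owner{nullptr};
>   std::atomic<int>   contentions{0};
> };
>
> void* const DEFLATER_MARKER = reinterpret_cast<void*>(-1);
>
> bool try_mark_for_deflation(MonitorSketch* m) {
>   void* expected = nullptr;
>   // Only an unowned monitor may be claimed by the deflater thread.
>   if (!m->owner.compare_exchange_strong(expected, DEFLATER_MARKER)) {
>     return false;
>   }
>   // Make the count unmistakably negative so racing threads can detect
>   // in-progress deflation (the real code uses a different scheme).
>   m->contentions.fetch_sub(1 << 30);
>   return true;
> }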
>
>>>>> In other places, I have seen that it is being used after an atomic operation, hence not needing
>>>>> further fencing. But here there are only loads involved. They will not have a consistent global ordering, and the lack
>>>>> thereof can lead us back to the race still happening. The intuitive solution would be to put in an OrderAccess::loadload()
>>>>> right before asking if monitor->is_being_async_deflated(). However, a loadload() on a non-multiple copy atomic (nMCA)
>>>>> machine does not protect against the loads happening in an inconsistent order when the stores are performed
>>>>> on different threads. In this case, if the current thread is A, then the displaced mark word could be set by a thread
>>>>> B, and the state in is_being_async_deflated() could be updated by a thread C. Let's say there is a fourth thread,
>>>>> D, that ends up in the same situation as thread A, then on an nMCA machine, even if there is a loadload() before calling
>>>>> is_being_async_deflated(), the ordering of the two loads could be different on thread A and thread D. This is the famous
>>>>> Independent Reads of Independent Writes (IRIW) scenario that is widely believed to not matter in practice as there are
>>>>> barely any algorithms where it matters. Here it does, because thread A might think that the racy hashCode is okay,
>>>>> while thread D says it is not okay. That means that one of the two reading threads may get an inconsistent hashCode
>>>>> value, unless there is a full fence before is_being_async_deflated(), depending on what side of the IRIW race they
>>>>> end up in. We need them both to always think it is *not* okay, and hence end up on the same side of the IRIW race.
>>>>>
>>>>> In order to synchronize this appropriately on a nMCA machine, you need a full fence() (instead of a loadload) between
>>>>> the load of the displaced mark word, and the loads in is_being_async_deflated(). The fence will make sure that if you
>>>>> read the updated (non-zero hashCode) displaced markWord, then it will globally sync any stale memory view (caches) such that when
>>>>> the loads in is_being_async_deflated() are being executed, they observe non-stale values. Normally you would
>>>>> need MO_SEQ_CST loads for all of these loads to really be able to reason about the correctness in a sane way, but since
>>>>> we have the domain knowledge that it is only in the flavour of the race where the DMW is not stale and the is_being_async_deflated()
>>>>> values are stale that we end up in trouble. So while a sane model would have leading fences on all the loads involved, we
>>>>> only need one. On normal platforms a loadload() is sufficient. You can check which one you are running on like this:
>>>>>
>>>>> // Long incomprehensible comment describing what on earth the point of this synchronization voodoo is
>>>>> if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>>> OrderAccess::fence();
>>>>> } else {
>>>>> OrderAccess::loadload();
>>>>> }
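>>>>>
>>>>> (If it helps to see the shape of the race in isolation, here is a standalone sketch in plain C++ std::atomic terms rather than HotSpot code; all names are made up. Writer B stands in for the thread installing the hash in the dmw, writer C for the deflater, and the two readers for threads A and D; the seq_cst loads in the reader play the role of the fence()/loadload() choice in the snippet above.)
>>>>>
>>>>> #include <atomic>
>>>>> #include <thread>
>>>>>
>>>>> std::atomic<int> hash_published{0};  // stands in for the dmw hash bits
>>>>> std::atomic<int> deflating{0};       // stands in for is_being_async_deflated()
>>>>>
>>>>> bool hash_looks_usable() {
>>>>>   // Both loads are seq_cst, so there is a single global order of these
>>>>>   // loads and the two independent writes below that all threads agree on.
>>>>>   int h = hash_published.load(std::memory_order_seq_cst);
>>>>>   int d = deflating.load(std::memory_order_seq_cst);
>>>>>   return h != 0 && d == 0;
>>>>> }
>>>>>
>>>>> int main() {
>>>>>   std::thread b([] { hash_published.store(1, std::memory_order_seq_cst); });
>>>>>   std::thread c([] { deflating.store(1, std::memory_order_seq_cst); });
>>>>>   std::thread a([] { (void)hash_looks_usable(); });
>>>>>   std::thread d([] { (void)hash_looks_usable(); });
>>>>>   b.join(); c.join(); a.join(); d.join();
>>>>>   return 0;
>>>>> }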
>>>>>
>>>>> Or we could go for a different solution that is easier to reason about, if you prefer that.
>>>> Based on what you're saying here, the first call to is_being_async_deflated()
>>>> in FastHashCode() needs to have an OrderAccess::loadload() barrier to ensure
>>>> proper ordering.
>>> Kind of. But I'm also saying that on e.g. PPC that is not enough and will still fail. There it needs a fence(),
>>> hence the filter.
>> Right. I was working thru the non-nMCA machine POV first. Sorry, I tend
>> not to worry about PPC or ARM, at least at first...
> Makes sense!
>
>>>> The is_being_async_deflated() call in ObjectMonitor::enter() is okay because
>>>> it follows inc_contentions() which has a barrier. The second call to
>>>> is_being_async_deflated() in FastHashCode should be okay because it
>>>> is preceded by an Atomic::cmpxchg() call...
>>>>
>>>> For now, I'm going with this:
>>>>
>>>> temp = monitor->header();
>>>> assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
>>>> hash = temp.hash();
>>>> if (hash != 0) {
>>>>   // It has a hash.
>>>>
>>>>   // Separate load of dmw/header above from the loads in
>>>>   // is_being_async_deflated(). Has to be re-examined for a
>>>>   // non-multiple copy atomic (nMCA) machine.
>>>>   OrderAccess::loadload();
>>>>   if (monitor->is_being_async_deflated()) {
>>>>
>>>> I've seen discussions about nMCA machines and the Independent Reads of Independent
>>>> Writes (IRIW) scenario, but I don't know enough about it yet to be comfortable
>>>> with putting an alternate fence() call in there...
>>> It will not work correctly unless we do that.
>> I wasn't disagreeing with that at all. I just needed another round
>> to think about it... and this is where I'm at right now:
>>
>> temp = monitor->header();
>> assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
>> hash = temp.hash();
>> if (hash != 0) {
>>   // It has a hash.
>>
>>   // Separate load of dmw/header above from the loads in
>>   // is_being_async_deflated().
>>   if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>     // A non-multiple copy atomic (nMCA) machine needs a bigger
>>     // hammer to make sure that the load above and the loads
>>     // below all see non-stale memory values.
>>     OrderAccess::fence();
>>   } else {
>>     OrderAccess::loadload();
>>   }
>>   if (monitor->is_being_async_deflated()) {
> That looks good. Note though that the comment is not quite right; we assume that the first load may be stale, but the subsequent ones we separate out may not. The reason is that a stale hash code is 0, which will simply trigger a retry, which is fine. So the proposed fencing does not guarantee the first load isn’t stale, but it is okay if it is algorithmically. Yet the subsequent loads must not be stale w.r.t. the first load, because then bad things happen.
I'm prepping the next review round now. Please comment on the wording
there and provide a clear suggested change.
Dan
>
>>> The other option, if we want to prioritize readability and understandability over optimization, is
>>> to let the loads involved in the race be read with: NativeAccess<MO_SEQ_CST>::load(&_value)
>>>
>>> All sequentially consistent accesses are intuitively guaranteed to have a consistent global ordering,
>>> as if they happened one after the other, and we can reason about it "easily".
>>> As the conservative accesses that produce the values we are interested in are also (at least)
>>> sequentially consistent, we only need to upgrade the loads to be sequentially consistent to solve
>>> the race here.
>>>
>>> The good news with that solution, is that it is more intuitive to reason about in the way
>>> our brains want to, without really having to know anything at all about nMCA machines; that
>>> is just an implementation detail of how to get seq_cst semantics on such platforms, handled
>>> by the access bindings.
>>>
>>> The bad news is that it will cost two unnecessary fences on PPC in this path, which might be
>>> quite costly. On TSO platforms, it will boil down to Atomic::load, which is optimal. On AArch64 it
>>> will yield a load_acquire, which is a bit suboptimal, but not too bad.
>>>
>>> I'll leave the choice up to you. But I insist that just a loadload() is not enough for semantic
>>> correctness of the algorithm, and will indeed cause races at least on PPC, causing different
>>> threads to get different hash codes.
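>>>
>>> For illustration only, a standalone sketch of what the seq_cst-load flavour of the check could look like, written against plain C++ std::atomic rather than the NativeAccess/Atomic bindings (field and function names are made up, not the actual ObjectMonitor code):
>>>
>>> #include <atomic>
>>> #include <cstdint>
>>>
>>> struct MonitorSketch {
>>>   std::atomic<intptr_t> header{0};        // displaced mark word / hash bits
>>>   std::atomic<void*>    owner{nullptr};
>>>   std::atomic<int>      contentions{0};
>>> };
>>>
>>> void* const DEFLATER_MARKER = reinterpret_cast<void*>(-1);
>>>
>>> // Returns true only if a non-zero hash was read and no async deflation
>>> // was observed; every load is sequentially consistent, so no explicit
>>> // platform-dependent fence is needed.
>>> bool hash_is_safe_to_use(MonitorSketch* m, intptr_t* hash_out) {
>>>   intptr_t h = m->header.load(std::memory_order_seq_cst);
>>>   if (h == 0) return false;     // no hash yet; caller falls through / retries
>>>   bool deflating = m->owner.load(std::memory_order_seq_cst) == DEFLATER_MARKER &&
>>>                    m->contentions.load(std::memory_order_seq_cst) < 0;
>>>   if (deflating) return false;  // caller restores the dmw and retries
>>>   *hash_out = h;
>>>   return true;
>>> }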
>> Again, I wasn't disagreeing with that at all...
> Okay, great. Sounds like we landed in the optimized incomprehensible version. I’m okay with that!
>
>>>>>>> Thanks,
>>>>>>> /Erik
>>>>>> Thanks for the thorough review! I always appreciate it when you
>>>>>> crawl through my code. :-)
>>>>> Thanks for moving this mountain closer to jdk-jdk!
>>>> Hey now! This mountain has become much smaller as we've split things
>>>> off into sub-task after sub-task... :-)
>>> Indeed. :)
>>>
>>> Thanks,
>>> /Erik
>> Thanks for the thorough review!
> No problem! Done from my part now. Looks good and let’s ship it!
>
> /Erik
>
>> Dan
>>