RFR(L) 8153224 Monitor deflation prolong safepoints (CR11/v2.11/14-for-jdk15)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Fri May 15 21:56:02 UTC 2020
First, I'm guessing that you missed this comment from me in the message
that you are replying to here:
> On 2020-05-14 23:01, Daniel D. Daugherty wrote:
> I'm prepping the next review round now. Please comment on the wording
> there and provide a clear suggested change.
This was an attempt to move this discussion from the CR11/v2.11/14-for-jdk15
to the CR12/v2.12/15-for-jdk15 thread.
Please DO NOT REPLY on the CR11/v2.11/14-for-jdk15 thread. If you
have an issue with anything that I put in this reply, then please
post the comment on the CR12/v2.12/15-for-jdk15 thread.
On 5/15/20 7:07 AM, Erik Österlund wrote:
> Hi Dan,
>
> Sorry, I stared at the code a bit longer and found one more IRIW bug
> and one normal fencing bug.
Okay, just for clarity's sake for the folks on the review thread: this
is the last blob of code and comments that I proposed:
      temp = monitor->header();
      assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
             temp.value());
      hash = temp.hash();
      if (hash != 0) {
        // It has a hash.

        // Separate load of dmw/header above from the loads in
        // is_being_async_deflated().
        if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
          // A non-multiple copy atomic (nMCA) machine needs a bigger
          // hammer to make sure that the load above and the loads
          // below all see non-stale memory values.
          OrderAccess::fence();
        } else {
          OrderAccess::loadload();
        }
        if (monitor->is_being_async_deflated()) {
And this is why we added it:
The "OrderAccess::loadload()" was added to solve a "normal fencing bug"
between "monitor->header()" and "monitor->is_being_async_deflated()".
The "monitor->header()" might see a stale value of zero, but that's
okay because it will result in a retry. The "OrderAccess::loadload()"
forces the "monitor->is_being_async_deflated()" to see current values
on TSO machines.
Just to be clear: We don't mind if the hash code from the header is
stale, but the owner field and the contentions field values that we
load in is_being_async_deflated() must not be stale.
For non-multiple copy atomic (nMCA) machines, the "OrderAccess::loadload()"
is not sufficient for some very complicated reasons involving what can
happen with four or more threads. At least I _think_ the nMCA race that
Erik is talking about requires 4 threads.
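To make that 4-thread scenario concrete, here's a minimal IRIW
litmus-test sketch in portable C++ (std::atomic and
std::atomic_thread_fence stand in for HotSpot's OrderAccess; this is
an illustration of the hazard, not HotSpot code):

  #include <atomic>

  std::atomic<int> X{0};  // stands in for the dmw/header
  std::atomic<int> Y{0};  // stands in for the owner/contentions state

  void writer_B() { X.store(1, std::memory_order_relaxed); }
  void writer_C() { Y.store(1, std::memory_order_relaxed); }

  void reader_A(int& r1, int& r2) {
    r1 = X.load(std::memory_order_relaxed);
    // The analogue of OrderAccess::fence(); an acquire fence (the
    // analogue of OrderAccess::loadload()) is NOT enough on nMCA.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = Y.load(std::memory_order_relaxed);
  }

  void reader_D(int& r3, int& r4) {
    r3 = Y.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r4 = X.load(std::memory_order_relaxed);
  }

  // With the full (seq_cst) fences, the outcome
  //   r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0
  // is forbidden: readers A and D must agree on the order of the two
  // independent stores. Downgrade both fences to loadload()/acquire
  // and an nMCA machine (e.g. PPC) may produce exactly that outcome.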
For the record, the comment that I added for the nMCA if-statement is not
quite right, but I think Erik will be providing a rewording... Here's a
simpler rewording for now:
// A non-multiple copy atomic (nMCA) machine needs a bigger
// hammer to separate the load above and the loads below.
So I no longer mumble about stale values in this comment at all.
At this point, I think everyone should have the proper context for
diving into this next set of comments. Or they should have a monster
headache and be in serious need of pain killers! :-)
> The two loads in this IRIW race are _contentions in
> is_async_deflating() and _header in the subsequent
> install_displaced_markword_in_object().
> On an nMCA machine, these two loads *must* be separated by a full
> fence. They observe values written by two independent threads to two
> independent addresses. Yet it is very important that these stores are
> observed to happen in the same order by all threads.
>
> To be concrete, the problem is code like this:
>
>   if (is_async_deflating()) {
>     install_displaced_markword_in_object();
>   }
>
> In is_async_deflating() we read _contentions and see that we are
> deflating, and in install_displaced_markword_in_object() we read
> the _header that will survive into the markWord. These loads must
> observe a total order, and hence must have a full fence() between
> them on nMCA machines.
>
> Let's say that one observer thread A reads an updated hashCode. It can
> then read _contentions either right before it gets updated
> or right after it gets updated in total order. If it reads it right
> before it gets updated, then that hashCode will be returned.
> Therefore, whichever other thread performs a successful
> install_displaced_markword_in_object operation *has to* catch the
> updated DMW
> that A observed, containing the new hashCode and install it in the
> header. If it does not, threads will end up retrieving different
> hashCodes. Therefore, it is crucial that the load of the DMW on the
> thread performing install_displaced_markword_in_object() is also
> seq_cst or otherwise deals with IRIW somehow (another fancy fence
> construction before reading header() in
> install_displaced_markword_in_object).
> Today it is a relaxed load, that is exposed to IRIW craziness. This
> allows the thread calling install_displaced_markword_in_object()
> to read a stale version of the DMW, and hence install the not yet
> updated hashCode back into the markWord, despite the updated DMW
> having been read by the unfortunate observer A. Another observer D
> will possibly read a different hashCode.
Just to be clear, I think everything in the above 5 paragraphs is about
nMCA machines and is not a "normal fencing bug". Right now, we have an
off-list thread trying to nail down the nMCA terminology, comments and
approach, so for now I'm not planning to address the above 5 paragraphs.
> Oh and we probably need a normal loadload() between reading _owner and
> _contentions in the is_async_deflating function. The
> loads re-ordering implies we could spuriously read an old value for
> _owner (not deflating) and a new value for _contentions, leading it
> to believe it isn't deflating even though it is. But that is easier to
> reason about because the loads are reading values written by the same
> thread.
Here's the current code in v2.12:

  inline bool ObjectMonitor::is_being_async_deflated() {
    return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() &&
           contentions() < 0;
  }

  inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
    return Atomic::load(&_owner) == DEFLATER_MARKER;
  }

  inline jint ObjectMonitor::contentions() const {
    return Atomic::load(&_contentions);
  }
So the loads of the owner and contentions fields should not reorder on
TSO machines. Also, we added an OrderAccess::loadload() on the one call to
is_being_async_deflated() that wasn't already preceded by a stronger memory
sync operation, so we shouldn't read a stale value for the owner field.
Here's the code that changes the owner and contentions fields to make the
is_being_async_deflated() function return true:

    // Set a NULL owner to DEFLATER_MARKER to force any contending thread
    // through the slow path. This is just the first part of the async
    // deflation dance.
    if (mid->try_set_owner_from(NULL, DEFLATER_MARKER) != NULL) {
      // The owner field is no longer NULL so we lost the race since the
      // ObjectMonitor is now busy.
      return false;
    }

which calls:

  inline void* ObjectMonitor::try_set_owner_from(void* old_value,
                                                 void* new_value) {
    void* prev = Atomic::cmpxchg(&_owner, old_value, new_value);

and then we do:

    // Make a zero contentions field negative to force any contending threads
    // to retry. This is the second part of the async deflation dance.
    if (Atomic::cmpxchg(&mid->_contentions, (jint)0, -max_jint) != 0) {
      // Contentions was no longer 0 so we lost the race since the
      // ObjectMonitor is now busy. Restore owner to NULL if it is
      // still DEFLATER_MARKER:
      mid->try_set_owner_from(DEFLATER_MARKER, NULL);
      return false;
    }
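Pulling those two fragments together, the dance reads roughly like
this (a condensed sketch only; the helper name is made up here and
the real deflation code does more than this):

    // Hypothetical condensation of the two-step async deflation dance.
    bool try_start_async_deflation(ObjectMonitor* mid) {
      // Part 1: claim a NULL owner field with DEFLATER_MARKER.
      if (mid->try_set_owner_from(NULL, DEFLATER_MARKER) != NULL) {
        return false;  // Owner was not NULL; the ObjectMonitor is busy.
      }
      // Part 2: force a zero contentions field negative.
      if (Atomic::cmpxchg(&mid->_contentions, (jint)0, -max_jint) != 0) {
        // Lost the race; undo part 1 if owner is still DEFLATER_MARKER.
        mid->try_set_owner_from(DEFLATER_MARKER, NULL);
        return false;
      }
      // Both cmpxchg() calls are full fences, so once we get here,
      // is_being_async_deflated() returns true on every machine.
      return true;
    }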
Both fields are changed with an Atomic::cmpxchg() call, which does
a full fence, so that makes the changes visible on any machine.
However, they are done with separate calls to cmpxchg() so it is
possible for a hash code fetching thread (hash-2) to observe:

  owner == DEFLATER_MARKER && contentions >= 0

in its call to is_being_async_deflated() because at that point the
ObjectMonitor is not yet being deflated. It will return the hash code
that it fetched off the ObjectMonitor.
In a previous discussion of the race in this code, we talked about
where that hash code in the ObjectMonitor came from, and we hypothesized
that it could have come from another thread (hash-1) that had just stored
the hash code value in the ObjectMonitor and then observed that the
ObjectMonitor was being deflated, so it bailed out to try again.
The hash code that hash-1 stored in the ObjectMonitor is thrown away
when the ObjectMonitor is deflated.
Okay, I just described the same race that motivated the addition of
this block of code:
        if (monitor->is_being_async_deflated()) {
          // But we can't safely use the hash if we detect that async
          // deflation has occurred. So we attempt to restore the
          // header/dmw to the object's header so that we only retry
          // once if the deflater thread happens to be slow.
          monitor->install_displaced_markword_in_object(obj);
          continue;
        }
We have hash-2 fetching a hash code from the ObjectMonitor and not
seeing that the ObjectMonitor is being deflated, so it returns a hash
code value that might not be valid.
We have hash-1, which has just stored a hash code into the ObjectMonitor
and then sees that the ObjectMonitor is being deflated, so it calls
install_displaced_markword_in_object(), which will attempt to restore
the object's header INCLUDING the hash code that it just saved in
the ObjectMonitor.
We have the async deflater thread that has decided that the ObjectMonitor
is not busy, so it goes thru the deflation protocol and also calls
install_displaced_markword_in_object(), which will restore the object's
header from the dmw/header in the ObjectMonitor. That dmw/header may
or may not contain a hash code depending on how far hash-1 has gotten.
So we have these possible scenarios:

1) Deflater thread managed to restore the object's header WITHOUT a
   hash code. Yes, hash-1 stored a hash code in the dmw/header field
   in the ObjectMonitor, but it was too late for the deflater thread
   to use it for the restoration.
   Note: hash-1's call to install_displaced_markword_in_object()
   after detecting async deflation won't change the object's header.
   So the hash code saved by hash-1 in the ObjectMonitor is lost.

2) hash-1 managed to restore the object's header with a hash code.
   Yes, the deflater thread succeeded in deflating the ObjectMonitor,
   but it did not succeed in restoring the object's header.

3) hash-1 managed to save the hash code in the dmw/header in the
   ObjectMonitor before the deflater thread restored the object's
   header, so the object's header is restored with a hash code.
So in scenarios #2 and #3, the hash code observed by hash-2 is okay
because it is the hash code in the object's header.
This leaves scenario #1 where hash-1 has managed to save a hash code
in the ObjectMonitor, but that hash code is not associated with the
object's header when deflation is done. So is it possible for hash-2
to see the hash code in the ObjectMonitor and not see that the
ObjectMonitor is being deflated while at the same time the deflater
thread does not see the hash code and manages to deflate the
ObjectMonitor and restore the object's header to the dmw (without a
hash code value)?
  hash-1                 deflater          | header  owner     contentions | hash-2
  ---------------------  ----------------  | ------  --------  ----------- | ------------------
  :                      mark owner        | <dmw>   DEF_MARK  0           | :
  :                      set cont          |                   -max_jint   | :
  :                      restore obj hdr   |                               | :
  :                        get OM hdr      |                               | :
  set OM hdr=<hash>      :                 | <hash>                        | :
  :                        set obj hdr     |                               | get <hash>
  owner == DEF_MARK          to <dmw>      |                               | owner == DEF_MARK
    && cont < 0 (works)                    |                               | cont >= 0 ???
  restore obj hdr                          |                               | return <hash>
    get OM hdr                             |                               |
    set obj hdr                            |                               |
      to <hash>                            |                               |
    (fails)                                |                               |
For scenario #1 to work, the deflater thread has to quickly mark the
owner field and set the contentions field negative before it gets the
ObjectMonitor's header value that DOES NOT contain the <hash>.
The hash-1 thread sets the ObjectMonitor's header field to <hash>
after the deflater thread has gotten the ObjectMonitor's header value
and in time for hash-2 to get the <hash> value from the ObjectMonitor.
The deflater thread finishes its install_displaced_markword_in_object()
call and sets the object's header to <dmw>.
hash-1 then observes that the ObjectMonitor is deflated and it
calls install_displaced_markword_in_object(), but it fails to set
the object's header because the object's header no longer contains
the ObjectMonitor*. We have lost the hash code value generated by hash-1.
For the failure to occur:
- hash-2 has to see that the ObjectMonitor is NOT deflated after it
  has seen the <hash> value set by hash-1.
- the deflater thread has to have set the owner field to DEFLATER_MARKER
  and the contentions field to -max_jint before it sees the
  ObjectMonitor's header field == <dmw> and not <hash>.
- hash-1 has to have set the ObjectMonitor's header field to <hash>
  after the deflater thread has set the owner field to DEFLATER_MARKER
  and the contentions field to -max_jint (and after the deflater thread
  has loaded the header and seen <dmw>), AND before hash-2 loads the
  <hash> value from the ObjectMonitor's header field.
In order for the deflater thread to see <dmw> in the ObjectMonitor's header,
it must have executed its load FROM the ObjectMonitor's header BEFORE
hash-1 executed its store of <hash> TO the ObjectMonitor's header. That
means that the deflater thread must have also executed its cmpxchg() of
DEFLATER_MARKER TO the owner field and its cmpxchg() of -max_jint TO the
contentions field BEFORE hash-1 executed its store of <hash> TO the
ObjectMonitor's header.
In order for hash-2 to see <hash> when it loads FROM the ObjectMonitor's
header, it has to have executed that load AFTER hash-1 executed its
store of <hash> TO the ObjectMonitor's header. In the previous paragraph,
we showed that the deflater thread must have also executed its cmpxchg() of
DEFLATER_MARKER TO the owner field and its cmpxchg of -max_jint TO the
contentions field BEFORE hash-1 executed its store of <hash> TO the
ObjectMonitor's header. If hash-2 sees the <hash> value, then it must
also see the owner field == DEFLATER_MARKER and the contentions field
== -max_jint.
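That argument, reduced to a litmus-test sketch (portable C++
stand-ins again, illustrative only; the values are modeling shorthand,
and the guarantee in the trailing comment is a property of TSO
hardware, not something these relaxed C++ orderings provide by
themselves):

  #include <atomic>

  std::atomic<int> owner{0};        // 0 ~ NULL, 1 ~ DEFLATER_MARKER
  std::atomic<int> contentions{0};  // -1 ~ -max_jint
  std::atomic<int> header{0};       // 0 ~ <dmw>, 1 ~ <hash>

  void deflater() {
    int e;
    // Models the two conservatively ordered (full fence) cmpxchg() calls.
    e = 0; owner.compare_exchange_strong(e, 1);
    e = 0; contentions.compare_exchange_strong(e, -1);
    // Scenario #1: this load returns 0, i.e. the deflater sees <dmw>.
    int h = header.load(std::memory_order_seq_cst);
    // ... then restores the object's header from <dmw> ...
    (void)h;
  }

  void hash_1() {
    header.store(1, std::memory_order_relaxed);  // store <hash>
  }

  void hash_2(int& r_header, int& r_owner, int& r_cont) {
    r_header = header.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // ~ loadload()
    r_owner = owner.load(std::memory_order_relaxed);
    r_cont  = contentions.load(std::memory_order_relaxed);
  }

  // On a TSO machine: the deflater's load returning 0 means hash-1's
  // store of <hash> became globally visible only after the two fenced
  // cmpxchg() stores were already visible to everyone. So any hash_2
  // that sees r_header == 1 must also see r_owner == 1 && r_cont == -1.
  // On an nMCA machine this chain does not hold with just loadload(),
  // which is the separate 4-thread problem discussed earlier.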
The part of the transaction that I show above for hash-2, "cont >= 0 ???",
is NOT possible on a TSO machine. So I don't see how this code:
  inline bool ObjectMonitor::is_being_async_deflated() {
    return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() &&
           contentions() < 0;
  }
when called by hash-2 can fail to see DEFLATER_MARKER in the owner field
and -max_jint in the contentions field.
So as far as I can tell, this code that we added works on a TSO machine:
        if (monitor->is_being_async_deflated()) {
          // But we can't safely use the hash if we detect that async
          // deflation has occurred. So we attempt to restore the
          // header/dmw to the object's header so that we only retry
          // once if the deflater thread happens to be slow.
          monitor->install_displaced_markword_in_object(obj);
          continue;
        }
Please let me know if you see something that I don't with this analysis.
Please DO NOT REPLY on the CR11/v2.11/14-for-jdk15 thread. If you
have an issue with anything that I put in this reply, then please
post the comment on the CR12/v2.12/15-for-jdk15 thread.
> I will wait with formulating fat comments explaining what on earth
> this is all about, until we have decided on whether we want the
> solution to be making the loads seq_cst, inserting fancy
> incomprehensible IRIW-fences, or something completely different.
I'm good with holding off on any more nMCA machine changes until the
dust settles in the discussions happening off the OpenJDK aliases.
> Sorry for the trouble.
So far I don't see any trouble on the "normal fencing bug" fork of
this review thread. There is definitely more work to do on the
nMCA machine changes, but I'm not going to try to handle that in
this reply.
Again, thanks for your reviews and comments. It's been a while
since I've spent most of a day on the reply to just one set of
code review comments.
Dan
>
> /Erik
>
> On 2020-05-14 23:01, Daniel D. Daugherty wrote:
>> On 5/14/20 4:58 PM, Erik Österlund wrote:
>>> Hi Dan,
>>>
>>>> On 14 May 2020, at 21:30, Daniel D. Daugherty
>>>> <daniel.daugherty at oracle.com> wrote:
>>>>
>>>> Hi Erik,
>>>>
>>>> I'm gonna trim this down a bit....
>>>>
>>>>> On 5/14/20 6:44 AM, Erik Österlund wrote:
>>>>> Hi Dan,
>>>>>
>>>>> On 2020-05-13 19:49, Daniel D. Daugherty wrote:
>>>> <snip>
>>>>>>>>> Maybe one could poke the contended counter,
>>>>>>>>> or find a convenient way of poisoning the displaced mark word
>>>>>>>>> so that the CAS installing it on JT1 fails.
>>>>>>>>> Anyway, there is some thinking to be done here I think.
>>>>>>>> Agreed. More mulling is/was needed. Here's the revised block:
>>>>>>>>
>>>>>>>>    } else if (mark.has_monitor()) {
>>>>>>>>      monitor = mark.monitor();
>>>>>>>>      temp = monitor->header();
>>>>>>>>      assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
>>>>>>>>             temp.value());
>>>>>>>>      hash = temp.hash();
>>>>>>>>      if (hash != 0) {
>>>>>>>>        // It has a hash.
>>>>>>>>        if (AsyncDeflateIdleMonitors &&
>>>>>>>>            monitor->is_being_async_deflated()) {
>>>>>>>>          // But we can't safely use the hash if we detect that async
>>>>>>>>          // deflation has occurred. So we attempt to restore the
>>>>>>>>          // header/dmw to the object's header so that we only retry
>>>>>>>>          // once if the deflater thread happens to be slow.
>>>>>>>>          install_displaced_markword_in_object(obj);
>>>>>>>>          continue;
>>>>>>>>        }
>>>>>>>>        return hash;
>>>>>>>>      }
>>>>>>> Yeah, something like that should do the trick. Although,
>>>>>>> unfortunately I think we have to sprinkle some awkward
>>>>>>> fencing in there to make sure the load of the displaced mark
>>>>>>> word and the loads in is_being_async_deflated()
>>>>>>> are not reordered.
>>>>>> So is_being_async_deflated() looks like this right now:
>>>>>>
>>>>>> inline bool ObjectMonitor::is_being_async_deflated() {
>>>>>>   return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() &&
>>>>>>          contentions() < 0;
>>>>>> }
>>>>>>
>>>>>> contentions() uses an Atomic::load() as of the v2.11 patch. However,
>>>>>> owner_is_DEFLATER_MARKER() does not:
>>>>>>
>>>>>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>>>>>   return _owner == DEFLATER_MARKER;
>>>>>> }
>>>>>>
>>>>>> So I'm going to change it to:
>>>>>>
>>>>>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>>>>>   return Atomic::load(&_owner) == DEFLATER_MARKER;
>>>>>> }
>>>>> Good catch.
>>>> Thanks. I do have this unresolved bug:
>>>>
>>>> JDK-8238174 migrate ObjectMonitor::_owner field away from C++
>>>> volatile semantics
>>>> https://bugs.openjdk.java.net/browse/JDK-8238174
>>>>
>>>> but this fix seems like a good idea now.
>>> Yup.
>>>
>>>>>> It should have been that way when I created
>>>>>> owner_is_DEFLATER_MARKER(),
>>>>>> but I was sloppy. Bad Dan!
>>>>>>
>>>>>> Here's how the dmw/header field is fetched:
>>>>>>
>>>>>> inline markWord ObjectMonitor::header() const {
>>>>>>   return Atomic::load(&_header);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> So if all three of the players (_header, _owner, _contentions) are
>>>>>> loaded with Atomic::load(), doesn't that keep them from being
>>>>>> reordered?
>>>>> No, it only prevents word tearing and makes us feel a bit better
>>>>> when we sleep at night,
>>>>> knowing that we are ensuring that in the semantically right way,
>>>>> rather than relying on volatile.
>>>> So you're saying that these operations in this order:
>>>>
>>>> Atomic::load(&_header);
>>>> Atomic::load(&_owner);
>>>> Atomic::load(&_contentions);
>>>>
>>>> can be reordered? (I'm not talking about stores here at all...)
>>> Yes, they can and will reorder, except for TSO machines like SPARC
>>> and x86 where they will not reorder.
>>>
>>>>> But we need a bigger hammer than that here.
>>>>>
>>>>>> Thinking more, that should keep the loads() from being reordered
>>>>>> relative to each other, but I haven't done anything with respect
>>>>>> to the stores... sigh...
>>>>> The stores that move the monitor towards being async deflating are
>>>>> all updated with conservatively
>>>>> ordered atomics, which is fine.
>>>> Okay... I think...
>>> My point is that contention decrements and the owner flip to the
>>> deflating marker are conservatively ordered atomic operations. There
>>> are other stores going on to these fields, but they are not relevant
>>> for the race.
>>>
>>>>>>> In other places, I have seen that it is being used after an
>>>>>>> atomic operation, hence not needing
>>>>>>> further fencing. But here there are only loads involved. They
>>>>>>> will not have a consistent global ordering, and the lack
>>>>>>> thereof can lead us back to the race still happening. The
>>>>>>> intuitive solution would be to put in an OrderAccess::loadload()
>>>>>>> right before asking if monitor->is_being_async_deflated().
>>>>>>> However, a loadload on a non-multiple copy atomic (nMCA)
>>>>>>> machine does not protect against the loads happening in an
>>>>>>> inconsistent order when the stores are performed
>>>>>>> on different threads. In this case, if the current thread is A,
>>>>>>> then the displaced mark word could be set by a thread
>>>>>>> B, and the state in is_being_async_deflated() could be updated
>>>>>>> by a thread C. Let's say there is a fourth thread,
>>>>>>> D, that ends up in the same situation as thread A, then on an
>>>>>>> nMCA machine, even if there is a loadload() before calling
>>>>>>> is_being_async_deflated(), the ordering of the two loads could
>>>>>>> be different on thread A and thread D. This is the famous
>>>>>>> Independent Reads of Independent Writes (IRIW) scenario that is
>>>>>>> widely believed to not matter in practice as there are
>>>>>>> barely any algorithms where it matters. Here it does, because
>>>>>>> thread A might think that the racy hashCode is okay,
>>>>>>> while thread D says it is not okay. That means that one of the
>>>>>>> two reading threads may get an inconsistent hashCode
>>>>>>> value, unless there is a full fence before
>>>>>>> is_being_async_deflated(), depending on what side of the IRIW
>>>>>>> race they
>>>>>>> end up in. We need them both to always think it is *not* okay,
>>>>>>> and hence end up on the same side of the IRIW race.
>>>>>>>
>>>>>>> In order to synchronize this appropriately on a nMCA machine,
>>>>>>> you need a full fence() (instead of a loadload) between
>>>>>>> the load of the displaced mark word, and the loads in
>>>>>>> is_being_async_deflated(). The fence will make sure that if you
>>>>>>> read the updated (non-zero hashCode) displaced markWord, then it
>>>>>>> will globally sync the stale memory view (caches) such that when
>>>>>>> the loads in is_being_async_deflated() are being executed, they
>>>>>>> observe non-stale values. Normally you would
>>>>>>> need MO_SEQ_CST loads for all of these loads to really be able
>>>>>>> to reason about its correctness in a sane way, but since
>>>>>>> we have the domain knowledge that it is only in the flavour of race
>>>>>>> where the DMW is not stale but the is_being_async_deflated()
>>>>>>> values are that we end up in trouble. So while a sane model
>>>>>>> would have leading fences on all the loads involved, we
>>>>>>> only need one. On normal platforms a loadload() is sufficient.
>>>>>>> You can check which one you are running on like this:
>>>>>>>
>>>>>>> // Long incomprehensible comment describing what on earth the
>>>>>>> // point of this synchronization voodoo is
>>>>>>> if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>>>>>   OrderAccess::fence();
>>>>>>> } else {
>>>>>>>   OrderAccess::loadload();
>>>>>>> }
>>>>>>>
>>>>>>> Or we could go for a different solution that is easier to reason
>>>>>>> about, if you prefer that.
>>>>>> Based on what you're saying here, the first call to
>>>>>> is_being_async_deflated()
>>>>>> in FastHashCode() needs to have an OrderAccess::loadload()
>>>>>> barrier to ensure
>>>>>> proper ordering.
>>>>> Kind of. But I'm also saying that on e.g. PPC that is not enough
>>>>> and will still fail. There it needs a fence(),
>>>>> hence the filter.
>>>> Right. I was working thru the non-nMCA machine POV first. Sorry, I
>>>> tend
>>>> not to worry about PPC or ARM, at least at first...
>>> Makes sense!
>>>
>>>>>> The is_being_async_deflated() call in ObjectMonitor::enter() is
>>>>>> okay because
>>>>>> it follows inc_contentions() which has a barrier. The second
>>>>>> is_being_async_deflated() call in FastHashCode should be okay
>>>>>> because it
>>>>>> is preceded by an Atomic::cmpxchg() call...
>>>>>>
>>>>>> For now, I'm going with this:
>>>>>>
>>>>>>       temp = monitor->header();
>>>>>>       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
>>>>>>              temp.value());
>>>>>>       hash = temp.hash();
>>>>>>       if (hash != 0) {
>>>>>>         // It has a hash.
>>>>>>
>>>>>>         // Separate load of dmw/header above from the loads in
>>>>>>         // is_being_async_deflated(). Has to be re-examined for a
>>>>>>         // non-multiple copy atomic (nMCA) machine.
>>>>>>         OrderAccess::loadload();
>>>>>>         if (monitor->is_being_async_deflated()) {
>>>>>>
>>>>>> I've seen discussions about nMCA machines and the Independent
>>>>>> Reads of Independent
>>>>>> Writes (IRIW) scenario, but I don't know enough about it yet to
>>>>>> be comfortable
>>>>>> with putting an alternate fence() call in there...
>>>>> It will not work correctly unless we do that.
>>>> I wasn't disagreeing with that at all. I just needed another round
>>>> to think about it... and this where I'm at right now:
>>>>
>>>>       temp = monitor->header();
>>>>       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT,
>>>>              temp.value());
>>>>       hash = temp.hash();
>>>>       if (hash != 0) {
>>>>         // It has a hash.
>>>>
>>>>         // Separate load of dmw/header above from the loads in
>>>>         // is_being_async_deflated().
>>>>         if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>>           // A non-multiple copy atomic (nMCA) machine needs a bigger
>>>>           // hammer to make sure that the load above and the loads
>>>>           // below all see non-stale memory values.
>>>>           OrderAccess::fence();
>>>>         } else {
>>>>           OrderAccess::loadload();
>>>>         }
>>>>         if (monitor->is_being_async_deflated()) {
>>> That looks good. Note though that the comment is not quite right; we
>>> assume that the first load may be stale, but the subsequent ones we
>>> separate out may not. The reason is that a stale hash code is 0,
>>> which will simply trigger a retry, which is fine. So the proposed
>>> fencing does not guarantee the first load isn’t stale, but it is
>>> okay if it is algorithmically. Yet the subsequent loads must not be
>>> stale w.r.t. the first load, because then bad things happen.
>>
>> I'm prepping the next review round now. Please comment on the wording
>> there and provide a clear suggested change.
>>
>> Dan
>>
>>>
>>>>> The other option, if we want to prioritize readability and
>>>>> understandability over optimization, is
>>>>> to let the loads involved in the race be read with:
>>>>> NativeAccess<MO_SEQ_CST>::load(&_value)
>>>>>
>>>>> All sequentially consistent accesses are intuitively guaranteed to
>>>>> have a consistent global ordering,
>>>>> as if they happened one after the other, and we can reason about
>>>>> it "easily".
>>>>> As the conservative accesses that produce the values we are
>>>>> interested in are also (at least)
>>>>> sequentially consistent, we only need to upgrade the loads to be
>>>>> sequentially consistent to solve
>>>>> the race here.
>>>>>
>>>>> The good news with that solution, is that it is more intuitive to
>>>>> reason about in the way
>>>>> our brains want to, without really having to know anything at all
>>>>> about nMCA machines; that
>>>>> is just an implementation detail of how to get seq_cst semantics
>>>>> on such platforms, handled
>>>>> by the access bindings.
>>>>>
>>>>> The bad news is that it will cost two unnecessary fences on PPC in
>>>>> this path, which might be
>>>>> quite costly. On TSO platforms, it will boil down to Atomic::load,
>>>>> which is optimal. On AArch64 it
>>>>> will yield a load_acquire, which is a bit suboptimal, but not too
>>>>> bad.
>>>>>
>>>>> I'll leave the choice up to you. But I insist that just a
>>>>> loadload() is not enough for semantic
>>>>> correctness of the algorithm, and will indeed cause races at least
>>>>> on PPC, causing different
>>>>> threads to get different hash codes.
>>>> Again, I wasn't disagreeing with that at all...
>>> Okay, great. Sounds like we landed in the optimized incomprehensible
>>> version. I’m okay with that!
>>>
>>>>>>>>> Thanks,
>>>>>>>>> /Erik
>>>>>>>> Thanks for the thorough review! I always appreciate it when you
>>>>>>>> crawl through my code. :-)
>>>>>>> Thanks for moving this mountain closer to jdk-jdk!
>>>>>> Hey now! This mountain has become much smaller as we've split things
>>>>>> off into sub-task after sub-task... :-)
>>>>> Indeed. :)
>>>>>
>>>>> Thanks,
>>>>> /Erik
>>>> Thanks for the thorough review!
>>> No problem! Done from my part now. Looks good and let’s ship it!
>>>
>>> /Erik
>>>
>>>> Dan
>>>>
>>
>