RFR(L) 8153224 Monitor deflation prolong safepoints (CR11/v2.11/14-for-jdk15)

Daniel D. Daugherty daniel.daugherty at oracle.com
Thu May 14 19:30:06 UTC 2020


Hi Erik,

I'm gonna trim this down a bit....

On 5/14/20 6:44 AM, Erik Österlund wrote:
> Hi Dan,
>
> On 2020-05-13 19:49, Daniel D. Daugherty wrote:
<snip>
>
>>
>>>
>>>>
>>>>> Maybe one could poke the contended counter,
>>>>> or find a convenient way of poisoning the displaced mark word so 
>>>>> that the CAS installing it on JT1 fails.
>>>>> Anyway, there is some thinking to be done here I think.
>>>>
>>>> Agreed. More mulling is/was needed. Here's the revised block:
>>>>
>>>>     } else if (mark.has_monitor()) {
>>>>       monitor = mark.monitor();
>>>>       temp = monitor->header();
>>>>       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, 
>>>> temp.value());
>>>>       hash = temp.hash();
>>>>       if (hash != 0) {
>>>>         // It has a hash.
>>>>         if (AsyncDeflateIdleMonitors && 
>>>> monitor->is_being_async_deflated()) {
>>>>           // But we can't safely use the hash if we detect that async
>>>>           // deflation has occurred. So we attempt to restore the
>>>>           // header/dmw to the object's header so that we only retry
>>>>           // once if the deflater thread happens to be slow.
>>>>           install_displaced_markword_in_object(obj);
>>>>           continue;
>>>>         }
>>>>         return hash;
>>>>       }
>>>
>>> Yeah, something like that should do the trick. Although, 
>>> unfortunately I think we have to sprinkle some awkward
>>> fencing in there to make sure the load of the displaced mark word 
>>> and the loads in is_being_async_deflated()
>>> are not reordered.
>>
>> So is_being_async_deflated() looks like this right now:
>>
>> inline bool ObjectMonitor::is_being_async_deflated() {
>>   return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() && 
>> contentions() < 0;
>> }
>>
>> contentions() uses an Atomic::load() as of the v2.11 patch. However,
>> owner_is_DEFLATER_MARKER() does not:
>>
>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>   return _owner == DEFLATER_MARKER;
>> }
>>
>> So I'm going to change it to:
>>
>> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>>   return Atomic::load(&_owner) == DEFLATER_MARKER;
>> }
>
> Good catch.

Thanks. I do have this unresolved bug:

     JDK-8238174 migrate ObjectMonitor::_owner field away from C++ 
volatile semantics
     https://bugs.openjdk.java.net/browse/JDK-8238174

but this fix seems like a good idea now.


>
>> It should have been that way when I created owner_is_DEFLATER_MARKER(),
>> but I was sloppy. Bad Dan!
>>
>> Here's how the dmw/header field is fetched:
>>
>> inline markWord ObjectMonitor::header() const {
>>   return Atomic::load(&_header);
>> }
>>
>>
>> So if all three of the players (_header, _owner, _contentions) are
>> loaded with Atomic::load(), doesn't that keep them from being
>> reordered?
>
> No, it only prevents word tearing and makes us feel a bit better when 
> we sleep at night,
> knowing that we are ensuring it in the semantically right way, rather 
> than relying on volatile.

So you're saying that these operations in this order:

     Atomic::load(&_header);
     Atomic::load(&_owner);
     Atomic::load(&_contentions);

can be reordered? (I'm not talking about stores here at all...)
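
Just so we're talking about the same thing, here's the reordering I'm
asking about as a tiny standalone sketch. This is plain C++11 std::atomic
standing in for the HotSpot Atomic class, with made-up int fields, so it
is a model of the question and not the actual ObjectMonitor code:

#include <atomic>
#include <thread>

struct MonitorModel {
  // Stand-ins for the ObjectMonitor fields; int instead of the real types.
  std::atomic<int> _owner{0};
  std::atomic<int> _header{0};
};

MonitorModel mon;

void writer() {
  // The writer publishes _owner before _header.
  mon._owner.store(1, std::memory_order_release);
  mon._header.store(1, std::memory_order_release);
}

void reader() {
  // Two relaxed loads, which is roughly what Atomic::load() gives us.
  // Nothing stops the CPU (or compiler) from satisfying the second load
  // before the first, so we can observe _header == 1 and _owner == 0.
  int h = mon._header.load(std::memory_order_relaxed);
  // An OrderAccess::loadload() here (an acquire fence in std::atomic
  // terms) is what forbids that reordering in this two-thread case.
  int o = mon._owner.load(std::memory_order_relaxed);
  (void)h; (void)o;
}

int main() {
  std::thread t1(writer), t2(reader);
  t1.join();
  t2.join();
  return 0;
}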


> But we need a bigger hammer than that here.
>
>> Thinking more, that should keep the loads from being reordered
>> relative to each other, but I haven't done anything with respect
>> to the stores... sigh...
>
> The stores that move the monitor towards being async deflated are all 
> performed with conservatively
> ordered atomics, which is fine.

Okay... I think...


>
>>
>>> In other places, I have seen that it is being used after an atomic 
>>> operation, hence not needing
>>> further fencing. But here there are only loads involved. They will 
>>> not have a consistent global ordering, and the lack
>>> thereof can lead us back to the race still happening. The intuitive 
>>> solution would be to put in an OrderAccess::loadload()
>>> right before asking if monitor->is_being_async_deflated(). However, 
>>> a loadload on a non-multiple copy atomic (nMCA)
>>> machine does not protect against the loads happening in an 
>>> inconsistent order, when the stores are performed
>>> on different threads. In this case, if the current thread is A, then 
>>> the displaced mark word could be set by a thread
>>> B, and the state in is_being_async_deflated() could be updated by a 
>>> thread C. Let's say there is a fourth thread,
>>> D, that ends up in the same situation as thread A, then on a nMCA 
>>> machine, even if there is a loadload() before calling
>>> is_being_async_deflated(), the ordering of the two loads could be 
>>> different on thread A and thread D. This is the famous
>>> Independent Reads of Independent Writes (IRIW) scenario that is 
>>> widely believed to not matter in practice as there are
>>> barely any algorithms where it matters. Here it does, because thread 
>>> A might think that the racy hashCode is okay,
>>> while thread D sees that it is not okay. That means that one of the two 
>>> reading threads may get an inconsistent hashCode
>>> value, unless there is a full fence before 
>>> is_being_async_deflated(), depending on what side of the IRIW race they
>>> end up in. We need them both to always think it is *not* okay, and 
>>> hence end up on the same side of the IRIW race.
>>>
>>> In order to synchronize this appropriately on a nMCA machine, you 
>>> need a full fence() (instead of a loadload) between
>>> the load of the displaced mark word, and the loads in 
>>> is_being_async_deflated(). The fence will make sure that if you
>>> read the updated (non-zero hashCode) displaced markWord, then it 
>>> will globally sync stale memory view (caches) such that when
>>> the loads in is_being_async_deflated() are being executed, they 
>>> observe non-stale values(). Normally you would
>>> need MO_SEQ_CST loads for all of these loads to really be able to 
>>> reason about its correctness in a sane way, but since
>>> we have the domain knowledge that it is only in the flavour of race 
>>> where the DMW is not stale but the is_being_async_deflated
>>> values are stale that we end up in trouble. So while a sane model would 
>>> have leading fences on all the loads involved, we
>>> only need one. On normal platforms a loadload() is sufficient. You 
>>> can check which one you are running on like this:
>>>
>>> // Long incomprehensible comment describing what on earth the point 
>>> of this synchronization voodoo is
>>> if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>>   OrderAccess::fence();
>>> } else {
>>>   OrderAccess::loadload();
>>> }
>>>
>>> Or we could go for a different solution that is easier to reason 
>>> about, if you prefer that.
>>
>> Based on what you're saying here, the first call to 
>> is_being_async_deflated()
>> in FastHashCode() needs to have an OrderAccess::loadload() barrier to 
>> ensure
>> proper ordering.
>
> Kind of. But I'm also saying that on e.g. PPC that is not enough and 
> will still fail. There it needs a fence(),
> hence the filter.

Right. I was working thru the non-nMCA machine POV first. Sorry, I tend
not to worry about PPC or ARM, at least at first...
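
To make sure I have the IRIW scenario straight, here's how I picture it
as a standalone sketch. Plain C++11 atomics again, with the thread roles
named after your A/B/C/D description and two booleans standing in for the
dmw and the deflation state; it's a model of the race, not the monitor
code:

#include <atomic>
#include <thread>

// B publishes the restored dmw; C publishes the "being async deflated"
// state. Both writers use seq_cst to model the conservatively ordered
// stores on the write side.
std::atomic<bool> dmw_restored{false};
std::atomic<bool> async_deflated{false};

bool a1, a2;  // what reader A observes
bool d1, d2;  // what reader D observes

void thread_B() { dmw_restored.store(true, std::memory_order_seq_cst); }
void thread_C() { async_deflated.store(true, std::memory_order_seq_cst); }

void thread_A() {
  a1 = dmw_restored.load(std::memory_order_acquire);
  a2 = async_deflated.load(std::memory_order_acquire);
}

void thread_D() {
  d1 = async_deflated.load(std::memory_order_acquire);
  d2 = dmw_restored.load(std::memory_order_acquire);
}

int main() {
  std::thread b(thread_B), c(thread_C), a(thread_A), d(thread_D);
  b.join(); c.join(); a.join(); d.join();
  // On an nMCA machine the outcome a1 && !a2 && d1 && !d2 is allowed:
  // A sees B's store but not C's, while D sees C's store but not B's,
  // so the two readers disagree about the order of the writes. Making
  // all four loads seq_cst (or putting a full fence between the two
  // loads in each reader) rules that outcome out; a loadload does not.
  return 0;
}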


>
>> The is_being_async_deflated() call in ObjectMonitor::enter() is okay 
>> because
>> it follows inc_contentions(), which has a barrier. The second
>> is_being_async_deflated() call in FastHashCode() should be okay because it
>> is preceded by an Atomic::cmpxchg() call...
>>
>> For now, I'm going with this:
>>
>>       temp = monitor->header();
>>       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, 
>> temp.value());
>>       hash = temp.hash();
>>       if (hash != 0) {
>>         // It has a hash.
>>
>>         // Separate load of dmw/header above from the loads in
>>         // is_being_async_deflated(). Has to be re-examined for a
>>         // non-multiple copy atomic (nMCA) machine.
>>         OrderAccess::loadload();
>>         if (monitor->is_being_async_deflated()) {
>>
>> I've seen discussions about nMCA machines and the Independent Reads 
>> of Independent
>> Writes (IRIW) scenario, but I don't know enough about it yet to be 
>> comfortable
>> with putting an alternate fence() call in there...
>
> It will not work correctly unless we do that.

I wasn't disagreeing with that at all. I just needed another round
to think about it... and this is where I'm at right now:

       temp = monitor->header();
       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, 
temp.value());
       hash = temp.hash();
       if (hash != 0) {
         // It has a hash.

         // Separate load of dmw/header above from the loads in
         // is_being_async_deflated().
         if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
           // A non-multiple copy atomic (nMCA) machine needs a bigger
           // hammer to make sure that the load above and the loads
           // below all see non-stale memory values.
           OrderAccess::fence();
         } else {
           OrderAccess::loadload();
         }
         if (monitor->is_being_async_deflated()) {
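           // (Completing the snippet: the rest of this block is the same
           // as in the earlier version quoted above.)
           // But we can't safely use the hash if we detect that async
           // deflation has occurred. So we attempt to restore the
           // header/dmw to the object's header so that we only retry
           // once if the deflater thread happens to be slow.
           install_displaced_markword_in_object(obj);
           continue;
         }
         return hash;
       }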


> The other option, if we want to prioritize readability and 
> understandability over optimization, is
> to let the loads involved in the race be read with: 
> NativeAccess<MO_SEQ_CST>::load(&_value)
>
> All sequentially consistent accesses are intuitively guaranteed to 
> have a consistent global ordering,
> as if they happened one after the other, and we can reason about it 
> "easily".
> As the conservative accesses that produce the values we are interested 
> in are also (at least)
> sequentially consistent, we only need to upgrade the loads to be 
> sequentially consistent to solve
> the race here.
>
> The good news with that solution is that it is more intuitive to 
> reason about in the way
> our brains want to, without really having to know anything at all 
> about nMCA machines; that
> is just an implementation detail of how to get seq_cst semantics on 
> such platforms, handled
> by the access bindings.
>
> The bad news is that it will cost two unnecessary fences on PPC in 
> this path, which might be
> quite costly. On TSO platforms, it will boil down to Atomic::load, 
> which is optimal. On AArch64 it
> will yield a load_acquire, which is a bit suboptimal, but not too bad.
>
> I'll leave the choice up to you. But I insist that just a loadload() 
> is not enough for semantic
> correctness of the algorithm, and will indeed cause races at least on 
> PPC, causing different
> threads to get different hash codes.

Again, I wasn't disagreeing with that at all...
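
For my own notes, here is roughly what your seq_cst alternative would
look like, if I understand it. The accessor shapes follow the inline
functions quoted above; whether NativeAccess<MO_SEQ_CST> is the right
spelling for each of these fields (and the contentions() return type) is
my guess, not the patch:

inline markWord ObjectMonitor::header() const {
  // Sequentially consistent load: a plain load on TSO, a load_acquire on
  // AArch64, fenced on PPC, per your note above. Assumes the access API
  // accepts markWord here the same way Atomic::load() does.
  return NativeAccess<MO_SEQ_CST>::load(&_header);
}

inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
  return NativeAccess<MO_SEQ_CST>::load(&_owner) == DEFLATER_MARKER;
}

inline jint ObjectMonitor::contentions() const {
  return NativeAccess<MO_SEQ_CST>::load(&_contentions);
}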


>
>>
>>>
>>>>
>>>>> Thanks,
>>>>> /Erik
>>>>
>>>> Thanks for the thorough review! I always appreciate it when you
>>>> crawl through my code. :-)
>>>
>>> Thanks for moving this mountain closer to jdk-jdk!
>>
>> Hey now! This mountain has become much smaller as we've split things
>> off into sub-task after sub-task... :-)
>
> Indeed. :)
>
> Thanks,
> /Erik

Thanks for the thorough review!

Dan


