RFR(L) 8153224 Monitor deflation prolong safepoints (CR11/v2.11/14-for-jdk15)

Erik Österlund erik.osterlund at oracle.com
Thu May 14 10:44:27 UTC 2020


Hi Dan,

On 2020-05-13 19:49, Daniel D. Daugherty wrote:
> Thanks for not insisting here. If I drop that NULL check, I'll have to
> validate it with another multi-day round of stress testing since this
> quick_enter() function has been shown to be racy in more than one round
> of testing...
>
> On the plus side, relative to v2.10, we got rid of:
>
>     old L495:       ObjectMonitorHandle omh;
>     old L496:       if (!omh.save_om_ptr(obj, mark)) {
>
> Both of which showed up in Eric C's perf testing where he was analyzing
> the DaCapo-h2 perf degradation in v2.10...

Okay.

>
>>
>>>
>>>> I also noticed that when entering a monitor, you help async 
>>>> deflation finish unlinking of the mark
>>>> word, so that a re-try will make progress, and not act as a spin 
>>>> lock, waiting for the deflating thread
>>>> to make progress. That seems great.
>>>
>>> Right. That used to happen with the ObjectMonitorHandle helper via the
>>> associated save_om_ptr() call. Carsten had similar code to what I have
>>> now in ObjectMonitor::enter() in his prototype.
>>>
>>> So when I took out ObjectMonitorHandle, I added back the direct call to
>>> install_displaced_markword_in_object() in ObjectMonitor::enter()...
>>
>> Okay.
>>
>>>
>>>> However, when a hashCode is computed, you just re-try instead,
>>>> without helping deflation along the way. It seems to me this 
>>>> creates a similar spin lock kind of
>>>> situation, where the thread computing a hashCode will have to wait 
>>>> quite a bit if the deflating
>>>> thread gets preempted at the wrong time.
>>>
>>> Again, that used to happen with the ObjectMonitorHandle helper via the
>>> associated save_om_ptr() call. Obviously when I took out 
>>> ObjectMonitorHandle,
>>> I had to decide how to handle the race. I chose to simply retry.
>>>
>>> In Carsten's prototype, he had some logic that tried to convert the
>>> object's header from an ObjectMonitor* ref to the 'dmw' with the
>>> hashcode value included. That logic depended on the marking of the
>>> dmw that used to be done by install_displaced_markword_in_object().
>>>
>>> I chose to simplify the interaction of async deflation and hashcode
>>> installation races. That allowed me to drop the marking of 'dmw' in
>>> the ObjectMonitor in install_displaced_markword_in_object() which
>>> meant one less write to ObjectMonitor's _header/dmw field. It also
>>> meant dropping the reads necessary to see if the mark bit was set.
>>
>> That sounds good. I was not a fan of exposing is_marked() markWords 
>> outside of safepoints
>> as they are inherently not designed to ever be outside of safepoints. 
>> There are checks that no
>> longer work, such as has_monitor() returning true for all marked 
>> markWords regardless of whether
>> they encode a monitor or not. So yeah, happy we won't have to deal 
>> with convincing ourselves
>> that the marked markWords are never exposed to code that will subtly 
>> break.
>
> But, that was so much fun for you and I to verify in one of the previous
> bug hunts! I live for chasing possible leakage of markWords into other
> places in the VM :-)

:)

> More seriously, I agree that dropping the marking of the dmw/header field
> in ObjectMonitor greatly simplifies analysis of this set of changes!!

Yup!

>
>>
>>> Here's where I retry in FastHashCode:
>>>
>>>     L1099:       if (monitor->is_being_async_deflated()) {
>>>     L1100:         // If we detect that async deflation has 
>>> occurred, then we
>>>     L1101:         // simply retry so that the hash value can be 
>>> stored in either
>>>     L1102:         // the object's header or in the re-inflated 
>>> ObjectMonitor's
>>>     L1103:         // header as appropriate.
>>>     L1104:         continue;
>>>     L1105:       }
>>>
>>> First, I just realized that I forgot to check AsyncDeflateIdleMonitors
>>> before that call to is_being_async_deflated()... oops.
>>
>> Yeah a spontaneous thought is that maybe is_being_async_deflated 
>> should check if !AsyncDeflateIdleMonitors
>> and then return false. It reads quite intuitively then that if 
>> AsyncDeflateIdleMonitors is disabled, then it is
>> obviously also not being async deflated.
>
> Well, I could do that and it does make sense. Currently all calls to
> is_being_async_deflated() are now guarded with AsyncDeflateIdleMonitors
> so moving the check inside is_being_async_deflated() doesn't invalidate
> my current testing...

Nice.
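
For concreteness, the folded-in shape I have in mind would be something
along these lines (just a sketch of the suggestion, not the actual patch):

inline bool ObjectMonitor::is_being_async_deflated() {
  if (!AsyncDeflateIdleMonitors) {
    // Async deflation is disabled, so nothing can be async deflated.
    return false;
  }
  return owner_is_DEFLATER_MARKER() && contentions() < 0;
}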

> If Async Monitor Deflation makes it into JDK15, then I plan to remove
> AsyncDeflateIdleMonitors and friends early in JDK16...

That would be great!

>
>>
>>> Second, I have a good object ref there ('obj') so I could easily add a
>>> call to "install_displaced_markword_in_object(obj)" before the continue
>>> and we would only retry once. That would get us out of the spin lock
>>> like situation. What do you think?
>>
>> That sounds okay. The more general issue though is that everywhere 
>> there is a retry, we have
>> to remember to call install_displaced_markword_in_object somewhere or 
>> there will be issues.
>
> If by "issues" you mean we might do a spin loop, then I agree. I don't
> think there are correctness issues...

Then we are on the same page.

>
>> I guess there are not that many places though in the end, so maybe 
>> just sprinkling in another
>> install_displaced_markword_in_object is fine.
>
> Definitely fewer calls to install_displaced_markword_in_object() than we
> used to have with ObjectMonitorHandle and save_om_ptr(). I believe we 
> have
> a total of four call sites in the current code. In v2.10, we only had two
> call sites, but save_om_ptr() was called from many more places...

Yeah, I don't think it's too bad.

>
>>
>>>
>>>> Perhaps one way of dealing with this is to move the async monitor
>>>> deflation helping from entering the monitor to 
>>>> ObjectSynchronizer::inflate. When inflate() finds that
>>>> the header has a monitor, now it just returns that monitor. Perhaps 
>>>> it could instead check if it is being
>>>> deflated, help with deflation, and re-inflate a new monitor. That 
>>>> way, any code that grabs a monitor,
>>>> tries to do something with it, and fails, can just retry, without 
>>>> having to deal with the re-try not
>>>> making any progress unless manually helping async deflation. Just 
>>>> an idea.
>>>
>>> Definitely something to mull on, but not for this fix. Keep in mind 
>>> that
>>> the async deflation could happen just after the call to inflate() here
>>> on L588:
>>>
>>> src/hotspot/share/runtime/synchronizer.cpp:
>>>
>>>     L551: void ObjectSynchronizer::enter(Handle obj, BasicLock* 
>>> lock, TRAPS) {
>>> <snip>
>>>     L587:   while (true) {
>>>     L588:     ObjectMonitor* monitor = inflate(THREAD, obj(), 
>>> inflate_cause_monitor_enter);
>>>     L589:     if (monitor->enter(THREAD)) {
>>>
>>> and before L297 in the ObjectMonitor::enter() call we make on L589 
>>> above:
>>>
>>> src/hotspot/share/runtime/objectMonitor.cpp:
>>>
>>>     L243: bool ObjectMonitor::enter(TRAPS) {
>>> <snip>
>>>     L297:   Atomic::inc(&_contentions);
>>>     L298:   if (AsyncDeflateIdleMonitors && 
>>> is_being_async_deflated()) {
>>>     L299:     // Async deflation is in progress and our contentions 
>>> increment
>>>     L300:     // above lost the race to async deflation. Undo the 
>>> work and
>>>     L301:     // force the caller to retry.
>>>     L302:     const oop l_object = (oop)object();
>>>     L303:     if (l_object != NULL) {
>>>     L304:       // Attempt to restore the header/dmw to the object's 
>>> header so that
>>>     L305:       // we only retry once if the deflater thread happens 
>>> to be slow.
>>>     L306: install_displaced_markword_in_object(l_object);
>>>
>>> So we still need the check in ObjectMonitor::enter() even if we add
>>> one in inflate(). Of course, when we had ref_counting, we didn't have
>>> this problem because the > 0 ref_count would have kept that async
>>> deflation from landing the way that I described here.
>>
>> Yes. The idea was that potentially failing things like enter and 
>> hashCode only have to check
>> if it is async deflating, and then fail, without having to remember 
>> that failing followed by retry
>> would need another manual call somewhere to 
>> install_displaced_markword_in_object; at least
>> that part would be done for you, and you'd only have to return false 
>> and say it failed. But
>> again, there are kind of few places this happens so I am okay with a 
>> less generic solution.
>
> I don't see how we can do a generic solution that works (in one place).
> We had a generic solution when we had ObjectMonitorHandle, ref_counting
> and save_om_ptr(), but we had to give that up due to the big impact on
> performance by the ref_counting... (sad now :-( )

The generic solution I had in mind was to put it into inflate(). All 
places where we retry call inflate() again
in the retry path before ending up in what would be a spin loop. But we 
can drop that. Who knows,
maybe poking around at all those lock-free fields could cause 
regressions. Since the number of places
where we perform these looping retries is rather limited, I don't think 
we need to investigate this for now.
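
Just to make the idea concrete (and purely as an illustration, using the
names from the snippets quoted in this thread), helping in inflate()
could have looked roughly like this:

ObjectMonitor* ObjectSynchronizer::inflate(Thread* self, oop object,
                                           const InflateCause cause) {
  for (;;) {
    const markWord mark = object->mark();
    if (mark.has_monitor()) {
      ObjectMonitor* m = mark.monitor();
      if (AsyncDeflateIdleMonitors && m->is_being_async_deflated()) {
        // Help the deflater thread finish unlinking the mark word so the
        // next iteration sees a neutral header and re-inflates.
        m->install_displaced_markword_in_object(object);
        continue;
      }
      return m;
    }
    // ... the existing stack-lock and neutral-header inflation paths ...
  }
}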

>
>>
>>>
>>>> Related to the hashCode stuff:
>>>>
>>>> 1044     } else if (mark.has_monitor()) {
>>>> 1045       // The first stage of a racing async deflation won't 
>>>> affect the
>>>> 1046       // hash value if this ObjectMonitor happens to already 
>>>> have one.
>>>> 1047       monitor = mark.monitor();
>>>> 1048       temp = monitor->header();
>>>> 1049       assert(temp.is_neutral(), "invariant: header=" 
>>>> INTPTR_FORMAT, temp.value());
>>>> 1050       hash = temp.hash();
>>>> 1051       if (hash != 0) {                  // if it has a hash, 
>>>> just return it
>>>> 1052         return hash;
>>>> 1053       }
>>>> 1054       // Fall thru so we only have one place that installs the 
>>>> hash in
>>>> 1055       // the ObjectMonitor.
>>>>
>>>> The comment says that if the hashCode has been set (not 0), then it 
>>>> is safe to use it without any weird
>>>> async deflation interactions. I am not convinced that is true. In 
>>>> the previous version of this, you had
>>>> a way of poisoning the displaced mark word during deflation, so 
>>>> that the CAS installing a new hashCode
>>>> would fail. But now with this new approach, that code is gone 
>>>> (unless I missed something).
>>>
>>> Actually in the previous version of this, the ref_count would have kept
>>> async deflation from interfering with saving the hash code value in the
>>> ObjectMonitor's _header/dmw field. The logic was different in much 
>>> older
>>> versions of the code...
>>
>> Ah yes. I recalled the even older logic. This has been going on for a 
>> while. :)
>
> Yes, yes it has... I have been living with this code for much of my time
> for the last year plus... :-|
>
>
>>
>>> The reason the above is safe is because the hash code value that we 
>>> found
>>> in the ObjectMonitor will be the same hash code value that will be 
>>> restored
>>> to the object's header by the deflation process.
>>>
>>> Or at least that was my thinking when I wrote that comment... more 
>>> below...
>>> (Yes, I was wrong!)
>>
>> :)
>>
>>>
>>>> I think this
>>>> can yield an unfortunate race when two threads call hashCode() 
>>>> while it is being deflated.
>>>>
>>>> JT1: load monitor from object
>>>> JT2: load monitor from object
>>>> ServiceThread: deflate monitor
>>>> JT1: Check if there is an existing hashCode. There is not.
>>>> JT1: Install new displaced mark word successfully with new non-zero 
>>>> hashCode
>>>> JT1: Figure out that this monitor has been deflated, and retry
>>>> JT2: Check if there is an existing hashCode. There is; the 
>>>> discarded result of JT1.
>>>> JT1: Install new hashCode, not knowing that JT1 caught the old one.
>>> Correction: you mean 'JT2 caught the old one' here.
>>>
>>>
>>>> Now JT1 and JT2 have acquired different values for hashCode.
>>>
>>> Definitely an interesting scenario. Sigh... this is yet another race 
>>> that
>>> wasn't possible with ref_counting. Grumble, grumble, grumble... :-)
>>
>> Yes, that was nice with the ObjectMonitorHandle abstraction.
>
> One of the "problems" I've run into with back tracking away from
> ObjectMonitorHandles and the ref_counting is that I removed/changed
> code that no longer had to worry about races because ref_counting
> made things _safe_.
>
> I've had to very carefully compare the baseline to v2.11 and make
> sure that any code that I removed wasn't needed for race safety.
> It has been painful...
>
> We had ObjectMonitorHandle and ref_counting for most of this project
> so getting out of that "of course, things are safe" mindset is hard.

Yes, I also miss that aspect of the handles. But we can live without them.

>
>>
>>>
>>>> Maybe one could poke the contended counter,
>>>> or find a convenient way of poisoning the displaced mark word so 
>>>> that the CAS installing it on JT1 fails.
>>>> Anyway, there is some thinking to be done here I think.
>>>
>>> Agreed. More mulling is/was needed. Here's the revised block:
>>>
>>>     } else if (mark.has_monitor()) {
>>>       monitor = mark.monitor();
>>>       temp = monitor->header();
>>>       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, 
>>> temp.value());
>>>       hash = temp.hash();
>>>       if (hash != 0) {
>>>         // It has a hash.
>>>         if (AsyncDeflateIdleMonitors && 
>>> monitor->is_being_async_deflated()) {
>>>           // But we can't safely use the hash if we detect that async
>>>           // deflation has occurred. So we attempt to restore the
>>>           // header/dmw to the object's header so that we only retry
>>>           // once if the deflater thread happens to be slow.
>>>           install_displaced_markword_in_object(obj);
>>>           continue;
>>>         }
>>>         return hash;
>>>       }
>>
>> Yeah, something like that should do the trick. Although, 
>> unfortunately I think we have to sprinkle some awkward
>> fencing in there to make sure the load of the displaced mark word and 
>> the loads in is_being_async_deflated()
>> are not reordered.
>
> So is_being_async_deflated() looks like this right now:
>
> inline bool ObjectMonitor::is_being_async_deflated() {
>   return AsyncDeflateIdleMonitors && owner_is_DEFLATER_MARKER() && 
> contentions() < 0;
> }
>
> contentions() uses an Atomic::load() as of the v2.11 patch. However,
> owner_is_DEFLATER_MARKER() does not:
>
> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>   return _owner == DEFLATER_MARKER;
> }
>
> So I'm going to change it to:
>
> inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
>   return Atomic::load(&_owner) == DEFLATER_MARKER;
> }

Good catch.

> It should have been that way when I created owner_is_DEFLATER_MARKER(),
> but I was sloppy. Bad Dan!
>
> Here's how the dmw/header field is fetched:
>
> inline markWord ObjectMonitor::header() const {
>   return Atomic::load(&_header);
> }
>
>
> So if all three of the players (_header, _owner, _contentions) are
> loaded with Atomic::load(), doesn't that keep them from being
> reordered?

No, it only prevents word tearing and makes us feel a bit better when we 
sleep at night,
knowing that we are expressing this in the semantically right way, rather 
than relying on volatile.
But we need a bigger hammer than that here.

> Thinking more, that should keep the loads() from being reordered
> relative to each other, but I haven't done anything with respect
> to the stores... sigh...

The stores that move the monitor towards being async deflated are all 
updated with conservatively
ordered atomics, which is fine.

>
>> In other places, I have seen that it is being used after an atomic 
>> operation, hence not needing
>> further fencing. But here there are only loads involved. They will 
>> not have a consistent global ordering, and the lack
>> thereof can lead us back to the race still happening. The intuitive 
>> solution would be to put in an OrderAccess::loadload()
>> right before asking if monitor->is_being_async_deflated(). However, a 
>> loadload on a non-multiple copy atomic (nMCA)
>> machine does not protect against the loads happening in an 
>> inconsistent order, when the stores are performed
>> on different threads. In this case, if the current thread is A, then 
>> the displaced mark word could be set by a thread
>> B, and the state in is_being_async_deflated() could be updated by a 
>> thread C. Let's say there is a fourth thread,
>> D, that ends up in the same situation as thread A, then on a nMCA 
>> machine, even if there is a loadload() before calling
>> is_being_async_deflated(), the ordering of the two loads could be 
>> different on thread A and thread D. This is the famous
>> Independent Reads of Independent Writes (IRIW) scenario that is 
>> widely believed to not matter in practice as there are
>> barely any algorithms where it matters. Here it does, because thread 
>> A might think that the racy hashCode is okay,
>> while thread D says it is not okay. That means that one of the two 
>> reading threads may get an inconsistent hashCode
>> value, unless there is a full fence before is_being_async_deflated(), 
>> depending on what side of the IRIW race they
>> end up in. We need them both to always think it is *not* okay, and 
>> hence end up on the same side of the IRIW race.
>>
>> In order to synchronize this appropriately on a nMCA machine, you 
>> need a full fence() (instead of a loadload) between
>> the load of the displaced mark word, and the loads in 
>> is_being_async_deflated(). The fence will make sure that if you
>> read the updated (non-zero hashCode) displaced markWord, then it will 
>> globally sync the stale memory view (caches) such that when
>> the loads in is_being_async_deflated() are being executed, they 
>> observe non-stale values. Normally you would
>> need MO_SEQ_CST loads for all of these loads to really be able to 
>> reason about its correctness in a sane way, but since
>> we have the domain knowledge that we only get into trouble in the
>> flavour of the race where the DMW is not stale while the
>> is_being_async_deflated() values are stale. So while a sane model would 
>> have leading fences on all the loads involved, we
>> only need one. On normal platforms a loadload() is sufficient. You 
>> can check which one you are running on like this:
>>
>> // Long incomprehensible comment describing what on earth the point
>> // of this synchronization voodoo is
>> if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
>>   OrderAccess::fence();
>> } else {
>>   OrderAccess::loadload();
>> }
>>
>> Or we could go for a different solution that is easier to reason 
>> about, if you prefer that.
>
> Based on what you're saying here, the first call to 
> is_being_async_deflated()
> in FastHashCode() needs to have an OrderAccess::loadload() barrier to 
> ensure
> proper ordering.

Kind of. But I'm also saying that on e.g. PPC that is not enough and 
will still fail. There it needs a fence(),
hence the filter.

> The is_being_async_deflated() call in ObjectMonitor::enter() is okay 
> because
> it follows inc_contentions() which has a barrier. The second call to
> is_being_async_deflated() in FastHashCode should be okay because it
> is preceded by an Atomic::cmpxchg() call...
>
> For now, I'm going with this:
>
>       temp = monitor->header();
>       assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, 
> temp.value());
>       hash = temp.hash();
>       if (hash != 0) {
>         // It has a hash.
>
>         // Separate load of dmw/header above from the loads in
>         // is_being_async_deflated(). Has to be re-examined for a
>         // non-multiple copy atomic (nMCA) machine.
>         OrderAccess::loadload();
>         if (monitor->is_being_async_deflated()) {
>
> I've seen discussions about nMCA machines and the Independent Reads of 
> Independent
> Writes (IRIW) scenario, but I don't know enough about it yet to be 
> comfortable
> with putting an alternate fence() call in there...

It will not work correctly unless we do that.
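
In other words, something along these lines (a sketch pieced together
from the snippets above, not the final code):

      temp = monitor->header();
      assert(temp.is_neutral(), "invariant: header=" INTPTR_FORMAT, temp.value());
      hash = temp.hash();
      if (hash != 0) {
        // It has a hash.

        // Separate the load of dmw/header above from the loads in
        // is_being_async_deflated(). A loadload() is not enough on a
        // non-multiple copy atomic (nMCA) machine (IRIW), so a full
        // fence() is needed there.
        if (support_IRIW_for_not_multiple_copy_atomic_cpu) {
          OrderAccess::fence();
        } else {
          OrderAccess::loadload();
        }
        if (monitor->is_being_async_deflated()) {
          // Async deflation interfered; restore the object's header and
          // retry so that racing threads agree on the hash.
          install_displaced_markword_in_object(obj);
          continue;
        }
        return hash;
      }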

The other option, if we want to prioritize readability and 
understandability over optimization, is
to let the loads involved in the race be read with: 
NativeAccess<MO_SEQ_CST>::load(&_value)

All sequentially consistent accesses are intuitively guaranteed to have 
a consistent global ordering,
as if they happened one after the other, and we can reason about it 
"easily".
As the conservative accesses that produce the values we are interested 
in are also (at least)
sequentially consistent, we only need to upgrade the loads to be 
sequentially consistent to solve
the race here.
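
To make that concrete, a minimal sketch (my assumption of the shape, not
a tested patch) of upgrading the racy loads could look like this:

// Sketch only: the loads that participate in the race become sequentially
// consistent; field names follow the snippets quoted earlier in this thread.
inline bool ObjectMonitor::owner_is_DEFLATER_MARKER() {
  return NativeAccess<MO_SEQ_CST>::load(&_owner) == DEFLATER_MARKER;
}

inline jint ObjectMonitor::contentions() const {
  return NativeAccess<MO_SEQ_CST>::load(&_contentions);
}

// The load of the dmw/_header in FastHashCode would need the same treatment.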

The good news with that solution is that it is more intuitive to reason 
about in the way
our brains want to, without really having to know anything at all about 
nMCA machines; that
is just an implementation detail of how to get seq_cst semantics on such 
platforms, handled
by the access bindings.

The bad news is that it will cost two unnecessary fences on PPC in this 
path, which might be
quite costly. On TSO platforms, it will boil down to Atomic::load, which 
is optimal. On AArch64 it
will yield a load_acquire, which is a bit suboptimal, but not too bad.

I'll leave the choice up to you. But I insist that just a loadload() is 
not enough for semantic
correctness of the algorithm, and will indeed cause races at least on 
PPC, causing different
threads to get different hash codes.

>
>>
>>>
>>>> Another thing I noticed...
>>>>
>>>> 1248       if (!mid->is_free() && mid->object() != NULL) {
>>>> 1249         // Only process with closure if the object is set.
>>>> 1250
>>>> 1251         // monitors_iterate() is only called at a safepoint or 
>>>> when the
>>>> 1252         // target thread is suspended or when the target 
>>>> thread is
>>>> 1253         // operating on itself. The closures are only 
>>>> interested in an
>>>> 1254         // owned ObjectMonitor and ownership cannot be dropped 
>>>> under the
>>>> 1255         // calling contexts so the ObjectMonitor cannot be 
>>>> async deflated.
>>>> 1256         closure->do_monitor(mid);
>>>> 1257       }
>>>>
>>>> Your new comment says that this is safe with the pre-condition that 
>>>> the user of the API filters away
>>>> anything that is not an owned monitor. If that is true (I presume 
>>>> it is), then it seems that the new
>>>> mid->is_free() check should not be needed either. An owned monitor 
>>>> will never be mid->is_free(). Also,
>>>> clear_using_JT both clears the owner and sets the state to free.
>>>
>>> Correction: clear_using_JT() clears the object field and sets the 
>>> state to free.
>>
>> Ah yes. I meant that, I just wrote something else that started with 
>> 'o'. Oh dear...
>>
>>>
>>>> So the null check would also filter
>>>> away monitors that have been freed. But then again, using monitors 
>>>> that have been freed is not safe for
>>>> any user of this API.
>>>
>>> You are right that the "!mid->is_free()" part of the check is not 
>>> needed
>>> so I'll remove that. The baseline code did not call 
>>> closure->do_monitor(mid)
>>> when the object field was not set and that is sufficient. I'm basing 
>>> that
>>> on the fact that clear_using_JT() (renamed to clear_common()) 
>>> clears the
>>> object field before setting the state to free.
>>
>> Great.
>
> Robbin spotted another use of is_free() that wasn't part of debug code so
> I've also fixed that one.

Good.

>
>>
>>> This part of the comment:
>>>
>>>     L1253:                                 The closures are only 
>>> interested in an
>>>     L1254:         // owned ObjectMonitor and ownership cannot be 
>>> dropped under the
>>>     L1255:         // calling contexts so the ObjectMonitor cannot 
>>> be async deflated.
>>>
>>> is meant to describe how the closures passed to the current callers of
>>> monitors_iterate() work today and why they aren't a problem for async
>>> monitor deflation. I wasn't trying to claim that monitor_iterate()
>>> required that all closures only care about owned monitors.
>>>
>>> I tweaked the comment a little bit:
>>>
>>>         //                      The current closures in use today are
>>>         // only interested in an owned ObjectMonitor and ownership
>>>         // cannot be dropped under the calling contexts so the
>>>         // ObjectMonitor cannot be async deflated.
>>
>> Excellent.
>>
>>>
>>>> Perhaps that should be removed, and the name of the function 
>>>> ObjectSynchronizer::monitors_iterate should
>>>> change to owned_monitors_iterate(Thread* owner), where owner == 
>>>> NULL implies all owned monitors, which
>>>> is only valid in a safepoint. This makes the API safer to use, and 
>>>> supports only what we can safely
>>>> support, instead of handing off non-sticky monitors, hoping the 
>>>> arbitrary user closure will know
>>>> what monitors it can and can not use safely.
>>>
>>> For now, I think what we have is safe with respect to async deflation.
>>> I don't want to redesign ObjectSynchronizer::monitors_iterate() as part
>>> of this bug fix. I can see filing an RFE to follow up with additional
>>> work on ObjectSynchronizer::monitors_iterate().
>>
>> Sounds fine, we can fix that another day.
>
> Thanks for agreeing. I'll post when I file the new RFE...

Great.
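
For the record, the rough API shape I was thinking of for that RFE would
be something like this (hypothetical, of course):

// Hypothetical sketch: iterate only over owned monitors; a NULL owner
// means all owned monitors, which is only valid at a safepoint.
void ObjectSynchronizer::owned_monitors_iterate(MonitorClosure* closure,
                                                Thread* owner) {
  assert(owner != NULL || SafepointSynchronize::is_at_safepoint(),
         "iterating all owned monitors is only safe at a safepoint");
  // For each in-use ObjectMonitor 'mid' (details elided):
  //   if (mid->object() != NULL &&
  //       (owner == NULL || mid->owner() == owner)) {
  //     closure->do_monitor(mid);
  //   }
}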

>
>>
>>>
>>>> Thanks,
>>>> /Erik
>>>
>>> Thanks for the thorough review! I always appreciate it when you
>>> crawl through my code. :-)
>>
>> Thanks for moving this mountain closer to jdk-jdk!
>
> Hey now! This mountain has become much smaller as we've split things
> off into sub-task after sub-task... :-)

Indeed. :)

Thanks,
/Erik

