RFR(L): 8235795: replace monitor list mux{Acquire,Release}(&gListLock) with spin locks

Daniel D. Daugherty daniel.daugherty at oracle.com
Thu Jan 23 18:31:17 UTC 2020


Hi David,

Thanks for yet another code review on ObjectMonitor stuff!

Sorry for the delay in replying to your review. I was on vacation after
the Oracle Holiday break and it took me a bit to get caught up on stuff
after returning to the office...


On 1/6/20 1:37 AM, David Holmes wrote:
> Hi Dan,
>
> Thanks for this Xmas Eve gift, but I'm afraid it had to wait till 
> after the holidays :) Happy New Year!

You are very welcome! It would have been a good cure for insomnia 
though... :-)

Happy New Year!


> Thanks as always for the extensive writeups on all this to help 
> explain the details.

You're welcome. It is definitely a bit of work to keep it up to date, but
I think it's worth it...


> I have only a few minor comments:

Replies embedded below...


> src/hotspot/share/runtime/synchronizer.cpp
>
>  166 // Note: the om parameter may or may not have been marked originally.
>
> AFAICS there is a single caller of mark_om_ptr and the OM is unmarked. 
> Can you clarify (or point to wiki section) why you may be marking an 
> already marked OM*? To me that implies some kind of nested marking, 
> which would then make unlocking problematic.

Nice catch!!

mark_om_ptr() has a single caller, try_om_lock(), which is itself called
from 4 different places. The 'om' parameter to try_om_lock() may or may
not be locked, but try_om_lock() creates a local 'next' value that is
always unlocked, and that unlocked 'next' value is what gets passed to
mark_om_ptr().

I'll delete L166.
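
For reference, the relationship looks roughly like this (a paraphrase of
the webrev code, not a verbatim copy):

     // Try to lock 'om' by setting OM_LOCK_BIT in its _next_om field.
     static bool try_om_lock(ObjectMonitor* om) {
       // Get the current next field without any OM_LOCK_BIT value; 'next'
       // is always unlocked here so mark_om_ptr() never sees a marked value:
       ObjectMonitor* next = unmarked_next(om);
       return Atomic::cmpxchg(&om->_next_om, next, mark_om_ptr(next)) == next;
     }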


> -- 
>
>  175   ObjectMonitor* next = (ObjectMonitor*)((intptr_t)Atomic::load(&om->_next_om) & ~OM_LOCK_BIT);
>
> Couldn't that be:
>
> ObjectMonitor* next = unmarked_next(om);
>
> ?

Nice catch!

I updated L175 to use unmarked_next() and relocated the unmarked_next()
function above try_om_lock() so things will compile. :-)
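
For completeness, unmarked_next() just strips the lock bit from the loaded
_next_om value; roughly the same expression as old L175, now in one place:

     // Return the unmarked next field in an ObjectMonitor. Note: the next
     // field may or may not have been marked with OM_LOCK_BIT originally:
     static ObjectMonitor* unmarked_next(ObjectMonitor* om) {
       return (ObjectMonitor*)((intptr_t)Atomic::load(&om->_next_om) & ~OM_LOCK_BIT);
     }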


> ---
>
>  211         // The list head changed so we have to retry.
>
> May I suggest for clarity:
>
> // The list head changed before we could lock it,
> // so we have to retry.

Updated the comment (but kept it all one line).


> ---
>
> 1155     // used with with block linkage _next_om fields).
>
> typo: with with

Fixed two instances of 'with with'.


> ---
>
> 1164   if (MonitorUsedDeflationThreshold > 0) {
> 1165     int monitors_used = Atomic::load(&LVars.population) - Atomic::load(&LVars.free_count);
> 1166     int monitor_usage = (monitors_used * 100LL) / Atomic::load(&LVars.population);
>
> This may not matter, but given the fact that the counters are not 
> updated atomically wrt. the lists, we may be using two different 
> values of "population" in that calculation. Or is that not possible 
> because this is executed at a safepoint?

monitors_used_above_threshold() is only called by
ObjectSynchronizer::is_cleanup_needed() which is only called by
SafepointSynchronize::is_cleanup_needed().

SafepointSynchronize::is_cleanup_needed() is only called by
VMThread::no_op_safepoint() which is only called by
VMThread::loop() which is not at a safepoint when it calls
no_op_safepoint().

So monitors_used_above_threshold() can return an incorrect
answer, either a bad 'false' or a bad 'true'. The bad 'false'
means that we'll miss having a cleanup safepoint until the
next check. The bad 'true' means that we'll have an extra
cleanup safepoint when we don't need one.

Neither failure mode is catastrophic. However, this is sloppy
code and we should fetch LVars.population into a local so that
the same value is used in both calculations. That won't solve
all the possible races that could result in a missed cleanup
safepoint or an extra safepoint, but it is an improvement.

I now make one call to Atomic::load(&LVars.population) at the top of
the function and use that local in all three places where we previously
called Atomic::load(&LVars.population).
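
A sketch of the updated function (names as in the webrev; the exact shape
of the surrounding code is my assumption):

     static bool monitors_used_above_threshold() {
       // Fetch the population once so the subtraction and the division
       // below both see the same value:
       int population = Atomic::load(&LVars.population);
       if (population == 0) {
         return false;  // No monitors so nothing is above the threshold.
       }
       if (MonitorUsedDeflationThreshold > 0) {
         int monitors_used = population - Atomic::load(&LVars.free_count);
         // 100LL keeps the multiplication in 64 bits:
         int monitor_usage = (monitors_used * 100LL) / population;
         return monitor_usage > MonitorUsedDeflationThreshold;
       }
       return false;
     }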


> ---
>
> 1401     // We use the simpler lock-mid-as-we-go protocol to prevent races
>
> "simpler" than what? Somewhere (in the code) we are missing a simple 
> description of the chained-locking/hand-over-hand-locking pattern that 
> is being used to allow safe traversal. The comment block at L2275 
> touches on it but I think this needs to be clearly explained in the 
> meta comments for the whole OM lifecycle process. Perhaps some list 
> management notes immediately after the comment block at L1233-L1237 ?

Of course, due to this comment, I took a closer look at
ObjectSynchronizer::om_release() and I figured out that I could greatly
simplify the locking protocol that's used in that function so we no
longer use the lock-mid-as-we-go protocol.

This simplification also fixes a bug in om_release() where this code:
     L1415:         } else {
     L1416:           // mid is locked. Switch cur_mid_in_use's next field to next
     L1417:           // which is safe because we have no parallel list deletions,
     L1418:           // but we leave mid locked:
     L1419:           set_next(cur_mid_in_use, next);
     L1420:         }

can accidentally unlock a cur_mid_in_use that was just locked by a list
walker (see lock_next_for_traversal()). When the list walker calls
om_unlock() we fail:

     L194:   guarantee(((intptr_t)next & OM_LOCK_BIT) == OM_LOCK_BIT, "next=" INTPTR_FORMAT
     L195:             " must have OM_LOCK_BIT=%x set.", p2i(next), OM_LOCK_BIT);

This bug doesn't exist in the 8153224 version of om_release() because
we had to use the lock-cur_mid_in_use-and-mid-as-we-go protocol due to
async deflation. So when I extracted and simplified the logic for this
bug fix (8235795), I introduced the bug (bad Dan!).
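
Spelled out, the race looks like this (a sketch; set_next() is just a
plain store into the _next_om field):

     // Walker thread: locks cur_mid_in_use, i.e., sets OM_LOCK_BIT on
     // cur_mid_in_use->_next_om via lock_next_for_traversal().
     //
     // om_release() thread: unlinks mid with a plain store:
     set_next(cur_mid_in_use, next);  // clobbers _next_om, dropping OM_LOCK_BIT
     //
     // Walker thread: om_unlock(cur_mid_in_use) now fails the L194-L195
     // guarantee because the OM_LOCK_BIT it set is already gone.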


Of course, there is one other place where we used the simpler
lock-mid-as-we-go protocol so I also reexamined deflate_monitor_list().
It turns out that deflate_monitor_list() doesn't race with anything
else when accessing the in-use lists so I was able to drop all of the
locking calls from that code. It's not quite a reversion back to the
baseline version because of other changes, but it is close.

deflate_monitor_list() is only called at a safepoint so it doesn't race
with any non-safepoint audit_and_print_stats() calls. The VMThread handles
deflating the global in-use list by itself. The per-thread in-use lists
are either handled by the VMThread by itself or by worker thread(s) which
each handle a single JavaThread's per-thread in-use list. So there is no
overlap between multiple threads calling deflate_monitor_list() on the same
list. Lastly, the VMThread makes the call to audit_and_print_stats() after
all the deflate_monitor_list() calls have finished so no overlap there.

At this point, I don't know what I would/should write about list management
protocols after L1233-L1237. We no longer have the lock-mid-as-we-go protocol
as the minimum. Each place that does list walking now uses locking in a way
that's very specific to the function that's doing the list walking and what
it might be racing with.

So while I've rewritten the list management locking in both
ObjectSynchronizer::om_release() and deflate_monitor_list(), I haven't
really updated header comments to talk about list management locking
protocols.


> ---
>
> 1525         // s is locked so there must be a racing walker thread ahead
> 1526         // of us so we'll give it a chance to finish.
> 1527         while (is_locked(s)) {
> 1528           os::naked_short_sleep(1);
> 1529         }
>
> So this implies that any walking code is safepoint-check free (or more 
> generally non-blocking). Is that asserted where possible?

If I understand correctly, what you're worried about here is whether
a thread can lock an ObjectMonitor ('s' in this case) and then block
for a safepoint which would cause the thread running this code to loop
forever. This is similar to my worry that an algorithm error can lead
to a thread forgetting to unlock an ObjectMonitor.

I don't really see a way to assert() for this situation other than the
existing safepoint-takes-too-long checks done when SafepointTimeout
is enabled. So by default, the VM would appear to hang here...
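
The only partial idea I have is to scope something like a NoSafepointVerifier
over the region where a walker holds an om lock. I haven't prototyped this,
so consider it an untested sketch (and the placement is hypothetical):

     // In a list walker, while we hold an om lock:
     {
       // NoSafepointVerifier asserts (in debug builds) that this thread
       // performs no safepoint checks while 'nsv' is in scope:
       NoSafepointVerifier nsv;
       // ... walk the locked portion of the list ...
     }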

Do you have a suggestion or two here?


> ---
>
>  1947   OrderAccess::storestore();           // flush inits for worker threads
>
> A storestore barrier doesn't flush anything, it simply maintains an 
> ordering between stores astride the barrier. If you needed to ensure 
> visibility of these updates to the GC worker threads then this would 
> not suffice. But any communication with the GC worker threads (i.e. 
> telling them there is work to be done) requires explicit 
> synchronization anyway, so this barrier should not be needed as it 
> should be subsumed by that existing mechanism.

So this particular "storestore()" call was added at your suggestion
back on 2019.10.22:

On 10/22/19 3:40 AM, David Holmes wrote:
> And again:
>
> 2470 void ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
> 2471   OrderAccess::release_store(&counters->n_in_use, 0); // currently associated with objects
> 2472   OrderAccess::release_store(&counters->n_in_circulation, 0); // extant
> 2473   OrderAccess::release_store(&counters->n_scavenged, 0); // reclaimed (global and per-thread)
> 2474   OrderAccess::release_store(&counters->per_thread_scavenged, 0); // per-thread scavenge total
> 2475   counters->per_thread_times = 0.0; // per-thread scavenge times
> 2476 }
>
> A release_store is not needed for each write. If "counters" is to be 
> published somewhere for another thread to examine then it suffices to 
> publish it with a suitable memory barrier. If you wanted to 
> internalize it you could add a storeStore() barrier at the end. 

My reply to that comment was:

> I'll switch these counter inits to a simple store, e.g.:
>
>     L2471:   counters->n_in_use = 0;  // currently associated with objects
>
> with a single storeStore() after them all or a comment stating which memory
> sync we're relying on. It may be safer for future safepoint work juggling to
> put the storeStore() there rather than relying on a comment and hoping for
> unchanging code... 

so I thought I was doing what you requested back on 2019.10.22...
I added a storestore() barrier as you suggested...

We want to make sure those fields are initialized before the
worker threads start doing their work and atomically incrementing
those fields.
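
To spell out the intent, the current code does plain stores followed by
the single barrier (a sketch; field names as in the webrev):

     counters->n_in_use = 0;              // currently associated with objects
     counters->n_in_circulation = 0;      // extant
     counters->n_scavenged = 0;           // reclaimed (global and per-thread)
     counters->per_thread_scavenged = 0;  // per-thread scavenge total
     counters->per_thread_times = 0.0;    // per-thread scavenge times
     // Order the inits above before any later stores that publish the
     // work to the worker threads:
     OrderAccess::storestore();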

ObjectSynchronizer::prepare_deflate_idle_monitors() is called by
SafepointSynchronize::do_cleanup_tasks() before it uses a group of:

     WorkGang* cleanup_workers = heap->get_safepoint_workers();

to get the work done:

   if (cleanup_workers != NULL) {
     // Parallel cleanup using GC provided thread pool.
     uint num_cleanup_workers = cleanup_workers->active_workers();
     ParallelSPCleanupTask cleanup(num_cleanup_workers, &deflate_counters);
     StrongRootsScope srs(num_cleanup_workers);
     cleanup_workers->run_task(&cleanup);
   } else {

I can't imagine there isn't a lock grab with a proper memory
barrier before we dispatch all the worker threads to do the work...
Of course, if there are no worker threads, then the VMThread
does it all and no memory barrier is needed.

So how should I change this line:

     L1947:   OrderAccess::storestore();           // flush inits for worker threads


> ---
>
> 2047     // The working list is local so no need for a memory barrier.
>
> Unclear what kind of barrier you thought you might need. Did you mean 
> "no need for locking/synchronization"?

This comment appears in three places:

     L1846:     // defined by free_head_p and free_tail_p. The working list is
     L1847:     // local so no need for a memory barrier.
     L1848:     if (*free_head_p == NULL) *free_head_p = mid;

     L1977:     // The working list is local so no need for a memory barrier.
     L1978:     guarantee(free_tail_p != NULL && deflated_count > 0, "invariant");

     L2047:     // The working list is local so no need for a memory barrier.
     L2048:     guarantee(free_tail_p != NULL && deflated_count > 0, "invariant");

I _think_ those comments were added when I had too many load_acquire() and
release_store() calls. So in these three cases, I'm saying that I don't
need memory barriers for load() or store() because these are local lists
that are not visible to other threads.
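
For example, the working list is chained through the _next_om fields with
plain stores because only the calling thread can see it (a paraphrased
sketch, not the verbatim code):

     // Move a deflated 'mid' onto the thread-local working list defined by
     // free_head_p and free_tail_p; plain stores suffice since no other
     // thread can see this list yet:
     if (*free_head_p == NULL) *free_head_p = mid;
     if (*free_tail_p != NULL) {
       set_next(*free_tail_p, mid);
     }
     *free_tail_p = mid;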


> ---
>
> 2189     // With fine grained locks on the monitor lists, it is possible for
> 2190     // log_monitor_list_counts() to return a value that doesn't match
> 2191     // LVars.population. So far a higher value has been seen in testing
> 2192     // so something is being double counted by log_monitor_list_counts().
>
> AFAICS this code is always executed at a safepoint so I don't 
> understand how the counts could be out-of-sync. Ditto for the comments 
> here:
>
> 2315     // With fine grained locks on LVars.free_list, it is possible for an
> 2316     // ObjectMonitor to be prepended to LVars.free_list after we started
> 2317     // calculating chk_om_free_count so LVars.free_count may not
> 2318     // match anymore.
>
> and here:
>
> 2346     // With fine grained locks on the monitor lists, it is possible for
> 2347     // an exiting JavaThread to put its in-use ObjectMonitors on the
> 2348     // global in-use list after chk_om_in_use_count is calculated above.
>
> ???

This header comment was intended to cover that rationale:

     L2152: // This function can be called at a safepoint or it can be called when
     L2153: // we are trying to exit the VM. When we are trying to exit the VM, the
     L2154: // list walker functions can run in parallel with the other list
     L2155: // operations so spin-locking is used for safety.
     L2156: //
     L2157: // Calls to this function can be added in various places as a debugging
     L2158: // aid; pass 'true' for the 'on_exit' parameter to have in-use monitor
     L2159: // details logged at the Info level and 'false' for the 'on_exit'
     L2160: // parameter to have in-use monitor details logged at the Trace level.
     L2161: //
     L2162: void ObjectSynchronizer::audit_and_print_stats(bool on_exit) {

So, no: audit_and_print_stats() and the various chk_*() functions are not
always called at a safepoint; they can also run at a non-safepoint (when
we are exiting the VM).


>
> ---
>
> That's it. :)

As always, thanks for the thorough review.

Dan


>
> Thanks,
> David
>
>
> On 24/12/2019 7:57 am, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I'm extracting another standalone fix from the Async Monitor Deflation
>> project (JDK-8153224) and sending it out for review (and testing)
>> separately.
>>
>>      JDK-8235795 replace monitor list mux{Acquire,Release}(&gListLock) with spin locks
>>      https://bugs.openjdk.java.net/browse/JDK-8235795
>>
>> Here's the webrev URL:
>>
>> http://cr.openjdk.java.net/~dcubed/8235795-webrev/0-for-jdk15/
>>
>> Folks that have reviewed JDK-8153224 will recognize these changes as
>> a subset of the monitor list changes from the Async Monitor Deflation
>> project. It's a subset because the Async Monitor Deflation project
>> needs additional spin locking due to the async deflation work.
>>
>> The OpenJDK wiki for Async Monitor Deflation has several sections
>> dedicated to the Spin-Lock Monitor List Management changes. This
>> link will get you to the first section:
>>
>> Spin-Lock Monitor List Management In Theory
>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation#AsyncMonitorDeflation-Spin-LockMonitorListManagementInTheory 
>>
>>
>> The remaining monitor list sections are:
>>
>> Background: ObjectMonitor Movement Between the Lists
>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation#AsyncMonitorDeflation-Background:ObjectMonitorMovementBetweentheLists 
>>
>>
>> Spin-Lock Monitor List Management In Reality
>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation#AsyncMonitorDeflation-Spin-LockMonitorListManagementInReality 
>>
>>
>> Using The New Spin-Lock Monitor List Functions
>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation#AsyncMonitorDeflation-UsingTheNewSpin-LockMonitorListFunctions 
>>
>>
>> Of course, the OpenJDK wiki content is specific to the Async Monitor
>> Deflation project, but this extract is a very close subset.
>>
>> These changes have been tested in various Mach5 Tier[1-7] runs.
>> I'm also doing SPECjbb2015 runs.
>>
>> Thanks, in advance, for comments, questions or suggestions.
>>
>> Dan


