RFR(L): 8235795: replace monitor list mux{Acquire,Release}(&gListLock) with spin locks

Daniel D. Daugherty daniel.daugherty at oracle.com
Fri Jan 24 16:48:09 UTC 2020


Hi David,

Replies embedded below...


On 1/23/20 9:09 PM, David Holmes wrote:
> Hi Dan,
>
> A few responses inline, with trimming ...
>
> On 24/01/2020 4:31 am, Daniel D. Daugherty wrote:
>> On 1/6/20 1:37 AM, David Holmes wrote:
>>> ---
>>>
>>> 1401     // We use the simpler lock-mid-as-we-go protocol to prevent races
>>>
>>> "simpler" than what? Somewhere (in the code) we are missing a simple 
>>> description of the chained-locking/hand-over-hand-locking pattern 
>>> that is being used to allow safe traversal. The comment block at 
>>> L2275 touches on it but I think this needs to be clearly explained 
>>> in the meta comments for the whole OM lifecycle process. Perhaps 
>>> some list management notes immediately after the comment block at 
>>> L1233-L1237 ?
>>
>> Of course, due to this comment, I took a closer look at
>> ObjectSynchronizer::om_release() and I figured out that I could greatly
>> simplify the locking protocol that's used in that function so we no
>> longer use the lock-mid-as-we-go protocol.
>>
>> This simplification solves a bug in om_release() where this call:
>>
>>      L1415:         } else {
>>      L1416:           // mid is locked. Switch cur_mid_in_use's next field to next
>>      L1417:           // which is safe because we have no parallel list deletions,
>>      L1418:           // but we leave mid locked:
>>      L1419:           set_next(cur_mid_in_use, next);
>>      L1420:         }
>>
>> can accidentally unlock a cur_mid_in_use that was just locked by a list
>> walker (see lock_next_for_traversal()). When the list walker calls
>> om_unlock() we fail:
>>
>>      L194:   guarantee(((intptr_t)next & OM_LOCK_BIT) == OM_LOCK_BIT, "next=" INTPTR_FORMAT
>>      L195:             " must have OM_LOCK_BIT=%x set.", p2i(next), OM_LOCK_BIT);
>>
>> This bug doesn't exist in the 8153224 version of om_release() because
>> we had to use the lock-cur_mid_in_use-and-mid-as-we-go protocol due to
>> async deflation. So when I extracted and simplified the logic for this
>> bug fix (8235795), I introduced the bug (bad Dan!).
>>
>>
>> Of course, there is one other place where we used the simpler
>> lock-mid-as-we-go protocol so I also reexamined deflate_monitor_list().
>> It turns out that deflate_monitor_list() doesn't race with anything
>> else when accessing the in-use lists so I was able to drop all of the
>> locking calls from that code. It's not quite a reversion back to the
>> baseline version because of other changes, but it is close.
>>
>> deflate_monitor_list() is only called at a safepoint so it doesn't race
>> with any non-safepoint audit_and_print_stats() calls. The VMThread 
>> handles
>> deflating the global in-use list by itself. The per-thread in-use lists
>> are either handled by the VMThread by itself or by worker thread(s) 
>> which
>> each handle a JavaThread's per-thread in-use list by itself. So there is
>> no overlap between multiple threads calling deflate_monitor_list() on the
>> same list. Lastly, the VMThread makes the call to audit_and_print_stats()
>> after all the deflate_monitor_list() calls have finished so no overlap there.
>>
>> At this point, I don't know what I would/should write about list 
>> management
>> protocols after L1233-L1237. We no longer have the lock-mid-as-we-go 
>> protocol
>> as the minimum. Each place that does list walking now uses locking in 
>> a way
>> that's very specific to the function that's doing the list walking 
>> and what
>> it might be racing with.
>>
>> So while I've rewritten the list management locking in both
>> ObjectSynchronizer::om_release() and deflate_monitor_list(), I haven't
>> really updated header comments to talk about list management locking
>> protocols.
>
> Okay, I'll need to reconsider that when I see the updated code.

Stress testing the updated code now. Unfortunately I had an hour-long
power outage yesterday that killed the stress runs I had going... :-(
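
For reference while re-reviewing the list management code: the per-ObjectMonitor
spin lock is just the low bit (OM_LOCK_BIT) of the monitor's next field, and a
list walker moves down a list hand-over-hand. The shape of it is roughly this
(a self-contained sketch, not the actual webrev code; some helper names appear
in the patch -- om_unlock(), lock_next_for_traversal(), OM_LOCK_BIT -- but the
rest of the names and all of the bodies are only illustrative):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    const intptr_t OM_LOCK_BIT = 0x1;

    struct OM {                          // stand-in for ObjectMonitor
      std::atomic<intptr_t> _next_om;    // next pointer with the lock bit in bit 0
    };

    static OM* unmarked_next(OM* om) {
      return (OM*)(om->_next_om.load() & ~OM_LOCK_BIT);   // strip the lock bit
    }

    static bool try_om_lock(OM* om) {
      // Lock 'om' by atomically setting the low bit of its next field.
      intptr_t unlocked = om->_next_om.load() & ~OM_LOCK_BIT;
      return om->_next_om.compare_exchange_strong(unlocked, unlocked | OM_LOCK_BIT);
    }

    static void om_lock(OM* om) {
      while (!try_om_lock(om)) { /* spin */ }
    }

    static void om_unlock(OM* om) {
      om->_next_om.fetch_and(~OM_LOCK_BIT);                // clear the lock bit
    }

    // Hand-over-hand: lock the next node before unlocking the current one so a
    // walker always holds a lock somewhere on the list while it advances.
    static OM* lock_next_for_traversal(OM* cur) {
      OM* next = unmarked_next(cur);
      if (next != NULL) {
        om_lock(next);
      }
      om_unlock(cur);
      return next;
    }

The om_release() bug above is exactly the hazard in this scheme: set_next()
stores a raw (unmarked) next value, so it silently clears a lock bit that a
walker has just set on cur_mid_in_use.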


>
>>> ---
>>>
>>> 1525         // s is locked so there must be a racing walker thread ahead
>>> 1526         // of us so we'll give it a chance to finish.
>>> 1527         while (is_locked(s)) {
>>> 1528           os::naked_short_sleep(1);
>>> 1529         }
>>>
>>> So this implies that any walking code is safepoint-check free (or 
>>> more generally non-blocking). Is that asserted where possible?
>>
>> If I understand correctly, what you're worried about here is whether
>> a thread can lock an ObjectMonitor ('s' in this case) and then block
>> for a safepoint which would cause the thread running this code to loop
>> forever. This is similar to my worry that an algorithm error can lead
>> to a thread forgetting to unlock an ObjectMonitor.
>>
>> I don't really see a way to assert() for this situation other than the
>> existing safepoint-takes-too-long checks done when SafepointTimeout
>> is enabled. So by default, the VM would appear to hang here...
>>
>> Do you have a suggestion or two here?
>
> NoSafepointVerifier in the walking code paths?

I'll take a look at adding those...
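
Something like this in each walker, I think (just a sketch of the shape, reusing
the OM helpers from the sketch earlier in this mail; chk_in_use_entry() is a
made-up placeholder for the per-node work):

    static void walk_in_use_list(OM* list_head) {
      if (list_head == NULL) {
        return;
      }
      // In debug builds, assert that no safepoint check happens while this
      // thread might be holding an ObjectMonitor spin lock, so a blocked
      // walker can't leave another thread spinning forever at L1527.
      NoSafepointVerifier nsv;
      om_lock(list_head);                      // spin-lock the first node
      for (OM* mid = list_head; mid != NULL; ) {
        chk_in_use_entry(mid);                 // hypothetical per-node work
        mid = lock_next_for_traversal(mid);    // lock next node, unlock this one
      }
    }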


>
>>
>>> ---
>>>
>>> 1947   OrderAccess::storestore();           // flush inits for worker threads
>>>
>>> A storestore barrier doesn't flush anything, it simply maintains an 
>>> ordering between stores astride the barrier. If you needed to ensure 
>>> visibility of these updates to the GC worker threads then this would 
>>> not suffice. But any communication with the GC worker threads (i.e. 
>>> telling them there is work to be done) requires explicit 
>>> synchronization anyway, so this barrier should not be needed as it 
>>> should be subsumed by that existing mechanism.
>>
>> So this particular "storestore()" call was added at your suggestion
>> back on 2019.10.22:
>>
>> On 10/22/19 3:40 AM, David Holmes wrote:
>>> And again:
>>>
>>> 2470 void ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
>>> 2471   OrderAccess::release_store(&counters->n_in_use, 0); // currently associated with objects
>>> 2472   OrderAccess::release_store(&counters->n_in_circulation, 0); // extant
>>> 2473   OrderAccess::release_store(&counters->n_scavenged, 0); // reclaimed (global and per-thread)
>>> 2474   OrderAccess::release_store(&counters->per_thread_scavenged, 0); // per-thread scavenge total
>>> 2475   counters->per_thread_times = 0.0; // per-thread scavenge times
>>> 2476 }
>>>
>>> A release_store is not needed for each write. If "counters" is to be 
>>> published somewhere for another thread to examine then it suffices 
>>> to publish it with a suitable memory barrier. If you wanted to 
>>> internalize it you could add a storeStore() barrier at the end. 
>>
>> My reply to that comment was:
>>
>>> I'll switch these counter inits to a simple store, e.g.:
>>>
>>>     L2471:   counters->n_in_use = 0;  // currently associated with objects
>>>
>>> with a single storeStore() after them all or a comment stating which 
>>> memory
>>> sync we're relying on. It may be safer for future safepoint work 
>>> juggling to
>>> put the storeStore() there rather than relying on a comment and 
>>> hoping for
>>> unchanging code... 
>>
>> so I thought I was doing what you requested back on 2019.10.22...
>> I added a storestore() barrier as you said to do...
>
> Yep that is all well and good. But now that I see the comment, it raises 
> questions about the broader context of how this code interacts with GC 
> worker threads. It seems to me the storestore is either insufficient 
> (because it only affects ordering not visibility), or else redundant 
> (because the sync with the worker threads already handles everything).
>
>> We want to make sure those fields are initialized before the
>> worker threads start doing their work and atomically incrementing
>> those fields.
>>
>> ObjectSynchronizer::prepare_deflate_idle_monitors() is called by
>> SafepointSynchronize::do_cleanup_tasks() before it uses a group of:
>>
>>      WorkGang* cleanup_workers = heap->get_safepoint_workers();
>>
>> to get the work done:
>>
>>    if (cleanup_workers != NULL) {
>>      // Parallel cleanup using GC provided thread pool.
>>      uint num_cleanup_workers = cleanup_workers->active_workers();
>>      ParallelSPCleanupTask cleanup(num_cleanup_workers, &deflate_counters);
>>      StrongRootsScope srs(num_cleanup_workers);
>>      cleanup_workers->run_task(&cleanup);
>>    } else {
>>
>> I can't imagine there isn't a lock grab with a proper memory
>> barrier before we dispatch all the worker threads to do the work...
>
> The main synchronization mechanism appears to be Semaphores and the 
> implementation of that does ensure memory synchronization - ref e.g.:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12 
>
>
>> Of course, if there are no worker threads, then the VMThread
>> does it all and no memory barrier is needed.
>
> Right so ...
>
>> So how should I change this line:
>>
>>      L1947:   OrderAccess::storestore();           // flush inits for worker threads
>
> ... it seems this is actually redundant in this context and can be 
> removed. Perhaps with a clarifying comment in relation to 
> synchronising with the worker threads or VMThread.

I'm going to delete this line. We did nothing special and said nothing in
prepare_deflate_idle_monitors() before the Async Monitor Deflation project
(or this fix), so I'm just going back to saying nothing here.

Also, after Async Monitor Deflation bakes for a while (assuming it gets
pushed), the safepoint based deflation mechanism will get removed which
means this code will eventually go away.
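
So I expect prepare_deflate_idle_monitors() to end up back at plain stores with
no explicit barrier, roughly like this (a sketch of where I think it lands; the
WorkGang::run_task() dispatch in SafepointSynchronize::do_cleanup_tasks()
already provides the memory synchronization, per the semaphore note above):

    void ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
      // Plain stores are enough: the counters are handed to the worker threads
      // via the run_task() dispatch, which already synchronizes memory, so no
      // release_store() or storestore() is needed here.
      counters->n_in_use = 0;              // currently associated with objects
      counters->n_in_circulation = 0;      // extant
      counters->n_scavenged = 0;           // reclaimed (global and per-thread)
      counters->per_thread_scavenged = 0;  // per-thread scavenge total
      counters->per_thread_times = 0.0;    // per-thread scavenge times
    }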


>
>>
>>> ---
>>>
>>> 2047     // The working list is local so no need for a memory barrier.
>>>
>>> Unclear what kind of barrier you thought you might need. Did you 
>>> mean "no need for locking/synchronization"?
>>
>> This comment appears in three places:
>>
>>      L1846:     // defined by free_head_p and free_tail_p. The working list is
>>      L1847:     // local so no need for a memory barrier.
>>      L1848:     if (*free_head_p == NULL) *free_head_p = mid;
>>
>>      L1977:     // The working list is local so no need for a memory barrier.
>>      L1978:     guarantee(free_tail_p != NULL && deflated_count > 0, "invariant");
>>
>>      L2047:     // The working list is local so no need for a memory barrier.
>>      L2048:     guarantee(free_tail_p != NULL && deflated_count > 0, "invariant");
>>
>> I _think_ those comments were added when I had too many 
>> load_acquire() and
>> release_store() calls. So in these three cases, I'm saying that I don't
>> need memory barriers for load() or store() because these are local lists
>> that are not visible to other threads.
>
> While I can see the comments aided you when figuring out the barrier 
> placement, in general I think these ones will confuse future readers 
> of the code.

I'll delete the comments.


>
>>
>>> ---
>>>
>>> 2189     // With fine grained locks on the monitor lists, it is possible for
>>> 2190     // log_monitor_list_counts() to return a value that doesn't match
>>> 2191     // LVars.population. So far a higher value has been seen in testing
>>> 2192     // so something is being double counted by log_monitor_list_counts().
>>>
>>> AFAICS this code is always executed at a safepoint so I don't 
>>> understand how the counts could be out-of-sync. Ditto for the 
>>> comments here:
>>>
>>> 2315     // With fine grained locks on LVars.free_list, it is possible for an
>>> 2316     // ObjectMonitor to be prepended to LVars.free_list after we started
>>> 2317     // calculating chk_om_free_count so LVars.free_count may not
>>> 2318     // match anymore.
>>>
>>> and here:
>>>
>>> 2346     // With fine grained locks on the monitor lists, it is possible for
>>> 2347     // an exiting JavaThread to put its in-use ObjectMonitors on the
>>> 2348     // global in-use list after chk_om_in_use_count is calculated above.
>>>
>>> ???
>>
>> This header comment was intended to cover that rationale:
>>
>>      L2152: // This function can be called at a safepoint or it can be called when
>>      L2153: // we are trying to exit the VM. When we are trying to exit the VM, the
>>      L2154: // list walker functions can run in parallel with the other list
>>      L2155: // operations so spin-locking is used for safety.
>>      L2156: //
>>      L2157: // Calls to this function can be added in various places as a debugging
>>      L2158: // aid; pass 'true' for the 'on_exit' parameter to have in-use monitor
>>      L2159: // details logged at the Info level and 'false' for the 'on_exit'
>>      L2160: // parameter to have in-use monitor details logged at the Trace level.
>>      L2161: //
>>      L2162: void ObjectSynchronizer::audit_and_print_stats(bool on_exit) {
>>
>> so no, the audit_and_print_stats() function and the various chk_*()
>> functions can be called at a non-safepoint.
>
> When we are exiting the VM this is called from exit_globals, which 
> does execute at a safepoint - either:
>
> - from VM_Exit::do_it() (a safepoint VM op); or
> - from Threads::destroy_vm() after the VMThread has been terminated, 
> which brings the VM to a safepoint first.

Hmmm... I have this assert here:

2162 void ObjectSynchronizer::audit_and_print_stats(bool on_exit) {
2163   assert(on_exit || SafepointSynchronize::is_at_safepoint(), "invariant");

and I remember adding the "on_exit" conditional so the assert wouldn't
fire. I'll investigate more...

Also, audit_and_print_stats() calls can be added as a debugging
aid when chasing ObjectMonitor list issues. The way that it is
written, it doesn't require a safepoint (when on_exit == true).
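
For example, a call dropped into a suspect spot that is not at a safepoint
would look like this (hypothetical placement):

    // Pass true for 'on_exit' so the assert above doesn't fire at a
    // non-safepoint and the in-use monitor details are logged at the Info level.
    ObjectSynchronizer::audit_and_print_stats(true /* on_exit */);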

I'll make the minor edits mentioned above before I send out the
webrev for the next round of review.

Thanks again for being such a thorough reviewer.

Dan



>
> Cheers,
> David
> -----
>
>>
>>>
>>> ---
>>>
>>> That's it. :)
>>
>> As always, thanks for the thorough review.
>>
>> Dan
