RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)

Wed Oct 30 15:23:50 UTC 2019

This is just a checklist email to make sure that I addressed
everything in this particular review email...

On 10/22/19 4:46 PM, Daniel D. Daugherty wrote:
> Hi David,
>
> Thanks for chiming in on this review thread!  More below...
>
>
> On 10/22/19 3:40 AM, David Holmes wrote:
>> Hi Dan,
>>
>> On 22/10/2019 1:31 am, Daniel D. Daugherty wrote:
>>> Ping! Still looking for reviewers on this one. Thanks!
>>
>> I refreshed myself on the wiki then started looking at the code.
>
> Comments on the wiki are also appreciated, either here or as comments
> on the wiki itself. Which ever is easiest for you...
>
>
>> I started with the easy files (a couple of minor comments below) then 
>> something harder. :) 
>
> Good choice!
>
>
>> I still have the objectMonitor* files themselves to go through, which 
>> I will do tomorrow.
>
> I will keep my eyes open for more replies.
>
>
>>
>> Cheers,
>> David
>> ------
>>
>> src/hotspot/share/runtime/synchronizer.cpp
>>
>> My main comment on all the list management code pertains to the 
>> difficulty in seeing which release_store is paired with which 
>> load_acquire, and why. The whole point of the acquire/release 
>> protocol is that when you load_acquire a given field you will see all 
>> memory updates that preceded the release_store of the field, so that 
>> you can use those other field values knowing they are not out-of-date 
>> with respect to the field you did the load_acquire on. It is 
>> exceedingly difficult to see where these other fields are actually 
>> being read - if indeed they are.
>
> Agreed. I suspect that I have been overzealous with my use of 
> release_store()
> and load_acquire() since v.2.06 (where lock free monitor lists came 
> into the
> picture). I have a private query into Robbin and Erik O. asking for 
> feedback
> on my use of cmpxchg(), release_store() and load_acquire(). If I have 
> too much
> or too little (not likely), then I need to fix that.
>
> More below...
>
>
>> In many cases acquire/release is simply not needed. For example:
>>
>> 1245     int monitors_used = 
>> OrderAccess::load_acquire(&g_om_population) -
>> 1246 OrderAccess::load_acquire(&g_om_free_count);
>>
>> these are simple volatile counters. Their value does not relate to 
>> the updates of any other fields and we are not reading those other 
>> fields having done the load_acquire. AFAICS you use load_acquire to 
>> read all the counters (global and per-thread) everywhere but there is 
>> never a release_store associated with any of them, as they are all 
>> updated via Atomic operations. So the load_acquires are semantically 
>> meaningless, but may have an actual cost (both performance-wise and 
>> in terms of the reader trying to understand the code).
>
> This is exactly the kind of feedback I've been looking for since v2.06 
> hit
> the OpenJDK alias! Thanks!
>
> To fill in the history a bit... when I switched to lock free monitor 
> lists
> I started to see failures in the ObjectMonitor auditing code because the
> counters were out of sync with each other. I went through and switched 
> all
> the counter fetches to load_acquire() with the idea that these would 
> match
> up with the Atomic increments and decrements. It seemed to help. However,
> with subsequent intermittent failures I figured out that there are just
> naturally occurring races when the auditing code runs outside a safepoint
> because of the lock free nature of the monitor lists.
>
> Bad idea on the counter load_acquire() calls! I knew I should have pinged
> you on this before.
>
> Okay. So all the counter fetches can go back to being regular loads, e.g:
>
>     L1245: int monitors_used = g_om_population - g_om_free_count;
>
> Well, that will certainly help reduce the noise level in the code. 
> Thanks!

Done.

>
>
>> In a similar vein but from the other side these release_store 
>> operations in om_alloc are unnecessary:
>>
>> 1518     for (int i = 1; i < _BLOCKSIZE; i++) {
>> 1519       OrderAccess::release_store(&temp[i]._next_om, 
>> (ObjectMonitor*)&temp[i+1]);
>> 1520       assert(temp[i].is_free(), "invariant");
>> 1521     }
>> 1522
>> 1523     // terminate the last monitor as the end of list
>> 1524     OrderAccess::release_store(&temp[_BLOCKSIZE - 1]._next_om, 
>> (ObjectMonitor*)NULL);
>>
>> "temp" is completely thread-local at this stage. As long as temp 
>> itself is "published" with an appropriate memory barrier (which it 
>> is: prepend_block_to_list stores it with cmpxchg), and then read with 
>> an appropriate memory barrier (again it is) then all the updates you 
>> did within the "temp" block of memory are guaranteed to be visible.
>
> Agreed!
>
> The history here is that I was going through and making sure all the 
> stores
> to the _next_om field were "safe". Again, a little too over zealous. 
> Thanks
> for catching this.
>
> I'll change these to regular stores, e.g.:
>
>     L1519: temp[i]._next_om = (ObjectMonitor*)&temp[i + 1];

Done.

>
>> And again:
>>
>> 2470 void 
>> ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* 
>> counters) {
>> 2471   OrderAccess::release_store(&counters->n_in_use, 0); // 
>> currently associated with objects
>> 2472 OrderAccess::release_store(&counters->n_in_circulation, 0); // 
>> extant
>> 2473   OrderAccess::release_store(&counters->n_scavenged, 0); // 
>> reclaimed (global and per-thread)
>> 2474 OrderAccess::release_store(&counters->per_thread_scavenged, 0); 
>> // per-thread scavenge total
>> 2475   counters->per_thread_times = 0.0; // per-thread scavenge times
>> 2476 }
>>
>> A release_store is not needed for each write. If "counters" is to be 
>> published somewhere for another thread to examine then it suffices to 
>> publish it with a suitable memory barrier. If you wanted to 
>> internalize it you could add a storeStore() barrier at the end.
>
> I'll switch these counter inits to a simple store, e.g.:
>
>     L2471:   counters->n_in_use = 0;  // currently associated with 
> objects
>
> with a single storeStore() after them all or a comment stating which 
> memory
> sync we're relying on. It maybe safer for future safepoint work 
> juggling to
> put the storeStore() there rather than relying on a comment and hoping 
> for
> unchanging code... :-)

Done.

>
>
>> These unnecessary load_acquire/store_release usages will hurt 
>> performance on weakly ordered machines. And semantically they just 
>> confuse the reader who is trying to determine how and why these loads 
>> and stores are paired.
>
> Agreed.
>
>
> Okay, time to return to my uses of release_store() and load_acquire().
> Consider this bit of code:
>
> src/hotspot/share/runtime/synchronizer.cpp:
>
>     L351: // Take an ObjectMonitor from the start of the specified 
> list. Also
>     L352: // decrements the specified counter. Returns NULL if none 
> are available.
>     L353: static ObjectMonitor* 
> take_from_start_of_common(ObjectMonitor* volatile * list_p,
>     L354:                                                 int volatile 
> * count_p) {
>     L355:   ObjectMonitor* next = NULL;
>     L356:   ObjectMonitor* take = NULL;
>     L357:   // Mark the list head to guard against A-B-A race:
>     L358:   if (!mark_list_head(list_p, &take, &next)) {
>     L359:     return NULL;  // None are available.
>     L360:   }
>     L361:   // Switch marked list head to next (which unmarks the list 
> head, but
>     L362:   // leaves take marked):
>     L363:   OrderAccess::release_store(list_p, next);
>     L364:   Atomic::dec(count_p);
>     L365:   // Unmark take, but leave the next value for any lagging list
>     L366:   // walkers. It will get cleaned up when take is prepended to
>     L367:   // the in-use list:
>     L368:   set_next(take, next);
>     L369:   return take;
>     L370: }
>
>
> On L358, we call mark_list_head() and if that call returns true, then
> the calling thread is the "owner" of the list head. My idea is that
> once my thread "owns" the list head, I can use release_store() as a
> smaller hammer than cmpxchg() to sync out the new value ('next') as
> the new list head.
>
> Should I be doing something other than release_store() here?
>
> My thinking is that release_store() pairs up with the load_acquire()
> of any reader thread that's trying to simply walk the list. And that
> release_store() also pairs up with any _other_ writer thread that's
> trying to mark the list head using:
>
>     mark_list_head() -> mark_next() -> cmpxchg()
>
> Here's mark_list_head():
>
>     L194: // Mark the next field in the list head ObjectMonitor. If 
> marking was
>     L195: // successful, then the mid and the unmarked next field are 
> returned
>     L196: // via parameter and true is returned. Otherwise false is 
> returned.
>     L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
>     L198:                            ObjectMonitor** mid_p, 
> ObjectMonitor** next_p) {
>     L199:   while (true) {
>     L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
>     L201:     if (mid == NULL) {
>     L202:       return false;  // The list is empty so nothing to mark.
>     L203:     }
>     L204:     if (mark_next(mid, next_p)) {
>     L205:       if (OrderAccess::load_acquire(list_p) != mid) {
>     L206:         // The list head changed so we have to retry.
>     L207:         set_next(mid, *next_p);  // unmark mid
>     L208:         continue;
>     L209:       }
>     L210:       // We marked next field to guard against races.
>     L211:       *mid_p = mid;
>     L212:       return true;
>     L213:     }
>     L214:   }
>     L215: }
>
> which does a matching load_acquire() to get the current list head
> and calls mark_next() to try and mark it. It then calls load_acquire()
> again to verify that the list head hasn't changed while doing the
> mark_next().
>
> So here's mark_next():
>
>  161 // Mark the next field in an ObjectMonitor. If marking was 
> successful,
>  162 // then the unmarked next field is returned via parameter and 
> true is
>  163 // returned. Otherwise false is returned.
>  164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
>  165   // Get current next field without any marking value.
>  166   ObjectMonitor* next = (ObjectMonitor*)
>  167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
>  168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != 
> next) {
>  169     return false;  // Could not mark the next field or it was 
> already marked.
>  170   }
>  171   *next_p = next;
>  172   return true;
>  173 }
>
> We use load_acquire() on L166-7 to make sure that we have the latest
> value in the next_om field. That load_acquire() matches up with the
> release_store() done by another thread to unmark the next field (see
> the set_next() function). Or it matches with the cmpxchg() done by
> another to mark the next field. Yes, we could do a regular load when
> we know that the field is only changed by cmpxchg(), but it could
> have been changed by release_store() and my thinking is that we need
> a matching load_acquire().
>
> We have to use cmpxchg() here to try and set the mark on the next field.
> If successful, then we return the value of the unmarked next field via
> the 'next_p' param and we return true. However, just because we marked
> the next field in the ObjectMonitor doesn't mean that we marked the list
> head. That's what the second load_acquire() on L205 above verifies.
>
>
> Okay, that was pretty gory on the details of my thinking! I think it's
> pretty clear that I'm doing load_acquire()/release_store() pairs on
> the list head pointers or the next_om fields to make sure that I match
> loads and stores of _just_ the list head pointer or the next_om field
> in question. I'm not trying affect other fields in the ObjectMonitor
> or at least I don't think I have to do that.
>
> Maybe I've misunderstood the load_acquire()/release_store() stuff (again)
> and this code is doing way too much! I would be happy to have my thinking
> about this load_acquire()/release_store() stuff corrected and the code
> simplified. Maybe in this sequence:
>
>     mark_list_head() -> mark_next() -> cmpxchg()
>
> we don't need all the load_acquire() calls because the sequence only ends
> successfully with a cmpxchg() and that makes memory happy everywhere with
> just simple loads. Dunno. I definitely need more feedback here!
>
>
>
>>
>> ---
>>
>> A couple of minor comments/queries on other files ...
>>
>> ---
>>
>> src/hotspot/share/prims/jvm.cpp
>>
>> I'm not clear what the relationship between async deflation and 
>> System.gc is. In this logic:
>>
>> +     if (AsyncDeflateIdleMonitors) {
>> +       // AsyncDeflateIdleMonitors needs to know when System.gc() is
>> +       // called so any special deflation can be done at a safepoint.
>> + ObjectSynchronizer::set_is_special_deflation_requested(true);
>> +     }
>> Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>
>> if we are using a GC that treats System.gc() as a no-op, does the 
>> fact we called set_is_special_deflation_requested(true), a problem?
>
> I don't think we have a problem here. Let me take a step back and
> explain the rationale here.
>
> We have some tests that assume that GC will eventually cleanup any
> unreferenced object. ObjectMonitor is a root so an object with an
> associated ObjectMonitor won't get GC'ed until the ObjectMonitor
> gets deflated.
>
>     Object foo;
>     synchronized (foo) {
>         foo.wait(1);
>     }
>     // foo now has an associated ObjectMonitor
>     foo = null;
>     // no local ref so 'foo' should be GC'able (in theory)
>     <do stuff to use up all of memory>
>     <do System.gc() calls>
>     <verify that foo has been GC'ed somehow>
>
> Note: There are also one or two tests that assume that using up all of
> memory is enough to cause 'foo' to be GC'ed (no System.gc() calls 
> required).
>
> When JavaThreads were responsible for deflating their own ObjectMonitors,
> we had to have System.gc() request a special deflation so that the
> ObjectMonitor associated with 'foo' is deflated so that 'foo' could be
> GC'ed.
>
> When we moved async deflation over the ServiceThread, I kept this code
> around because there's no guarantee when the ServiceThread will get
> around to deflating the ObjectMonitor associated with 'foo'. I didn't
> want to deal with intermittent test failures.
>
> Okay, back to the question... If we have a GC that treats System.gc() as
> a no-op, it's likely going to fail the test anyway because 'foo' will 
> never
> get GC'ed. The special cleanup request shouldn't break anything since all
> that does is cause safepoint based deflation to happen during a safepoint
> cleanup phase (like what happens in the baseline today).
>
>
>>
>> ---
>>
>> src/hotspot/share/runtime/serviceThread.cpp
>>
>> I have some concerns that the serviceThread must now wakeup 
>> periodically instead of only on demand. It has a lot of tasks to 
>> check now and it is not obvious how complex or trivial those checks 
>> are. We don't currently gather any statistics in this area so we have 
>> no idea how long the serviceThread typically waits between requests. 
>> The fact the serviceThread is woken on demand means that the desire 
>> to be "checked at the same interval" seems exceedingly unlikely - the 
>> serviceThread would have to waiting for extended periods of time for 
>> the timed-wait to have any real affect on the frequency of the async 
>> monitor deflation.
>
> Periodically waking up the ServiceThread is done to match the safepoint
> cleanup period of GuaranteedSafepointInterval when doing safepoint based
> deflation. Without waking up periodically, we could go a long time before
> doing any deflation via the ServiceThread and that would definitely be
> an observable change in behavior relative to safepoint based deflation.
> In some of my test runs, I had seen us go for 8-27 seconds without doing
> any async monitor deflation. Of course, that meant that there was more to
> cleanup when we did do it.
>
> I've been thinking about have my own AsyncDeflateIdleMonitorsThread 
> and move
> all the async deflation code from ServiceThread there. So many things are
> being added to the ServiceThread that I do wonder whether it is 
> overloaded,
> but (as you said) there's no way to know that (yet).
>
> BTW, I think GuaranteedSafepointInterval might be "broken" at this point
> in that we don't guarantee a safepoint at that interval anymore. This has
> kind of slipped through the cracks...
>
>
>> ---
>>
>> src/hotspot/share/runtime/thread.cpp
>>
>> No changes to this file.
>>
>> ---
>>
>> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>>
>> No changes to this file.
>
> Correct. I'm unsure how to deal with this type of thing. This project
> is managed as a series of patches. Earlier patches changed those files,
> but a later patch undid those changes. So depending on where you are
> in the patch queue, the logical 'files.list' contents are different.
> I'm currently using a 'files.list' that reflects every file touched by
> the current set of patches and that's why those two files are included
> with no changes.
>
> Suggestions?
>
> Thanks for the partial review! I'm looking forward to resolving the
> queries about and your next set of review comments.
>
> Dan

Okay. I think I've made all the changes from David H's part 1 of 3
code review email.

Please let me know if I'm missed anything.

Dan

>
>
>>
>> ---
>>
>>
>>
>>> Dan
>>>
>>>
>>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> The Async Monitor Deflation project is reaching the end game. I 
>>>> have no
>>>> changes planned for the project at this time so all that is left is 
>>>> code
>>>> review and any changes that results from those reviews.
>>>>
>>>> Carsten and Roman! Time for you guys to chime in again on the code 
>>>> reviews.
>>>>
>>>> I have attached the list of fixes from CR6 to CR7 instead of 
>>>> putting it
>>>> in the main body of this email.
>>>>
>>>> Main bug URL:
>>>>
>>>>     JDK-8153224 Monitor deflation prolong safepoints
>>>>     https://bugs.openjdk.java.net/browse/JDK-8153224
>>>>
>>>> The project is currently baselined on jdk-14+19.
>>>>
>>>> Here's the full webrev URL for those folks that want to see all of the
>>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full 
>>>>
>>>>
>>>> Some folks might want to see just what has changed since the last 
>>>> review
>>>> cycle so here's a webrev for that (v2.07 inc):
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/ 
>>>>
>>>>
>>>> The OpenJDK wiki has been updated to match the 
>>>> CR7/v2.07/10-for-jdk14 changes:
>>>>
>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>>
>>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5 
>>>> tier[1-8]
>>>> testing on Oracle's usual set of platforms. It has also been 
>>>> through my usual
>>>> set of stress testing on Linux-X64, macOSX and Solaris-X64 with the 
>>>> addition
>>>> of Robbin's "MoCrazy 1024" test running in parallel with the other 
>>>> tests in
>>>> my lab.
>>>>
>>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5 
>>>> tier[1-3]
>>>> test on Oracle's usual set of platforms. Mach5 tier[4-8] are in 
>>>> process.
>>>>
>>>> I did another round of SPECjbb2015 testing in Oracle's Aurora 
>>>> Performance lab
>>>> using using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>>
>>>>     - "base" is jdk-14+18
>>>>     - "v2.07" is the latest version and includes C2 
>>>> inc_om_ref_count() support
>>>>       on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors option
>>>>     - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>>     - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors 
>>>> specified
>>>>
>>>>          hbIR           hbIR
>>>>     (max attempted)  (settled)  max-jOPS  critical-jOPS runtime
>>>>     ---------------  ---------  --------  ------------- -------
>>>>            34282.00   30635.90  28831.30       20969.20 3841.30 base
>>>>            34282.00   30973.00  29345.80       21025.20 3964.10 v2.07
>>>>            34282.00   31105.60  29174.30       21074.00 3931.30 
>>>> v2.07_handshake
>>>>            34282.00   30789.70  27151.60       19839.10 3850.20 
>>>> v2.07_off
>>>>
>>>>     - The Aurora Perf comparison tool reports:
>>>>
>>>>         Comparison              max-jOPS critical-jOPS
>>>>         ----------------------  -------------------- 
>>>> --------------------
>>>>         base vs 2.07            +1.78% (s, p=0.000)   +0.27% (ns, 
>>>> p=0.790)
>>>>         base vs 2.07_handshake  +1.19% (s, p=0.007)   +0.58% (ns, 
>>>> p=0.536)
>>>>         base vs 2.07_off        -5.83% (ns, p=0.394)  -5.39% (ns, 
>>>> p=0.347)
>>>>
>>>>         (s) - significant  (ns) - not-significant
>>>>
>>>>     - For historical comparison, the Aurora Perf comparision tool
>>>>         reported for v2.06 with a baseline of jdk-13+31:
>>>>
>>>>         Comparison              max-jOPS critical-jOPS
>>>>         ----------------------  -------------------- 
>>>> --------------------
>>>>         base vs 2.06            -0.32% (ns, p=0.345)  +0.71% (ns, 
>>>> p=0.646)
>>>>         base vs 2.06_off        +0.49% (ns, p=0.292)  -1.21% (ns, 
>>>> p=0.481)
>>>>
>>>>         (s) - significant  (ns) - not-significant
>>>>
>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>
>>>> Dan
>
> <trimmed older review invites>
>
>