RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Oct 30 15:23:50 UTC 2019
This is just a checklist email to make sure that I addressed
everything in this particular review email...
On 10/22/19 4:46 PM, Daniel D. Daugherty wrote:
> Hi David,
>
> Thanks for chiming in on this review thread! More below...
>
>
> On 10/22/19 3:40 AM, David Holmes wrote:
>> Hi Dan,
>>
>> On 22/10/2019 1:31 am, Daniel D. Daugherty wrote:
>>> Ping! Still looking for reviewers on this one. Thanks!
>>
>> I refreshed myself on the wiki then started looking at the code.
>
> Comments on the wiki are also appreciated, either here or as comments
> on the wiki itself. Whichever is easiest for you...
>
>
>> I started with the easy files (a couple of minor comments below) then
>> something harder. :)
>
> Good choice!
>
>
>> I still have the objectMonitor* files themselves to go through, which
>> I will do tomorrow.
>
> I will keep my eyes open for more replies.
>
>
>>
>> Cheers,
>> David
>> ------
>>
>> src/hotspot/share/runtime/synchronizer.cpp
>>
>> My main comment on all the list management code pertains to the
>> difficulty in seeing which release_store is paired with which
>> load_acquire, and why. The whole point of the acquire/release
>> protocol is that when you load_acquire a given field you will see all
>> memory updates that preceded the release_store of the field, so that
>> you can use those other field values knowing they are not out-of-date
>> with respect to the field you did the load_acquire on. It is
>> exceedingly difficult to see where these other fields are actually
>> being read - if indeed they are.
>
> Agreed. I suspect that I have been overzealous with my use of
> release_store() and load_acquire() since v2.06 (where lock free
> monitor lists came into the picture). I have a private query in to
> Robbin and Erik O. asking for feedback on my use of cmpxchg(),
> release_store() and load_acquire(). If I have too much or too little
> (not likely), then I need to fix that.
>
> More below...
>
>
>> In many cases acquire/release is simply not needed. For example:
>>
>> 1245   int monitors_used = OrderAccess::load_acquire(&g_om_population) -
>> 1246                       OrderAccess::load_acquire(&g_om_free_count);
>>
>> these are simple volatile counters. Their value does not relate to
>> the updates of any other fields and we are not reading those other
>> fields having done the load_acquire. AFAICS you use load_acquire to
>> read all the counters (global and per-thread) everywhere but there is
>> never a release_store associated with any of them, as they are all
>> updated via Atomic operations. So the load_acquires are semantically
>> meaningless, but may have an actual cost (both performance-wise and
>> in terms of the reader trying to understand the code).
>
> This is exactly the kind of feedback I've been looking for since v2.06
> hit the OpenJDK alias! Thanks!
>
> To fill in the history a bit... when I switched to lock free monitor
> lists I started to see failures in the ObjectMonitor auditing code
> because the counters were out of sync with each other. I went through
> and switched all the counter fetches to load_acquire() with the idea
> that these would match up with the Atomic increments and decrements.
> It seemed to help. However, with subsequent intermittent failures I
> figured out that there are just naturally occurring races when the
> auditing code runs outside a safepoint because of the lock free
> nature of the monitor lists.
>
> Bad idea on the counter load_acquire() calls! I knew I should have pinged
> you on this before.
>
> Okay. So all the counter fetches can go back to being regular loads, e.g:
>
> L1245: int monitors_used = g_om_population - g_om_free_count;
>
> Well, that will certainly help reduce the noise level in the code.
> Thanks!
Done.
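As a sanity check on the plain-load fix, here's a minimal std::atomic sketch; the names (g_population, g_free_count, om_alloc_one, etc.) are hypothetical stand-ins, not the real HotSpot globals or the Atomic/OrderAccess API. The point it models: counters that are only ever updated with atomic RMW ops order no other memory, so readers need no acquire semantics.

```cpp
#include <atomic>

// Stand-ins for g_om_population / g_om_free_count: counters updated only
// via atomic RMW ops (the Atomic::inc/dec analogue).
static std::atomic<int> g_population{0};
static std::atomic<int> g_free_count{0};

void om_alloc_one() {
  g_population.fetch_add(1, std::memory_order_relaxed);  // ~Atomic::inc
  g_free_count.fetch_add(1, std::memory_order_relaxed);
}

void om_take_one() {
  g_free_count.fetch_sub(1, std::memory_order_relaxed);  // ~Atomic::dec
}

// The L1245 computation with plain (relaxed) loads: the two counters do
// not publish any other fields, so acquire ordering buys nothing here.
int monitors_used() {
  return g_population.load(std::memory_order_relaxed) -
         g_free_count.load(std::memory_order_relaxed);
}
```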
>
>
>> In a similar vein but from the other side these release_store
>> operations in om_alloc are unnecessary:
>>
>> 1518   for (int i = 1; i < _BLOCKSIZE; i++) {
>> 1519     OrderAccess::release_store(&temp[i]._next_om, (ObjectMonitor*)&temp[i+1]);
>> 1520     assert(temp[i].is_free(), "invariant");
>> 1521   }
>> 1522
>> 1523   // terminate the last monitor as the end of list
>> 1524   OrderAccess::release_store(&temp[_BLOCKSIZE - 1]._next_om, (ObjectMonitor*)NULL);
>>
>> "temp" is completely thread-local at this stage. As long as temp
>> itself is "published" with an appropriate memory barrier (which it
>> is: prepend_block_to_list stores it with cmpxchg), and then read with
>> an appropriate memory barrier (again it is) then all the updates you
>> did within the "temp" block of memory are guaranteed to be visible.
>
> Agreed!
>
> The history here is that I was going through and making sure all the
> stores to the _next_om field were "safe". Again, a little too
> overzealous. Thanks for catching this.
>
> I'll change these to regular stores, e.g.:
>
> L1519: temp[i]._next_om = (ObjectMonitor*)&temp[i + 1];
Done.
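For the record, the publish-with-cmpxchg pattern David describes can be modeled like this. It's a sketch only, with a hypothetical Node type and std::atomic standing in for ObjectMonitor and the HotSpot Atomic API: the block is wired up with plain stores while it is still thread-local, and the release ordering on the publishing CAS is what makes those stores visible to other threads.

```cpp
#include <atomic>
#include <cstddef>

struct Node {
  Node* next;
};

static std::atomic<Node*> g_free_list{nullptr};

// Prepend a freshly allocated block of 'block_size' nodes to the global
// free list (the prepend_block_to_list analogue).
void prepend_block_to_list(Node* block, std::size_t block_size) {
  // Plain stores: nobody else can see 'block' yet.
  for (std::size_t i = 0; i + 1 < block_size; i++) {
    block[i].next = &block[i + 1];
  }
  Node* old_head = g_free_list.load(std::memory_order_relaxed);
  do {
    block[block_size - 1].next = old_head;  // terminate at current head
  } while (!g_free_list.compare_exchange_weak(
      old_head, block,
      std::memory_order_release,   // publishes all the plain stores above
      std::memory_order_relaxed));
}
```

A reader that load-acquires g_free_list is then guaranteed to see the fully linked block, with no release_store needed on the individual next fields.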
>
>> And again:
>>
>> 2470 void ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
>> 2471   OrderAccess::release_store(&counters->n_in_use, 0);              // currently associated with objects
>> 2472   OrderAccess::release_store(&counters->n_in_circulation, 0);      // extant
>> 2473   OrderAccess::release_store(&counters->n_scavenged, 0);           // reclaimed (global and per-thread)
>> 2474   OrderAccess::release_store(&counters->per_thread_scavenged, 0);  // per-thread scavenge total
>> 2475   counters->per_thread_times = 0.0;                                // per-thread scavenge times
>> 2476 }
>>
>> A release_store is not needed for each write. If "counters" is to be
>> published somewhere for another thread to examine then it suffices to
>> publish it with a suitable memory barrier. If you wanted to
>> internalize it you could add a storeStore() barrier at the end.
>
> I'll switch these counter inits to a simple store, e.g.:
>
> L2471: counters->n_in_use = 0; // currently associated with
> objects
>
> with a single storeStore() after them all or a comment stating which
> memory sync we're relying on. It may be safer for future safepoint
> work juggling to put the storeStore() there rather than relying on a
> comment and hoping for unchanging code... :-)
Done.
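A sketch of what the fix looks like, with a hypothetical Counters type and std::atomic_thread_fence standing in for OrderAccess::storeStore(): plain stores for the re-inits, then a single trailing fence so the zeroed values cannot be reordered past whatever store later hands the struct to another thread.

```cpp
#include <atomic>

struct Counters {
  int n_in_use;
  int n_in_circulation;
  int n_scavenged;
  int per_thread_scavenged;
  double per_thread_times;
};

void prepare_deflate_idle_monitors(Counters* counters) {
  counters->n_in_use = 0;              // currently associated with objects
  counters->n_in_circulation = 0;      // extant
  counters->n_scavenged = 0;           // reclaimed (global and per-thread)
  counters->per_thread_scavenged = 0;  // per-thread scavenge total
  counters->per_thread_times = 0.0;    // per-thread scavenge times
  // One barrier for all of the stores above (~OrderAccess::storeStore()).
  std::atomic_thread_fence(std::memory_order_release);
}
```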
>
>
>> These unnecessary load_acquire/store_release usages will hurt
>> performance on weakly ordered machines. And semantically they just
>> confuse the reader who is trying to determine how and why these loads
>> and stores are paired.
>
> Agreed.
>
>
> Okay, time to return to my uses of release_store() and load_acquire().
> Consider this bit of code:
>
> src/hotspot/share/runtime/synchronizer.cpp:
>
> L351:   // Take an ObjectMonitor from the start of the specified list. Also
> L352:   // decrements the specified counter. Returns NULL if none are available.
> L353:   static ObjectMonitor* take_from_start_of_common(ObjectMonitor* volatile * list_p,
> L354:                                                   int volatile * count_p) {
> L355:     ObjectMonitor* next = NULL;
> L356:     ObjectMonitor* take = NULL;
> L357:     // Mark the list head to guard against A-B-A race:
> L358:     if (!mark_list_head(list_p, &take, &next)) {
> L359:       return NULL;  // None are available.
> L360:     }
> L361:     // Switch marked list head to next (which unmarks the list head, but
> L362:     // leaves take marked):
> L363:     OrderAccess::release_store(list_p, next);
> L364:     Atomic::dec(count_p);
> L365:     // Unmark take, but leave the next value for any lagging list
> L366:     // walkers. It will get cleaned up when take is prepended to
> L367:     // the in-use list:
> L368:     set_next(take, next);
> L369:     return take;
> L370:   }
>
>
> On L358, we call mark_list_head() and if that call returns true, then
> the calling thread is the "owner" of the list head. My idea is that
> once my thread "owns" the list head, I can use release_store() as a
> smaller hammer than cmpxchg() to sync out the new value ('next') as
> the new list head.
>
> Should I be doing something other than release_store() here?
>
> My thinking is that release_store() pairs up with the load_acquire()
> of any reader thread that's trying to simply walk the list. And that
> release_store() also pairs up with any _other_ writer thread that's
> trying to mark the list head using:
>
> mark_list_head() -> mark_next() -> cmpxchg()
>
> Here's mark_list_head():
>
> L194: // Mark the next field in the list head ObjectMonitor. If marking was
> L195: // successful, then the mid and the unmarked next field are returned
> L196: // via parameter and true is returned. Otherwise false is returned.
> L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
> L198:                            ObjectMonitor** mid_p, ObjectMonitor** next_p) {
> L199:   while (true) {
> L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
> L201:     if (mid == NULL) {
> L202:       return false;  // The list is empty so nothing to mark.
> L203:     }
> L204:     if (mark_next(mid, next_p)) {
> L205:       if (OrderAccess::load_acquire(list_p) != mid) {
> L206:         // The list head changed so we have to retry.
> L207:         set_next(mid, *next_p);  // unmark mid
> L208:         continue;
> L209:       }
> L210:       // We marked next field to guard against races.
> L211:       *mid_p = mid;
> L212:       return true;
> L213:     }
> L214:   }
> L215: }
>
> which does a matching load_acquire() to get the current list head
> and calls mark_next() to try and mark it. It then calls load_acquire()
> again to verify that the list head hasn't changed while doing the
> mark_next().
>
> So here's mark_next():
>
> 161 // Mark the next field in an ObjectMonitor. If marking was successful,
> 162 // then the unmarked next field is returned via parameter and true is
> 163 // returned. Otherwise false is returned.
> 164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
> 165   // Get current next field without any marking value.
> 166   ObjectMonitor* next = (ObjectMonitor*)
> 167       ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
> 168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
> 169     return false;  // Could not mark the next field or it was already marked.
> 170   }
> 171   *next_p = next;
> 172   return true;
> 173 }
>
> We use load_acquire() on L166-7 to make sure that we have the latest
> value in the next_om field. That load_acquire() matches up with the
> release_store() done by another thread to unmark the next field (see
> the set_next() function). Or it matches with the cmpxchg() done by
> another to mark the next field. Yes, we could do a regular load when
> we know that the field is only changed by cmpxchg(), but it could
> have been changed by release_store() and my thinking is that we need
> a matching load_acquire().
>
> We have to use cmpxchg() here to try and set the mark on the next field.
> If successful, then we return the value of the unmarked next field via
> the 'next_p' param and we return true. However, just because we marked
> the next field in the ObjectMonitor doesn't mean that we marked the list
> head. That's what the second load_acquire() on L205 above verifies.
>
>
> Okay, that was pretty gory on the details of my thinking! I think it's
> pretty clear that I'm doing load_acquire()/release_store() pairs on
> the list head pointers or the next_om fields to make sure that I match
> loads and stores of _just_ the list head pointer or the next_om field
> in question. I'm not trying to affect other fields in the ObjectMonitor,
> or at least I don't think I have to do that.
>
> Maybe I've misunderstood the load_acquire()/release_store() stuff (again)
> and this code is doing way too much! I would be happy to have my thinking
> about this load_acquire()/release_store() stuff corrected and the code
> simplified. Maybe in this sequence:
>
> mark_list_head() -> mark_next() -> cmpxchg()
>
> we don't need all the load_acquire() calls because the sequence only ends
> successfully with a cmpxchg() and that makes memory happy everywhere with
> just simple loads. Dunno. I definitely need more feedback here!
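To make the question concrete, here's how I'd model the low-bit marking protocol outside HotSpot. This is a simplified sketch only: a hypothetical OMNode type with std::atomic in place of the real ObjectMonitor, OrderAccess and Atomic APIs, keeping the acquire loads as in the code above (whether they can all become plain loads is exactly the open question).

```cpp
#include <atomic>
#include <cstdint>

struct OMNode {
  std::atomic<std::uintptr_t> next{0};  // low bit set => marked
};

static OMNode* unmarked(std::uintptr_t v) {
  return reinterpret_cast<OMNode*>(v & ~std::uintptr_t(1));
}

// The mark_next() analogue: try to set the low bit of n->next via CAS.
static bool mark_next(OMNode* n, OMNode** next_p) {
  std::uintptr_t cur =
      n->next.load(std::memory_order_acquire) & ~std::uintptr_t(1);
  if (!n->next.compare_exchange_strong(cur, cur | 1,
                                       std::memory_order_acq_rel)) {
    return false;  // already marked, or it changed underneath us
  }
  *next_p = unmarked(cur);
  return true;
}

// The set_next() analogue: store an unmarked next value (also unmarks).
static void set_next(OMNode* n, OMNode* next) {
  n->next.store(reinterpret_cast<std::uintptr_t>(next),
                std::memory_order_release);
}

// The mark_list_head() analogue: mark the head's next field, then verify
// the head did not change while we were marking; otherwise unmark + retry.
static bool mark_list_head(std::atomic<OMNode*>* list_p,
                           OMNode** mid_p, OMNode** next_p) {
  while (true) {
    OMNode* mid = list_p->load(std::memory_order_acquire);
    if (mid == nullptr) {
      return false;  // empty list, nothing to mark
    }
    if (mark_next(mid, next_p)) {
      if (list_p->load(std::memory_order_acquire) != mid) {
        set_next(mid, *next_p);  // head changed: unmark and retry
        continue;
      }
      *mid_p = mid;
      return true;
    }
  }
}
```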
>
>
>
>>
>> ---
>>
>> A couple of minor comments/queries on other files ...
>>
>> ---
>>
>> src/hotspot/share/prims/jvm.cpp
>>
>> I'm not clear what the relationship between async deflation and
>> System.gc is. In this logic:
>>
>> + if (AsyncDeflateIdleMonitors) {
>> + // AsyncDeflateIdleMonitors needs to know when System.gc() is
>> + // called so any special deflation can be done at a safepoint.
>> + ObjectSynchronizer::set_is_special_deflation_requested(true);
>> + }
>> Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>
>> if we are using a GC that treats System.gc() as a no-op, does the
>> fact we called set_is_special_deflation_requested(true), a problem?
>
> I don't think we have a problem here. Let me take a step back and
> explain the rationale here.
>
> We have some tests that assume that GC will eventually cleanup any
> unreferenced object. ObjectMonitor is a root so an object with an
> associated ObjectMonitor won't get GC'ed until the ObjectMonitor
> gets deflated.
>
> Object foo = new Object();
> synchronized (foo) {
> foo.wait(1);
> }
> // foo now has an associated ObjectMonitor
> foo = null;
> // no local ref so 'foo' should be GC'able (in theory)
> <do stuff to use up all of memory>
> <do System.gc() calls>
> <verify that foo has been GC'ed somehow>
>
> Note: There are also one or two tests that assume that using up all of
> memory is enough to cause 'foo' to be GC'ed (no System.gc() calls
> required).
>
> When JavaThreads were responsible for deflating their own ObjectMonitors,
> we had to have System.gc() request a special deflation so that the
> ObjectMonitor associated with 'foo' is deflated so that 'foo' could be
> GC'ed.
>
> When we moved async deflation over to the ServiceThread, I kept this code
> around because there's no guarantee when the ServiceThread will get
> around to deflating the ObjectMonitor associated with 'foo'. I didn't
> want to deal with intermittent test failures.
>
> Okay, back to the question... If we have a GC that treats System.gc()
> as a no-op, it's likely going to fail the test anyway because 'foo'
> will never get GC'ed. The special cleanup request shouldn't break
> anything since all that does is cause safepoint based deflation to
> happen during a safepoint cleanup phase (like what happens in the
> baseline today).
>
>
>>
>> ---
>>
>> src/hotspot/share/runtime/serviceThread.cpp
>>
>> I have some concerns that the serviceThread must now wakeup
>> periodically instead of only on demand. It has a lot of tasks to
>> check now and it is not obvious how complex or trivial those checks
>> are. We don't currently gather any statistics in this area so we have
>> no idea how long the serviceThread typically waits between requests.
>> The fact the serviceThread is woken on demand means that the desire
>> to be "checked at the same interval" seems exceedingly unlikely - the
>> serviceThread would have to be waiting for extended periods of time
>> for the timed-wait to have any real effect on the frequency of the
>> async monitor deflation.
>
> Periodically waking up the ServiceThread is done to match the safepoint
> cleanup period of GuaranteedSafepointInterval when doing safepoint based
> deflation. Without waking up periodically, we could go a long time before
> doing any deflation via the ServiceThread and that would definitely be
> an observable change in behavior relative to safepoint based deflation.
> In some of my test runs, I had seen us go for 8-27 seconds without doing
> any async monitor deflation. Of course, that meant that there was more to
> cleanup when we did do it.
>
> I've been thinking about having my own AsyncDeflateIdleMonitorsThread
> and moving all the async deflation code from the ServiceThread there.
> So many things are being added to the ServiceThread that I do wonder
> whether it is overloaded, but (as you said) there's no way to know
> that (yet).
>
> BTW, I think GuaranteedSafepointInterval might be "broken" at this point
> in that we don't guarantee a safepoint at that interval anymore. This has
> kind of slipped through the cracks...
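The wait-with-timeout shape I'm describing can be sketched with standard C++ primitives; the names (ServiceThreadLoop, notify_work, etc.) are illustrative, not the real HotSpot ServiceThread API. The thread still wakes on demand via notify, but the wait_for() bound guarantees it also wakes at least once per GuaranteedSafepointInterval-like period to check whether deflation is needed.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class ServiceThreadLoop {
 public:
  explicit ServiceThreadLoop(std::chrono::milliseconds interval)
      : interval_(interval) {}

  // Called by other threads to request work on demand.
  void notify_work() {
    std::lock_guard<std::mutex> lg(lock_);
    has_work_ = true;
    cv_.notify_one();
  }

  // Returns true if woken for explicit work; false on a periodic
  // timeout, where we would still poll for async deflation work
  // (the is_async_deflation_needed() check).
  bool wait_for_work() {
    std::unique_lock<std::mutex> ul(lock_);
    bool notified = cv_.wait_for(ul, interval_, [this] { return has_work_; });
    has_work_ = false;
    return notified;
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  std::chrono::milliseconds interval_;
  bool has_work_ = false;
};
```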
>
>
>> ---
>>
>> src/hotspot/share/runtime/thread.cpp
>>
>> No changes to this file.
>>
>> ---
>>
>> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>>
>> No changes to this file.
>
> Correct. I'm unsure how to deal with this type of thing. This project
> is managed as a series of patches. Earlier patches changed those files,
> but a later patch undid those changes. So depending on where you are
> in the patch queue, the logical 'files.list' contents are different.
> I'm currently using a 'files.list' that reflects every file touched by
> the current set of patches and that's why those two files are included
> with no changes.
>
> Suggestions?
>
> Thanks for the partial review! I'm looking forward to resolving the
> queries above and your next set of review comments.
>
> Dan
Okay. I think I've made all the changes from David H's part 1 of 3
code review email.
Please let me know if I've missed anything.
Dan
>
>
>>
>> ---
>>
>>
>>
>>> Dan
>>>
>>>
>>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> The Async Monitor Deflation project is reaching the end game. I
>>>> have no changes planned for the project at this time so all that
>>>> is left is code review and any changes that result from those
>>>> reviews.
>>>>
>>>> Carsten and Roman! Time for you guys to chime in again on the code
>>>> reviews.
>>>>
>>>> I have attached the list of fixes from CR6 to CR7 instead of
>>>> putting it
>>>> in the main body of this email.
>>>>
>>>> Main bug URL:
>>>>
>>>> JDK-8153224 Monitor deflation prolong safepoints
>>>> https://bugs.openjdk.java.net/browse/JDK-8153224
>>>>
>>>> The project is currently baselined on jdk-14+19.
>>>>
>>>> Here's the full webrev URL for those folks that want to see all of the
>>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full
>>>>
>>>>
>>>> Some folks might want to see just what has changed since the last
>>>> review
>>>> cycle so here's a webrev for that (v2.07 inc):
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/
>>>>
>>>>
>>>> The OpenJDK wiki has been updated to match the
>>>> CR7/v2.07/10-for-jdk14 changes:
>>>>
>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>>
>>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5
>>>> tier[1-8] testing on Oracle's usual set of platforms. It has also
>>>> been through my usual set of stress testing on Linux-X64, macOSX
>>>> and Solaris-X64 with the addition of Robbin's "MoCrazy 1024" test
>>>> running in parallel with the other tests in my lab.
>>>>
>>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5
>>>> tier[1-3] testing on Oracle's usual set of platforms. Mach5
>>>> tier[4-8] are in process.
>>>>
>>>> I did another round of SPECjbb2015 testing in Oracle's Aurora
>>>> Performance lab using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>>
>>>> - "base" is jdk-14+18
>>>> - "v2.07" is the latest version and includes C2
>>>> inc_om_ref_count() support
>>>> on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors option
>>>> - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>> - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors
>>>> specified
>>>>
>>>>            hbIR       hbIR
>>>> (max attempted)  (settled)  max-jOPS  critical-jOPS  runtime
>>>> ---------------  ---------  --------  -------------  -------
>>>>        34282.00   30635.90  28831.30       20969.20  3841.30  base
>>>>        34282.00   30973.00  29345.80       21025.20  3964.10  v2.07
>>>>        34282.00   31105.60  29174.30       21074.00  3931.30  v2.07_handshake
>>>>        34282.00   30789.70  27151.60       19839.10  3850.20  v2.07_off
>>>>
>>>> - The Aurora Perf comparison tool reports:
>>>>
>>>> Comparison              max-jOPS              critical-jOPS
>>>> ----------------------  --------------------  --------------------
>>>> base vs 2.07            +1.78% (s, p=0.000)   +0.27% (ns, p=0.790)
>>>> base vs 2.07_handshake  +1.19% (s, p=0.007)   +0.58% (ns, p=0.536)
>>>> base vs 2.07_off        -5.83% (ns, p=0.394)  -5.39% (ns, p=0.347)
>>>>
>>>> (s) - significant  (ns) - not-significant
>>>>
>>>> - For historical comparison, the Aurora Perf comparison tool
>>>>   reported for v2.06 with a baseline of jdk-13+31:
>>>>
>>>>   Comparison        max-jOPS              critical-jOPS
>>>>   ----------------  --------------------  --------------------
>>>>   base vs 2.06      -0.32% (ns, p=0.345)  +0.71% (ns, p=0.646)
>>>>   base vs 2.06_off  +0.49% (ns, p=0.292)  -1.21% (ns, p=0.481)
>>>>
>>>>   (s) - significant  (ns) - not-significant
>>>>
>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>
>>>> Dan
>
> <trimmed older review invites>
>
>
More information about the hotspot-runtime-dev mailing list