RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Oct 22 20:46:46 UTC 2019


Hi David,

Thanks for chiming in on this review thread!  More below...


On 10/22/19 3:40 AM, David Holmes wrote:
> Hi Dan,
>
> On 22/10/2019 1:31 am, Daniel D. Daugherty wrote:
>> Ping! Still looking for reviewers on this one. Thanks!
>
> I refreshed myself on the wiki then started looking at the code.

Comments on the wiki are also appreciated, either here or as comments
on the wiki itself. Whichever is easiest for you...


> I started with the easy files (a couple of minor comments below) then 
> something harder. :) 

Good choice!


> I still have the objectMonitor* files themselves to go through, which 
> I will do tomorrow.

I will keep my eyes open for more replies.


>
> Cheers,
> David
> ------
>
> src/hotspot/share/runtime/synchronizer.cpp
>
> My main comment on all the list management code pertains to the 
> difficulty in seeing which release_store is paired with which 
> load_acquire, and why. The whole point of the acquire/release protocol 
> is that when you load_acquire a given field you will see all memory 
> updates that preceded the release_store of the field, so that you can 
> use those other field values knowing they are not out-of-date with 
> respect to the field you did the load_acquire on. It is exceedingly 
> difficult to see where these other fields are actually being read - if 
> indeed they are.

Agreed. I suspect that I have been overzealous with my use of
release_store() and load_acquire() since v2.06 (where lock free monitor
lists came into the picture). I have a private query into Robbin and
Erik O. asking for feedback on my use of cmpxchg(), release_store() and
load_acquire(). If I have too much or too little (not likely), then I
need to fix that.

More below...


> In many cases acquire/release is simply not needed. For example:
>
> 1245     int monitors_used = OrderAccess::load_acquire(&g_om_population) -
> 1246                         OrderAccess::load_acquire(&g_om_free_count);
>
> these are simple volatile counters. Their value does not relate to the 
> updates of any other fields and we are not reading those other fields 
> having done the load_acquire. AFAICS you use load_acquire to read all 
> the counters (global and per-thread) everywhere but there is never a 
> release_store associated with any of them, as they are all updated via 
> Atomic operations. So the load_acquires are semantically meaningless, 
> but may have an actual cost (both performance-wise and in terms of the 
> reader trying to understand the code).

This is exactly the kind of feedback I've been looking for since v2.06 hit
the OpenJDK alias! Thanks!

To fill in the history a bit... when I switched to lock free monitor lists
I started to see failures in the ObjectMonitor auditing code because the
counters were out of sync with each other. I went through and switched all
the counter fetches to load_acquire() with the idea that these would match
up with the Atomic increments and decrements. It seemed to help. However,
with subsequent intermittent failures I figured out that there are just
naturally occurring races when the auditing code runs outside a safepoint
because of the lock free nature of the monitor lists.

Bad idea on the counter load_acquire() calls! I knew I should have pinged
you on this before.

Okay. So all the counter fetches can go back to being regular loads, e.g:

     L1245: int monitors_used = g_om_population - g_om_free_count;

Well, that will certainly help reduce the noise level in the code. Thanks!
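To make the pairing concrete, the end result is Atomic updates on the
writer side and plain loads of the volatile counters on the reader side,
something like this sketch (using the existing counter names):

     // Writers only ever update the counters atomically:
     Atomic::inc(&g_om_population);
     Atomic::dec(&g_om_free_count);

     // Readers use plain loads; no acquire is needed because the
     // counters don't publish any other fields:
     int monitors_used = g_om_population - g_om_free_count;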


> In a similar vein but from the other side these release_store 
> operations in om_alloc are unnecessary:
>
> 1518     for (int i = 1; i < _BLOCKSIZE; i++) {
> 1519       OrderAccess::release_store(&temp[i]._next_om,
>                                       (ObjectMonitor*)&temp[i+1]);
> 1520       assert(temp[i].is_free(), "invariant");
> 1521     }
> 1522
> 1523     // terminate the last monitor as the end of list
> 1524     OrderAccess::release_store(&temp[_BLOCKSIZE - 1]._next_om,
>                                     (ObjectMonitor*)NULL);
>
> "temp" is completely thread-local at this stage. As long as temp 
> itself is "published" with an appropriate memory barrier (which it is: 
> prepend_block_to_list stores it with cmpxchg), and then read with an 
> appropriate memory barrier (again it is) then all the updates you did 
> within the "temp" block of memory are guaranteed to be visible.

Agreed!

The history here is that I was going through and making sure all the stores
to the _next_om field were "safe". Again, a little too overzealous. Thanks
for catching this.

I'll change these to regular stores, e.g.:

     L1519: temp[i]._next_om = (ObjectMonitor*)&temp[i + 1];
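Applied to the whole block-init loop in om_alloc(), that becomes:

     for (int i = 1; i < _BLOCKSIZE; i++) {
       // temp is still thread-local here; prepend_block_to_list()
       // publishes it with cmpxchg() so plain stores suffice:
       temp[i]._next_om = (ObjectMonitor*)&temp[i + 1];
       assert(temp[i].is_free(), "invariant");
     }
     // terminate the last monitor as the end of list
     temp[_BLOCKSIZE - 1]._next_om = (ObjectMonitor*)NULL;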



> And again:
>
> 2470 void ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
> 2471   OrderAccess::release_store(&counters->n_in_use, 0);              // currently associated with objects
> 2472   OrderAccess::release_store(&counters->n_in_circulation, 0);      // extant
> 2473   OrderAccess::release_store(&counters->n_scavenged, 0);           // reclaimed (global and per-thread)
> 2474   OrderAccess::release_store(&counters->per_thread_scavenged, 0);  // per-thread scavenge total
> 2475   counters->per_thread_times = 0.0;                                // per-thread scavenge times
> 2476 }
>
> A release_store is not needed for each write. If "counters" is to be 
> published somewhere for another thread to examine then it suffices to 
> publish it with a suitable memory barrier. If you wanted to 
> internalize it you could add a storeStore() barrier at the end.

I'll switch these counter inits to a simple store, e.g.:

     L2471:   counters->n_in_use = 0;  // currently associated with objects

with a single storeStore() after them all or a comment stating which memory
sync we're relying on. It may be safer for future safepoint work juggling to
put the storeStore() there rather than relying on a comment and hoping for
unchanging code... :-)
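Here's what I have in mind (with the storeStore() variant):

     void ObjectSynchronizer::prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
       counters->n_in_use = 0;              // currently associated with objects
       counters->n_in_circulation = 0;      // extant
       counters->n_scavenged = 0;           // reclaimed (global and per-thread)
       counters->per_thread_scavenged = 0;  // per-thread scavenge total
       counters->per_thread_times = 0.0;    // per-thread scavenge times
       // Order these inits before any later stores that publish counters:
       OrderAccess::storeStore();
     }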


> These unnecessary load_acquire/store_release usages will hurt 
> performance on weakly ordered machines. And semantically they just 
> confuse the reader who is trying to determine how and why these loads 
> and stores are paired.

Agreed.


Okay, time to return to my uses of release_store() and load_acquire().
Consider this bit of code:

src/hotspot/share/runtime/synchronizer.cpp:

     L351: // Take an ObjectMonitor from the start of the specified list. Also
     L352: // decrements the specified counter. Returns NULL if none are available.
     L353: static ObjectMonitor* take_from_start_of_common(ObjectMonitor* volatile * list_p,
     L354:                                                 int volatile * count_p) {
     L355:   ObjectMonitor* next = NULL;
     L356:   ObjectMonitor* take = NULL;
     L357:   // Mark the list head to guard against A-B-A race:
     L358:   if (!mark_list_head(list_p, &take, &next)) {
     L359:     return NULL;  // None are available.
     L360:   }
     L361:   // Switch marked list head to next (which unmarks the list head, but
     L362:   // leaves take marked):
     L363:   OrderAccess::release_store(list_p, next);
     L364:   Atomic::dec(count_p);
     L365:   // Unmark take, but leave the next value for any lagging list
     L366:   // walkers. It will get cleaned up when take is prepended to
     L367:   // the in-use list:
     L368:   set_next(take, next);
     L369:   return take;
     L370: }


On L358, we call mark_list_head() and if that call returns true, then
the calling thread is the "owner" of the list head. My idea is that
once my thread "owns" the list head, I can use release_store() as a
smaller hammer than cmpxchg() to sync out the new value ('next') as
the new list head.

Should I be doing something other than release_store() here?

My thinking is that release_store() pairs up with the load_acquire()
of any reader thread that's trying to simply walk the list. And that
release_store() also pairs up with any _other_ writer thread that's
trying to mark the list head using:

     mark_list_head() -> mark_next() -> cmpxchg()

Here's mark_list_head():

     L194: // Mark the next field in the list head ObjectMonitor. If marking was
     L195: // successful, then the mid and the unmarked next field are returned
     L196: // via parameter and true is returned. Otherwise false is returned.
     L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
     L198:                            ObjectMonitor** mid_p, ObjectMonitor** next_p) {
     L199:   while (true) {
     L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
     L201:     if (mid == NULL) {
     L202:       return false;  // The list is empty so nothing to mark.
     L203:     }
     L204:     if (mark_next(mid, next_p)) {
     L205:       if (OrderAccess::load_acquire(list_p) != mid) {
     L206:         // The list head changed so we have to retry.
     L207:         set_next(mid, *next_p);  // unmark mid
     L208:         continue;
     L209:       }
     L210:       // We marked next field to guard against races.
     L211:       *mid_p = mid;
     L212:       return true;
     L213:     }
     L214:   }
     L215: }

which does a matching load_acquire() to get the current list head
and calls mark_next() to try and mark it. It then calls load_acquire()
again to verify that the list head hasn't changed while doing the
mark_next().

So here's mark_next():

  161 // Mark the next field in an ObjectMonitor. If marking was successful,
  162 // then the unmarked next field is returned via parameter and true is
  163 // returned. Otherwise false is returned.
  164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
  165   // Get current next field without any marking value.
  166   ObjectMonitor* next = (ObjectMonitor*)
  167       ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
  168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
  169     return false;  // Could not mark the next field or it was already marked.
  170   }
  171   *next_p = next;
  172   return true;
  173 }

We use load_acquire() on L166-7 to make sure that we have the latest
value in the next_om field. That load_acquire() matches up with the
release_store() done by another thread to unmark the next field (see
the set_next() function). Or it matches with the cmpxchg() done by
another thread to mark the next field. Yes, we could do a regular load
when we know that the field is only changed by cmpxchg(), but it could
have been changed by release_store() and my thinking is that we need
a matching load_acquire().
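For reference, set_next() is just a release_store() of the next field,
sketched here from the description above:

     static void set_next(ObjectMonitor* om, ObjectMonitor* value) {
       OrderAccess::release_store(&om->_next_om, value);
     }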

We have to use cmpxchg() here to try and set the mark on the next field.
If successful, then we return the value of the unmarked next field via
the 'next_p' param and we return true. However, just because we marked
the next field in the ObjectMonitor doesn't mean that we marked the list
head. That's what the second load_acquire() on L205 above verifies.


Okay, that was pretty gory on the details of my thinking! I think it's
pretty clear that I'm doing load_acquire()/release_store() pairs on
the list head pointers or the next_om fields to make sure that I match
loads and stores of _just_ the list head pointer or the next_om field
in question. I'm not trying to affect other fields in the ObjectMonitor,
or at least I don't think I have to do that.

Maybe I've misunderstood the load_acquire()/release_store() stuff (again)
and this code is doing way too much! I would be happy to have my thinking
about this load_acquire()/release_store() stuff corrected and the code
simplified. Maybe in this sequence:

     mark_list_head() -> mark_next() -> cmpxchg()

we don't need all the load_acquire() calls because the sequence only ends
successfully with a cmpxchg() and that makes memory happy everywhere with
just simple loads. Dunno. I definitely need more feedback here!



>
> ---
>
> A couple of minor comments/queries on other files ...
>
> ---
>
> src/hotspot/share/prims/jvm.cpp
>
> I'm not clear what the relationship between async deflation and 
> System.gc is. In this logic:
>
> +     if (AsyncDeflateIdleMonitors) {
> +       // AsyncDeflateIdleMonitors needs to know when System.gc() is
> +       // called so any special deflation can be done at a safepoint.
> +       ObjectSynchronizer::set_is_special_deflation_requested(true);
> +     }
>       Universe::heap()->collect(GCCause::_java_lang_system_gc);
>
> if we are using a GC that treats System.gc() as a no-op, is the fact 
> that we called set_is_special_deflation_requested(true) a problem?

I don't think we have a problem here. Let me take a step back and
explain the rationale here.

We have some tests that assume that GC will eventually cleanup any
unreferenced object. ObjectMonitor is a root so an object with an
associated ObjectMonitor won't get GC'ed until the ObjectMonitor
gets deflated.

     Object foo = new Object();
     synchronized (foo) {
         foo.wait(1);
     }
     // foo now has an associated ObjectMonitor
     foo = null;
     // no local ref so 'foo' should be GC'able (in theory)
     <do stuff to use up all of memory>
     <do System.gc() calls>
     <verify that foo has been GC'ed somehow>

Note: There are also one or two tests that assume that using up all of
memory is enough to cause 'foo' to be GC'ed (no System.gc() calls required).

When JavaThreads were responsible for deflating their own ObjectMonitors,
we had to have System.gc() request a special deflation so that the
ObjectMonitor associated with 'foo' would be deflated and 'foo' could be
GC'ed.

When we moved async deflation over to the ServiceThread, I kept this code
around because there's no guarantee when the ServiceThread will get
around to deflating the ObjectMonitor associated with 'foo'. I didn't
want to deal with intermittent test failures.

Okay, back to the question... If we have a GC that treats System.gc() as
a no-op, it's likely going to fail the test anyway because 'foo' will never
get GC'ed. The special cleanup request shouldn't break anything since all
that does is cause safepoint based deflation to happen during a safepoint
cleanup phase (like what happens in the baseline today).


>
> ---
>
> src/hotspot/share/runtime/serviceThread.cpp
>
> I have some concerns that the serviceThread must now wake up 
> periodically instead of only on demand. It has a lot of tasks to check 
> now and it is not obvious how complex or trivial those checks are. We 
> don't currently gather any statistics in this area so we have no idea 
> how long the serviceThread typically waits between requests. The fact 
> the serviceThread is woken on demand means that the desire to be 
> "checked at the same interval" seems exceedingly unlikely - the 
> serviceThread would have to be waiting for extended periods of time for 
> the timed-wait to have any real effect on the frequency of the async 
> monitor deflation.

Periodically waking up the ServiceThread is done to match the safepoint
cleanup period of GuaranteedSafepointInterval when doing safepoint based
deflation. Without waking up periodically, we could go a long time before
doing any deflation via the ServiceThread and that would definitely be
an observable change in behavior relative to safepoint based deflation.
In some of my test runs, I had seen us go for 8-27 seconds without doing
any async monitor deflation. Of course, that meant that there was more to
cleanup when we did do it.
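The mechanism itself is just a timed wait on Service_lock, roughly like
this sketch (assuming the existing ServiceThread wait loop is otherwise
unchanged):

     MonitorLocker ml(Service_lock, Mutex::_no_safepoint_check_flag);
     // Wake up at least every GuaranteedSafepointInterval ms so async
     // deflation is considered about as often as safepoint cleanup
     // would have run (a timeout of 0 means wait-on-demand as before):
     ml.wait(AsyncDeflateIdleMonitors ? GuaranteedSafepointInterval : 0);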

I've been thinking about having my own AsyncDeflateIdleMonitorsThread and
moving all the async deflation code from the ServiceThread there. So many
things are being added to the ServiceThread that I do wonder whether it is
overloaded, but (as you said) there's no way to know that (yet).

BTW, I think GuaranteedSafepointInterval might be "broken" at this point
in that we don't guarantee a safepoint at that interval anymore. This has
kind of slipped through the cracks...


> ---
>
> src/hotspot/share/runtime/thread.cpp
>
> No changes to this file.
>
> ---
>
> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>
> No changes to this file.

Correct. I'm unsure how to deal with this type of thing. This project
is managed as a series of patches. Earlier patches changed those files,
but a later patch undid those changes. So depending on where you are
in the patch queue, the logical 'files.list' contents are different.
I'm currently using a 'files.list' that reflects every file touched by
the current set of patches and that's why those two files are included
with no changes.

Suggestions?

Thanks for the partial review! I'm looking forward to resolving the
queries above and to your next set of review comments.

Dan


>
> ---
>
>
>
>> Dan
>>
>>
>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> The Async Monitor Deflation project is reaching the end game. I have no
>>> changes planned for the project at this time so all that is left is
>>> code review and any changes that result from those reviews.
>>>
>>> Carsten and Roman! Time for you guys to chime in again on the code 
>>> reviews.
>>>
>>> I have attached the list of fixes from CR6 to CR7 instead of putting it
>>> in the main body of this email.
>>>
>>> Main bug URL:
>>>
>>>     JDK-8153224 Monitor deflation prolong safepoints
>>>     https://bugs.openjdk.java.net/browse/JDK-8153224
>>>
>>> The project is currently baselined on jdk-14+19.
>>>
>>> Here's the full webrev URL for those folks that want to see all of the
>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>
>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full 
>>>
>>>
>>> Some folks might want to see just what has changed since the last 
>>> review
>>> cycle so here's a webrev for that (v2.07 inc):
>>>
>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/ 
>>>
>>>
>>> The OpenJDK wiki has been updated to match the 
>>> CR7/v2.07/10-for-jdk14 changes:
>>>
>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>
>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5 
>>> tier[1-8]
>>> testing on Oracle's usual set of platforms. It has also been through 
>>> my usual
>>> set of stress testing on Linux-X64, macOSX and Solaris-X64 with the 
>>> addition
>>> of Robbin's "MoCrazy 1024" test running in parallel with the other 
>>> tests in
>>> my lab.
>>>
>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5
>>> tier[1-3] testing on Oracle's usual set of platforms. Mach5 tier[4-8]
>>> are in process.
>>>
>>> I did another round of SPECjbb2015 testing in Oracle's Aurora
>>> Performance lab using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>
>>>     - "base" is jdk-14+18
>>>     - "v2.07" is the latest version and includes C2 
>>> inc_om_ref_count() support
>>>       on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors option
>>>     - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>     - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors 
>>> specified
>>>
>>>          hbIR           hbIR
>>>     (max attempted)  (settled)  max-jOPS  critical-jOPS runtime
>>>     ---------------  ---------  --------  ------------- -------
>>>            34282.00   30635.90  28831.30       20969.20 3841.30 base
>>>            34282.00   30973.00  29345.80       21025.20 3964.10 v2.07
>>>            34282.00   31105.60  29174.30       21074.00 3931.30 v2.07_handshake
>>>            34282.00   30789.70  27151.60       19839.10 3850.20 v2.07_off
>>>
>>>     - The Aurora Perf comparison tool reports:
>>>
>>>         Comparison              max-jOPS              critical-jOPS
>>>         ----------------------  --------------------  --------------------
>>>         base vs 2.07            +1.78% (s, p=0.000)   +0.27% (ns, p=0.790)
>>>         base vs 2.07_handshake  +1.19% (s, p=0.007)   +0.58% (ns, p=0.536)
>>>         base vs 2.07_off        -5.83% (ns, p=0.394)  -5.39% (ns, p=0.347)
>>>
>>>         (s) - significant  (ns) - not-significant
>>>
>>>     - For historical comparison, the Aurora Perf comparison tool
>>>         reported for v2.06 with a baseline of jdk-13+31:
>>>
>>>         Comparison              max-jOPS              critical-jOPS
>>>         ----------------------  --------------------  --------------------
>>>         base vs 2.06            -0.32% (ns, p=0.345)  +0.71% (ns, p=0.646)
>>>         base vs 2.06_off        +0.49% (ns, p=0.292)  -1.21% (ns, p=0.481)
>>>
>>>         (s) - significant  (ns) - not-significant
>>>
>>> Thanks, in advance, for any questions, comments or suggestions.
>>>
>>> Dan

<trimmed older review invites>


