RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)

David Holmes david.holmes at oracle.com
Wed Oct 23 04:00:53 UTC 2019


Hi Dan,

Trimming to the open issues.

On 23/10/2019 6:46 am, Daniel D. Daugherty wrote:
> Okay, time to return to my uses of release_store() and load_acquire().
> Consider this bit of code:
> 
> src/hotspot/share/runtime/synchronizer.cpp:
> 
>      L351: // Take an ObjectMonitor from the start of the specified list. Also
>      L352: // decrements the specified counter. Returns NULL if none are available.
>      L353: static ObjectMonitor* take_from_start_of_common(ObjectMonitor* volatile * list_p,
>      L354:                                                 int volatile * count_p) {
>      L355:   ObjectMonitor* next = NULL;
>      L356:   ObjectMonitor* take = NULL;
>      L357:   // Mark the list head to guard against A-B-A race:
>      L358:   if (!mark_list_head(list_p, &take, &next)) {
>      L359:     return NULL;  // None are available.
>      L360:   }
>      L361:   // Switch marked list head to next (which unmarks the list head, but
>      L362:   // leaves take marked):
>      L363:   OrderAccess::release_store(list_p, next);
>      L364:   Atomic::dec(count_p);
>      L365:   // Unmark take, but leave the next value for any lagging list
>      L366:   // walkers. It will get cleaned up when take is prepended to
>      L367:   // the in-use list:
>      L368:   set_next(take, next);
>      L369:   return take;
>      L370: }
> 
> 
> 
> On L358, we call mark_list_head() and if that call returns true, then
> the calling thread is the "owner" of the list head. My idea is that
> once my thread "owns" the list head, I can use release_store() as a
> smaller hammer than cmpxchg() to sync out the new value ('next') as
> the new list head.
> 
> Should I be doing something other than release_store() here?
> 
> My thinking is that release_store() pairs up with the load_acquire()
> of any reader thread that's trying to simply walk the list. And that
> release_store() also pairs up with any _other_ writer thread that's
> trying to mark the list head using:
> 
>      mark_list_head() -> mark_next() -> cmpxchg()
> 
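> Just to make the reader side concrete, the kind of lock-free walker I
> have in mind looks something like this (a sketch only, not the actual
> list-walking code in the patch):
> 
>      static void sketch_walk_list(ObjectMonitor* volatile * list_p) {
>        // The load_acquire() of the list head pairs with a writer's
>        // release_store() of the new head:
>        ObjectMonitor* cur = OrderAccess::load_acquire(list_p);
>        while (cur != NULL) {
>          // ... examine cur ...
>          // The load_acquire() of the next field pairs with the
>          // release_store() in set_next() that unmarks it; strip any
>          // mark bit before following the pointer:
>          cur = (ObjectMonitor*)
>                ((intptr_t)OrderAccess::load_acquire(&cur->_next_om) & ~0x1);
>        }
>      }
> 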
> Here's mark_list_head():
> 
>      L194: // Mark the next field in the list head ObjectMonitor. If marking was
>      L195: // successful, then the mid and the unmarked next field are returned
>      L196: // via parameter and true is returned. Otherwise false is returned.
>      L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
>      L198:                            ObjectMonitor** mid_p, ObjectMonitor** next_p) {
>      L199:   while (true) {
>      L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
>      L201:     if (mid == NULL) {
>      L202:       return false;  // The list is empty so nothing to mark.
>      L203:     }
>      L204:     if (mark_next(mid, next_p)) {
>      L205:       if (OrderAccess::load_acquire(list_p) != mid) {
>      L206:         // The list head changed so we have to retry.
>      L207:         set_next(mid, *next_p);  // unmark mid
>      L208:         continue;
>      L209:       }
>      L210:       // We marked next field to guard against races.
>      L211:       *mid_p = mid;
>      L212:       return true;
>      L213:     }
>      L214:   }
>      L215: }
> 
> 
> which does a matching load_acquire() to get the current list head
> and calls mark_next() to try and mark it. It then calls load_acquire()
> again to verify that the list head hasn't changed while doing the
> mark_next().
> 
> So here's mark_next():
> 
>   161 // Mark the next field in an ObjectMonitor. If marking was successful,
>   162 // then the unmarked next field is returned via parameter and true is
>   163 // returned. Otherwise false is returned.
>   164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
>   165   // Get current next field without any marking value.
>   166   ObjectMonitor* next = (ObjectMonitor*)
>   167     ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
>   168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
>   169     return false;  // Could not mark the next field or it was already marked.
>   170   }
>   171   *next_p = next;
>   172   return true;
>   173 }
> 
> We use load_acquire() on L166-7 to make sure that we have the latest
> value in the next_om field. That load_acquire() matches up with the
> release_store() done by another thread to unmark the next field (see
> the set_next() function). Or it matches with the cmpxchg() done by
> another thread to mark the next field. Yes, we could do a regular load when
> we know that the field is only changed by cmpxchg(), but it could
> have been changed by release_store() and my thinking is that we need
> a matching load_acquire().
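> 
> For reference, set_next() is essentially just a release_store() of the
> (possibly marked) next value. A sketch (not the exact patch code):
> 
>      static void set_next(ObjectMonitor* om, ObjectMonitor* value) {
>        // release_store() so that a reader's load_acquire() of _next_om
>        // also observes any stores made before the next field was updated:
>        OrderAccess::release_store(&om->_next_om, value);
>      }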
> 
> We have to use cmpxchg() here to try and set the mark on the next field.
> If successful, then we return the value of the unmarked next field via
> the 'next_p' param and we return true. However, just because we marked
> the next field in the ObjectMonitor doesn't mean that we marked the list
> head. That's what the second load_acquire() on L205 above verifies.
> 
> 
> Okay, that was pretty gory on the details of my thinking! I think it's
> pretty clear that I'm doing load_acquire()/release_store() pairs on
> the list head pointers or the next_om fields to make sure that I match
> loads and stores of _just_ the list head pointer or the next_om field
> in question. I'm not trying to affect other fields in the ObjectMonitor
> or at least I don't think I have to do that.
> 
> Maybe I've misunderstood the load_acquire()/release_store() stuff (again)
> and this code is doing way too much! I would be happy to have my thinking
> about this load_acquire()/release_store() stuff corrected and the code
> simplified. Maybe in this sequence:
> 
>      mark_list_head() -> mark_next() -> cmpxchg()
> 
> we don't need all the load_acquire() calls because the sequence only ends
> successfully with a cmpxchg() and that makes memory happy everywhere with
> just simple loads. Dunno. I definitely need more feedback here!

Okay that was a massive deep dive :) but it is exactly what I was 
referring to when I said:

"My main comment on all the list management code pertains to the
difficulty in seeing which release_store is paired with which
load_acquire, and why."

To understand the use of acquire/release/cmpxchg in detail you really 
need to see all that code inlined/flattened so that you can see how they 
actually arrange themselves. I'm immediately suspicious of a 
load_acquire that is just before a cmpxchg on the same field because the 
cmpxchg will subsume any memory effects of the load-acquire. But the 
devil is in the detail. If a field can be set by release_store, and 
there are previous stores related to that which must be visible, then a 
read of that field can't be via a plain load (but either load_acquire or 
as part of cmpxchg).
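
To make that concrete (a sketch of the simplification I mean, not a 
proposal for the exact patch code): in mark_next() the loaded value is 
never dereferenced before the cmpxchg, so the initial load could be a 
plain load. If it is stale the cmpxchg simply fails and the caller 
retries; if it succeeds the cmpxchg supplies the ordering:

     static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
       // Plain load of the next field; a stale value just makes the
       // cmpxchg below fail. On success the cmpxchg provides the
       // acquire/release semantics we need.
       ObjectMonitor* next = (ObjectMonitor*)((intptr_t)om->_next_om & ~0x1);
       if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
         return false;  // Could not mark the next field or it was already marked.
       }
       *next_p = next;
       return true;
     }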

But there are so many fields at play here that it is very difficult to 
keep track of everything.

You mention above:

"My idea is that once my thread "owns" the list head, I can use 
release_store() as a smaller hammer than cmpxchg() to sync out the new 
value ('next') as the new list head."

and I tend to agree that the second cmpxchg is certainly not needed as 
there cannot be any contended update after you've marked the "next" 
field. In essence the marking is implementing a spin lock so once you 
have the "lock" you can use normal load/stores to make updates - but 
that assumes all updates occur under the "lock", that releasing the 
"lock" also has the right barriers and that any reads sync correctly 
with the writes. If any of that doesn't hold then release_store and 
load_acquire will be needed rather than plain stores and loads. So again 
you'd need to look at all the flattened code paths to see how this all 
hangs together.
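
To spell that out against the code above (illustrative only; this is 
essentially your take_from_start_of_common() re-annotated in spin-lock 
terms, not new code for the patch):

     static ObjectMonitor* sketch_take_head(ObjectMonitor* volatile * list_p) {
       ObjectMonitor* take = NULL;
       ObjectMonitor* next = NULL;
       // "Lock": the cmpxchg inside mark_list_head()/mark_next() takes the
       // lock and also gives us the ordering for what we read afterwards.
       if (!mark_list_head(list_p, &take, &next)) {
         return NULL;
       }
       // While we hold the "lock" no other writer can switch the head, so
       // no second cmpxchg is needed. The release_store() is the "unlock":
       // it publishes the new head and makes any plain stores done while
       // holding the lock visible to a reader's load_acquire().
       OrderAccess::release_store(list_p, next);
       set_next(take, next);  // release_store(): unmark take for lagging walkers
       return take;
     }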

>> src/hotspot/share/prims/jvm.cpp
>>
>> I'm not clear what the relationship between async deflation and 
>> System.gc is. In this logic:
>>
>> +     if (AsyncDeflateIdleMonitors) {
>> +       // AsyncDeflateIdleMonitors needs to know when System.gc() is
>> +       // called so any special deflation can be done at a safepoint.
>> +       ObjectSynchronizer::set_is_special_deflation_requested(true);
>> +     }
>>       Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>
>> if we are using a GC that treats System.gc() as a no-op, is the fact that
>> we called set_is_special_deflation_requested(true) a problem?
> 
> ...
> Okay, back to the question... If we have a GC that treats System.gc() as
> a no-op, it's likely going to fail the test anyway because 'foo' will never
> get GC'ed. The special cleanup request shouldn't break anything since all
> that does is cause safepoint based deflation to happen during a safepoint
> cleanup phase (like what happens in the baseline today).

Okay. Thanks for clarifying.

>> src/hotspot/share/runtime/serviceThread.cpp
>>
>> I have some concerns that the serviceThread must now wake up 
>> periodically instead of only on demand. It has a lot of tasks to check 
>> now and it is not obvious how complex or trivial those checks are. We 
>> don't currently gather any statistics in this area so we have no idea 
>> how long the serviceThread typically waits between requests. The fact 
>> that the serviceThread is woken on demand means that the desire to be 
>> "checked at the same interval" seems exceedingly unlikely - the 
>> serviceThread would have to wait for extended periods of time for 
>> the timed-wait to have any real effect on the frequency of the async 
>> monitor deflation.
> 
> Periodically waking up the ServiceThread is done to match the safepoint
> cleanup period of GuaranteedSafepointInterval when doing safepoint based
> deflation. Without waking up periodically, we could go a long time before
> doing any deflation via the ServiceThread and that would definitely be
> an observable change in behavior relative to safepoint based deflation.
> In some of my test runs, I had seen us go for 8-27 seconds without doing
> any async monitor deflation. Of course, that meant that there was more to
> clean up when we did do it.

How does that occur? If the ServiceThread is woken when there are 
monitors to deflate, then that implies that if it was not woken there 
were none to deflate. Or, turning it around, if after waiting for 
GuaranteedSafepointInterval ms the ServiceThread finds there are 
monitors to deflate, why wasn't it notified of that directly?

> I've been thinking about having my own AsyncDeflateIdleMonitorsThread and
> moving all the async deflation code from the ServiceThread there. So many
> things are being added to the ServiceThread that I do wonder whether it is
> overloaded, but (as you said) there's no way to know that (yet).

Some stuff just got moved out of the ServiceThread to a new 
NotificationThread (which is a non-hidden Java thread), so the 
ServiceThread has a little more capacity (potentially) to handle async 
monitor deflation. But yes if the ServiceThread becomes a bottleneck it 
will defeat the purpose of offloading work from the VMThread at safepoints.

> BTW, I think GuaranteedSafepointInterval might be "broken" at this point
> in that we don't guarantee a safepoint at that interval anymore. This has
> kind of slipped through the cracks...

We can look at that separately.

> 
>> ---
>>
>> src/hotspot/share/runtime/thread.cpp
>>
>> No changes to this file.
>>
>> ---
>>
>> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>>
>> No changes to this file.
> 
> Correct. I'm unsure how to deal with this type of thing. This project
> is managed as a series of patches. Earlier patches changed those files,
> but a later patch undid those changes. So depending on where you are
> in the patch queue, the logical 'files.list' contents are different.
> I'm currently using a 'files.list' that reflects every file touched by
> the current set of patches and that's why those two files are included
> with no changes.
> 
> Suggestions?

When you think you have the final version ready for review, you should 
flatten out the patches (I don't know the right terminology) so that 
the final webrev is accurate.

> Thanks for the partial review! I'm looking forward to resolving the
> queries above and your next set of review comments.

I will try to look at the ObjectMonitor changes themselves this 
afternoon. So many non-trivial reviews/discussions at the moment, plus 
my own work :)

Thanks,
David
-----

> Dan
> 
> 
>>
>> ---
>>
>>
>>
>>> Dan
>>>
>>>
>>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> The Async Monitor Deflation project is reaching the end game. I have no
>>>> changes planned for the project at this time so all that is left is code
>>>> review and any changes that result from those reviews.
>>>>
>>>> Carsten and Roman! Time for you guys to chime in again on the code 
>>>> reviews.
>>>>
>>>> I have attached the list of fixes from CR6 to CR7 instead of putting it
>>>> in the main body of this email.
>>>>
>>>> Main bug URL:
>>>>
>>>>     JDK-8153224 Monitor deflation prolong safepoints
>>>>     https://bugs.openjdk.java.net/browse/JDK-8153224
>>>>
>>>> The project is currently baselined on jdk-14+19.
>>>>
>>>> Here's the full webrev URL for those folks that want to see all of the
>>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full 
>>>>
>>>>
>>>> Some folks might want to see just what has changed since the last 
>>>> review
>>>> cycle so here's a webrev for that (v2.07 inc):
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/ 
>>>>
>>>>
>>>> The OpenJDK wiki has been updated to match the 
>>>> CR7/v2.07/10-for-jdk14 changes:
>>>>
>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>>
>>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5 tier[1-8]
>>>> testing on Oracle's usual set of platforms. It has also been through my usual
>>>> set of stress testing on Linux-X64, macOSX and Solaris-X64 with the addition
>>>> of Robbin's "MoCrazy 1024" test running in parallel with the other tests in
>>>> my lab.
>>>>
>>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5 tier[1-3]
>>>> testing on Oracle's usual set of platforms. Mach5 tier[4-8] are in progress.
>>>>
>>>> I did another round of SPECjbb2015 testing in Oracle's Aurora Performance
>>>> lab using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>>
>>>>     - "base" is jdk-14+18
>>>>     - "v2.07" is the latest version and includes C2 
>>>> inc_om_ref_count() support
>>>>       on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors option
>>>>     - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>>     - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors 
>>>> specified
>>>>
>>>>          hbIR           hbIR
>>>>     (max attempted)  (settled)  max-jOPS  critical-jOPS  runtime
>>>>     ---------------  ---------  --------  -------------  -------
>>>>            34282.00   30635.90  28831.30       20969.20  3841.30  base
>>>>            34282.00   30973.00  29345.80       21025.20  3964.10  v2.07
>>>>            34282.00   31105.60  29174.30       21074.00  3931.30  v2.07_handshake
>>>>            34282.00   30789.70  27151.60       19839.10  3850.20  v2.07_off
>>>>
>>>>     - The Aurora Perf comparison tool reports:
>>>>
>>>>         Comparison              max-jOPS              critical-jOPS
>>>>         ----------------------  --------------------  --------------------
>>>>         base vs 2.07            +1.78% (s, p=0.000)   +0.27% (ns, p=0.790)
>>>>         base vs 2.07_handshake  +1.19% (s, p=0.007)   +0.58% (ns, p=0.536)
>>>>         base vs 2.07_off        -5.83% (ns, p=0.394)  -5.39% (ns, p=0.347)
>>>>
>>>>         (s) - significant  (ns) - not-significant
>>>>
>>>>     - For historical comparison, the Aurora Perf comparison tool
>>>>         reported for v2.06 with a baseline of jdk-13+31:
>>>>
>>>>         Comparison              max-jOPS              critical-jOPS
>>>>         ----------------------  --------------------  --------------------
>>>>         base vs 2.06            -0.32% (ns, p=0.345)  +0.71% (ns, p=0.646)
>>>>         base vs 2.06_off        +0.49% (ns, p=0.292)  -1.21% (ns, p=0.481)
>>>>
>>>>         (s) - significant  (ns) - not-significant
>>>>
>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>
>>>> Dan
> 
> <trimmed older review invites>
> 

