RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)

Thu Oct 31 04:00:44 UTC 2019

Hi Dan,

First I've seen your additional tracking emails, but for now just want 
to respond to the load_acquire discussion below ...

On 30/10/2019 6:18 am, Daniel D. Daugherty wrote:
> Hi David,
> 
> Sorry for the delay in replying to this email. I spent most of last week
> migrating my development environment from my dying MBP13 to a new MBP13.
> I think I have everything recovered, but only time will tell...
> 
> More below...
> 
> On 10/23/19 12:00 AM, David Holmes wrote:
>> Hi Dan,
>>
>> Trimming to the open issues.
> 
> I'll try to do the same (but was only able to trim one thing)...
> 
> 
>>
>> On 23/10/2019 6:46 am, Daniel D. Daugherty wrote:
>>> Okay, time to return to my uses of release_store() and load_acquire().
>>> Consider this bit of code:
>>>
>>> src/hotspot/share/runtime/synchronizer.cpp:
>>>
>>>      L351: // Take an ObjectMonitor from the start of the specified 
>>> list. Also
>>>      L352: // decrements the specified counter. Returns NULL if none 
>>> are available.
>>>      L353: static ObjectMonitor* 
>>> take_from_start_of_common(ObjectMonitor* volatile * list_p,
>>>      L354:                                                 int 
>>> volatile * count_p) {
>>>      L355:   ObjectMonitor* next = NULL;
>>>      L356:   ObjectMonitor* take = NULL;
>>>      L357:   // Mark the list head to guard against A-B-A race:
>>>      L358:   if (!mark_list_head(list_p, &take, &next)) {
>>>      L359:     return NULL;  // None are available.
>>>      L360:   }
>>>      L361:   // Switch marked list head to next (which unmarks the 
>>> list head, but
>>>      L362:   // leaves take marked):
>>>      L363:   OrderAccess::release_store(list_p, next);
>>>      L364:   Atomic::dec(count_p);
>>>      L365:   // Unmark take, but leave the next value for any lagging 
>>> list
>>>      L366:   // walkers. It will get cleaned up when take is 
>>> prepended to
>>>      L367:   // the in-use list:
>>>      L368:   set_next(take, next);
>>>      L369:   return take;
>>>      L370: }
>>>
>>>
>>> On L358, we call mark_list_head() and if that call returns true, then
>>> the calling thread is the "owner" of the list head. My idea is that
>>> once my thread "owns" the list head, I can use release_store() as a
>>> smaller hammer than cmpxchg() to sync out the new value ('next') as
>>> the new list head.
>>>
>>> Should I be doing something other than release_store() here?
>>>
>>> My thinking is that release_store() pairs up with the load_acquire()
>>> of any reader thread that's trying to simply walk the list. And that
>>> release_store() also pairs up with any _other_ writer thread that's
>>> trying to mark the list head using:
>>>
>>>      mark_list_head() -> mark_next() -> cmpxchg()
>>>
>>> Here's mark_list_head():
>>>
>>>      L194: // Mark the next field in the list head ObjectMonitor. If 
>>> marking was
>>>      L195: // successful, then the mid and the unmarked next field 
>>> are returned
>>>      L196: // via parameter and true is returned. Otherwise false is 
>>> returned.
>>>      L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
>>>      L198:                            ObjectMonitor** mid_p, 
>>> ObjectMonitor** next_p) {
>>>      L199:   while (true) {
>>>      L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
>>>      L201:     if (mid == NULL) {
>>>      L202:       return false;  // The list is empty so nothing to mark.
>>>      L203:     }
>>>      L204:     if (mark_next(mid, next_p)) {
>>>      L205:       if (OrderAccess::load_acquire(list_p) != mid) {
>>>      L206:         // The list head changed so we have to retry.
>>>      L207:         set_next(mid, *next_p);  // unmark mid
>>>      L208:         continue;
>>>      L209:       }
>>>      L210:       // We marked next field to guard against races.
>>>      L211:       *mid_p = mid;
>>>      L212:       return true;
>>>      L213:     }
>>>      L214:   }
>>>      L215: }
>>>
>>> which does a matching load_acquire() to get the current list head
>>> and calls mark_next() to try and mark it. It then calls load_acquire()
>>> again to verify that the list head hasn't changed while doing the
>>> mark_next().
>>>
>>> So here's mark_next():
>>>
>>>   161 // Mark the next field in an ObjectMonitor. If marking was 
>>> successful,
>>>   162 // then the unmarked next field is returned via parameter and 
>>> true is
>>>   163 // returned. Otherwise false is returned.
>>>   164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
>>>   165   // Get current next field without any marking value.
>>>   166   ObjectMonitor* next = (ObjectMonitor*)
>>>   167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
>>>   168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) 
>>> != next) {
>>>   169     return false;  // Could not mark the next field or it was 
>>> already marked.
>>>   170   }
>>>   171   *next_p = next;
>>>   172   return true;
>>>   173 }
>>>
>>> We use load_acquire() on L166-7 to make sure that we have the latest
>>> value in the next_om field. That load_acquire() matches up with the
>>> release_store() done by another thread to unmark the next field (see
>>> the set_next() function). Or it matches with the cmpxchg() done by
>>> another to mark the next field. Yes, we could do a regular load when
>>> we know that the field is only changed by cmpxchg(), but it could
>>> have been changed by release_store() and my thinking is that we need
>>> a matching load_acquire().
>>>
>>> We have to use cmpxchg() here to try and set the mark on the next field.
>>> If successful, then we return the value of the unmarked next field via
>>> the 'next_p' param and we return true. However, just because we marked
>>> the next field in the ObjectMonitor doesn't mean that we marked the list
>>> head. That's what the second load_acquire() on L205 above verifies.
>>>
>>>
>>> Okay, that was pretty gory on the details of my thinking! I think it's
>>> pretty clear that I'm doing load_acquire()/release_store() pairs on
>>> the list head pointers or the next_om fields to make sure that I match
>>> loads and stores of _just_ the list head pointer or the next_om field
>>> in question. I'm not trying affect other fields in the ObjectMonitor
>>> or at least I don't think I have to do that.
>>>
>>> Maybe I've misunderstood the load_acquire()/release_store() stuff 
>>> (again)
>>> and this code is doing way too much! I would be happy to have my 
>>> thinking
>>> about this load_acquire()/release_store() stuff corrected and the code
>>> simplified. Maybe in this sequence:
>>>
>>>      mark_list_head() -> mark_next() -> cmpxchg()
>>>
>>> we don't need all the load_acquire() calls because the sequence only 
>>> ends
>>> successfully with a cmpxchg() and that makes memory happy everywhere 
>>> with
>>> just simple loads. Dunno. I definitely need more feedback here!
>>
>> Okay that was a massive deep dive :) but it is exactly what I was 
>> referring to when I said:
>>
>> "My main comment on all the list management code pertains to the
>> difficulty in seeing which release_store is paired with which
>> load_acquire, and why."
> 
> Yup. And that comment is exactly why I wrote the "massive deep dive"
> reply... so I _think_ I replied to the comment and thus proved the
> point that this stuff is difficult all at the same time. :-)
> 
> 
>> To understand the use of acquire/release/cmpxchg in detail you really 
>> need to see all that code inlined/flattened so that you can see how 
>> they actually arrange themselves.
> 
> Agreed...

To be clear I can't see the code that way so it is difficult to give 
concrete opinions on appropriateness of load_acquire/release_store at 
individual code locations.

> 
>> I'm immediately suspicious of a load_acquire that is just before a 
>> cmpxchg on the same field because the cmpxchg will subsume any memory 
>> affects of the load-acquire.
> 
> Okay, I'm trying to figure out if this comment is meant for L166, L167
> and L168 of mark_next():
> 
>    165   // Get current next field without any marking value.
>    166   ObjectMonitor* next = (ObjectMonitor*)
>    167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
>    168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != 
> next) {
>    169     return false;  // Could not mark the next field or it was 
> already marked.
>    170   }
>    171   *next_p = next;
>    172   return true;
> 
> or if this is just a general comment or both... :-)

Both.

> 
>> But the devil is in the detail.
> 
> Agreed and with this project there are so many details. :-(
> 
> 
>> If a field can be set by release_store, and there are previous stores 
>> related to that which must be visible, then a read of that field can't 
>> be via a plain load (but either load_acquire or as part of cmpxchg).
> 
> As the comment on L166 says, the goal of L167 and L168 is to get the
> current next field without any marking value. Since the next field can be
> changed by either cmpxchg() or release_store() and we have no idea which
> one made the most recent update, we have to use a load_acquire() to get
> the latest value of the next field.
> 
> As for the cmpxchg() on L168, the potential update operation has to be
> done with cmpxchg() in order to safely and exclusively allow only one
> thread to mark the ObjectMonitor's next field (or no threads if the next
> field is already marked).
> 
> So... I think I have good reasons for a load_acquire() to be immediately
> followed by a cmpxchg(), in this case anyway.

Actually no. :) The purpose of a load_acquire is to ensure that 
subsequent loads of other variables will return values written before 
the release_store that is paired with the load_acquire. In this code 
there are no subsequent loads between the load_acquire and the cmpxchg. 
The cmpxchg itself provides full bi-directional fence semantics, so it 
deals with any loads after the cmpxchg. Hence the acquire part of the 
load_acquire is simply not needed.

> 
> Of course, I'm assuming that this is the load_acquire(), cmpxchg() pair
> that you are suspicious of. It could be some other pair or it could be
> the pairs in general.

Anywhere there are no intervening loads, between the load_acquire, and 
the cmpxchg, the acquire part is not needed.

> 
>> But there are so many fields at play here that it is very difficult to 
>> keep track of everything.
>>
>> You mention above:
>>
>> "My idea is that once my thread "owns" the list head, I can use 
>> release_store() as a smaller hammer than cmpxchg() to sync out the new 
>> value ('next') as the new list head."
>>
>> and I tend to agree that the second cmpxchg is certainly not needed as 
>> there cannot be any contended update after you've marked the "next" 
>> field.
> 
> Not quite as strong of an agreement as I was hoping for, but I'll take 
> it. :-)
> 
> Is there a different memory update/sync operation I should be using for my
> "smaller hammer"?  I don't think so, but I could be missing something...
> 
> For example, I don't think I can switch from OrderAccess::release_store()
> to Atomic::store() and from OrderAccess::load_acquire() to Atomic::load()
> because I'll lose the {release, acquire} matching up component.

Well you wouldn't need Atomic operations regardless.

Whether you could use plain load/store depends, as we've already agreed, 
on whether other accesses are dependent on the release/acquire 
semantics. And I can't see whether they are not with the current 
structure of the code, so I have no concrete suggestions here so lets 
move on.

> 
> I also need to clarify one thing. In my reply to your previous set of
> comments, I wrote:
> 
>> I think it's
>> pretty clear that I'm doing load_acquire()/release_store() pairs on
>> the list head pointers or the next_om fields to make sure that I match
>> loads and stores of _just_ the list head pointer or the next_om field
>> in question. I'm not trying affect other fields in the ObjectMonitor
>> or at least I don't think I have to do that.
> 
> Now that I've had time to think about it, I have to retract that last
> sentence. When doing these lock free list manipulations, we often will
> change some other field in the ObjectMonitor as part of that manipulation.
> 
> For example, as part of deflation, we will clear the _object field when
> we extract the ObjectMonitor from the containing in-use list and before
> we prepend it to the global free list. A release_store() is used to update
> the next field in the (cur_mid_in_use) ObjectMonitor that refers to the
> deflated ObjectMonitor. That release_store() is important for sync'ing
> out the clearing of the _object field because once the deflated
> ObjectMonitor is unlinked from an in-use list, that field is no longer
> updated by GC.

Agreed.

> 
>> In essence the marking is implementing a spin lock
> 
> I like the "spin lock" analogy.
> 
> 
>> so once you have the "lock" you can use normal load/stores to make 
>> updates
> 
> Yup. I get it, e.g., we mark an ObjectMonitor so that we can deflate it
> and part of deflation is making some updates...
> 
> 
>> - but that assumes all updates occur under the "lock", that releasing 
>> the "lock" also has the right barriers and that any reads sync 
>> correctly with the writes. If any of that doesn't hold then 
>> release_store and load_acquire will be needed rather than plain stores 
>> and loads.
> 
> I _think_ that all updates that we make as part of list manipulation are
> done after marking the ObjectMonitor's next field and before we unmark
> that same next field. So I _think_ the "lock" is properly held...
> 
> Of course, it's always possible that I missed something, but that's what
> we have code reviewers for right? :-)

I love your sense of humour ;-)

> 
>> So again you'd need to look at all the flattened code paths to see how 
>> this all hangs together.
> 
> For my load_acquire() calls I have matched them up with the corresponding
> release_store() call. And in some cases the load_acquire() call site 
> matches
> up with a call site that does release_store() on one branch and cmpxchg()
> on the other. Yes, complicated and mind numbing... Sorry about that...
> 
> I've also looked at the release_store() calls and matched them up with a
> corresponding load_acquire().
> 
> However, I've looked at this code so many different times that I might be
> blind to some aspects of it so another pair of eyes looking for missing
> matches is always welcome.

Without seeing an inlined/flattened version of the code it's near 
impossible (for me at least) to see the code paths in detail.

> 
>>
>>>> src/hotspot/share/runtime/serviceThread.cpp
>>>>
>>>> I have some concerns that the serviceThread must now wakeup 
>>>> periodically instead of only on demand. It has a lot of tasks to 
>>>> check now and it is not obvious how complex or trivial those checks 
>>>> are. We don't currently gather any statistics in this area so we 
>>>> have no idea how long the serviceThread typically waits between 
>>>> requests. The fact the serviceThread is woken on demand means that 
>>>> the desire to be "checked at the same interval" seems exceedingly 
>>>> unlikely - the serviceThread would have to waiting for extended 
>>>> periods of time for the timed-wait to have any real affect on the 
>>>> frequency of the async monitor deflation.
>>>
>>> Periodically waking up the ServiceThread is done to match the safepoint
>>> cleanup period of GuaranteedSafepointInterval when doing safepoint based
>>> deflation. Without waking up periodically, we could go a long time 
>>> before
>>> doing any deflation via the ServiceThread and that would definitely be
>>> an observable change in behavior relative to safepoint based deflation.
>>> In some of my test runs, I had seen us go for 8-27 seconds without doing
>>> any async monitor deflation. Of course, that meant that there was 
>>> more to
>>> cleanup when we did do it.
>>
>> How does that occur? If the ServiceThread is woken when there are 
>> monitors to deflate then that implies that if it was not woken then 
>> there were none to deflate. Or turning it around, if after waiting for 
>> GuaranteedSafepointInterval ms the Service Thread finds there are 
>> monitors to deflate why wasn't it directly notified of that?
> 
> I think there's a bit of confusion here.
> 
> We don't wake up the ServiceThread when there are ObjectMonitors to
> deflate. We wake up the ServiceThread to see if it can deflate any
> ObjectMonitors. ObjectMonitor idleness or suitability to be deflated is
> determined by the ServiceThread. ObjectMonitor idleness is not a property
> determined by another part of the system and the ServiceThread is woken
> to handle it.

Right my mistake.

That's it for this email.

Thanks,
David
-----

> 
>>
>>> I've been thinking about have my own AsyncDeflateIdleMonitorsThread 
>>> and move
>>> all the async deflation code from ServiceThread there. So many things 
>>> are
>>> being added to the ServiceThread that I do wonder whether it is 
>>> overloaded,
>>> but (as you said) there's no way to know that (yet).
>>
>> Some stuff just got moved out of the ServiceThread to a new 
>> NotificationThread (which is a non-hidden Java thread), so the 
>> ServiceThread has a little more capacity (potentially) to handle async 
>> monitor deflation. But yes if the ServiceThread becomes a bottleneck 
>> it will defeat the purpose of offloading work from the VMThread at 
>> safepoints.
> 
> The other "benefit" of having a dedicated AsyncDeflateIdleMonitorsThread
> is that it would be easier for us to determine how much actual work is
> done by the whole ObjectMonitor deflation procedure. We can say XX% of
> the VMs execution time went to the AsyncDeflateIdleMonitorsThread.
> 
> Also, the work moved out of the ServiceThread is not an "always on" thing
> so the help for async deflation is limited at best.
> 
> 
>>
>>> BTW, I think GuaranteedSafepointInterval might be "broken" at this point
>>> in that we don't guarantee a safepoint at that interval anymore. This 
>>> has
>>> kind of slipped through the cracks...
>>
>> We can look at that separately.
> 
> Agreed. Not directly related to the Async Monitor Deflation project
> except as the explanation for how I went so long without seeing an
> async monitor deflation... Without that cleanup safepoint happening
> every GuaranteedSafepointInterval (and without the wait() timeout value),
> we don't have ObjectSynchronizer::do_safepoint_work() periodically
> waking up the ServiceThread to do an async monitor deflation sweep...
> 
> 
> 
>>
>>>
>>>> ---
>>>>
>>>> src/hotspot/share/runtime/thread.cpp
>>>>
>>>> No changes to this file.
>>>>
>>>> ---
>>>>
>>>> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>>>>
>>>> No changes to this file.
>>>
>>> Correct. I'm unsure how to deal with this type of thing. This project
>>> is managed as a series of patches. Earlier patches changed those files,
>>> but a later patch undid those changes. So depending on where you are
>>> in the patch queue, the logical 'files.list' contents are different.
>>> I'm currently using a 'files.list' that reflects every file touched by
>>> the current set of patches and that's why those two files are included
>>> with no changes.
>>>
>>> Suggestions?
>>
>> When you think you have the final version ready for review then you 
>> should flatten out the patches (I don't know the right terminology) so 
>> that the final webrev is accurate.
> 
> Good point. I've been so focused on different reviewers wanting to come
> at this mass of code from different phases of the development process and
> that led me to keep all the various version patches intact. I think we're
> past the point where we are going to back up to, e.g., v2.05, and then
> do a completely different lock free monitor list as a new v2.06.
> 
> Okay. When we roll to v2.08, v2.07 and predecessors will get flattened so
> that we have an easier review process.
> 
> 
>>
>>> Thanks for the partial review! I'm looking forward to resolving the
>>> queries about and your next set of review comments.
>>
>> I will try to look at the ObjectMonitor changes themselves this 
>> afternoon. So many non-trivial reviews/discussions at the moment, plus 
>> my own work :)
> 
> Thanks for spending so much time and energy on this code.
> 
> Dan
> 
> 
>>
>> Thanks,
>> David
>> -----
>>
>>> Dan
>>>
>>>
>>>>
>>>> ---
>>>>
>>>>
>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> The Async Monitor Deflation project is reaching the end game. I 
>>>>>> have no
>>>>>> changes planned for the project at this time so all that is left 
>>>>>> is code
>>>>>> review and any changes that results from those reviews.
>>>>>>
>>>>>> Carsten and Roman! Time for you guys to chime in again on the code 
>>>>>> reviews.
>>>>>>
>>>>>> I have attached the list of fixes from CR6 to CR7 instead of 
>>>>>> putting it
>>>>>> in the main body of this email.
>>>>>>
>>>>>> Main bug URL:
>>>>>>
>>>>>>     JDK-8153224 Monitor deflation prolong safepoints
>>>>>>     https://bugs.openjdk.java.net/browse/JDK-8153224
>>>>>>
>>>>>> The project is currently baselined on jdk-14+19.
>>>>>>
>>>>>> Here's the full webrev URL for those folks that want to see all of 
>>>>>> the
>>>>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>>>>
>>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full 
>>>>>>
>>>>>>
>>>>>> Some folks might want to see just what has changed since the last 
>>>>>> review
>>>>>> cycle so here's a webrev for that (v2.07 inc):
>>>>>>
>>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/ 
>>>>>>
>>>>>>
>>>>>> The OpenJDK wiki has been updated to match the 
>>>>>> CR7/v2.07/10-for-jdk14 changes:
>>>>>>
>>>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>>>>
>>>>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5 
>>>>>> tier[1-8]
>>>>>> testing on Oracle's usual set of platforms. It has also been 
>>>>>> through my usual
>>>>>> set of stress testing on Linux-X64, macOSX and Solaris-X64 with 
>>>>>> the addition
>>>>>> of Robbin's "MoCrazy 1024" test running in parallel with the other 
>>>>>> tests in
>>>>>> my lab.
>>>>>>
>>>>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5 
>>>>>> tier[1-3]
>>>>>> test on Oracle's usual set of platforms. Mach5 tier[4-8] are in 
>>>>>> process.
>>>>>>
>>>>>> I did another round of SPECjbb2015 testing in Oracle's Aurora 
>>>>>> Performance lab
>>>>>> using using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>>>>
>>>>>>     - "base" is jdk-14+18
>>>>>>     - "v2.07" is the latest version and includes C2 
>>>>>> inc_om_ref_count() support
>>>>>>       on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors 
>>>>>> option
>>>>>>     - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>>>>     - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors 
>>>>>> specified
>>>>>>
>>>>>>          hbIR           hbIR
>>>>>>     (max attempted)  (settled)  max-jOPS  critical-jOPS runtime
>>>>>>     ---------------  ---------  --------  ------------- -------
>>>>>>            34282.00   30635.90  28831.30       20969.20 3841.30 base
>>>>>>            34282.00   30973.00  29345.80       21025.20 3964.10 v2.07
>>>>>>            34282.00   31105.60  29174.30       21074.00 3931.30 
>>>>>> v2.07_handshake
>>>>>>            34282.00   30789.70  27151.60       19839.10 3850.20 
>>>>>> v2.07_off
>>>>>>
>>>>>>     - The Aurora Perf comparison tool reports:
>>>>>>
>>>>>>         Comparison              max-jOPS critical-jOPS
>>>>>>         ----------------------  -------------------- 
>>>>>> --------------------
>>>>>>         base vs 2.07            +1.78% (s, p=0.000) +0.27% (ns, 
>>>>>> p=0.790)
>>>>>>         base vs 2.07_handshake  +1.19% (s, p=0.007) +0.58% (ns, 
>>>>>> p=0.536)
>>>>>>         base vs 2.07_off        -5.83% (ns, p=0.394) -5.39% (ns, 
>>>>>> p=0.347)
>>>>>>
>>>>>>         (s) - significant  (ns) - not-significant
>>>>>>
>>>>>>     - For historical comparison, the Aurora Perf comparision tool
>>>>>>         reported for v2.06 with a baseline of jdk-13+31:
>>>>>>
>>>>>>         Comparison              max-jOPS critical-jOPS
>>>>>>         ----------------------  -------------------- 
>>>>>> --------------------
>>>>>>         base vs 2.06            -0.32% (ns, p=0.345) +0.71% (ns, 
>>>>>> p=0.646)
>>>>>>         base vs 2.06_off        +0.49% (ns, p=0.292) -1.21% (ns, 
>>>>>> p=0.481)
>>>>>>
>>>>>>         (s) - significant  (ns) - not-significant
>>>>>>
>>>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>>>
>>>>>> Dan
>>>
>>> <trimmed older review invites>
>>>
>