RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Oct 30 16:00:50 UTC 2019
This is just a checklist email to make sure that I addressed
everything in this particular review email...
On 10/29/19 4:18 PM, Daniel D. Daugherty wrote:
> Hi David,
>
> Sorry for the delay in replying to this email. I spent most of last week
> migrating my development environment from my dying MBP13 to a new MBP13.
> I think I have everything recovered, but only time will tell...
>
> More below...
>
> On 10/23/19 12:00 AM, David Holmes wrote:
>> Hi Dan,
>>
>> Trimming to the open issues.
>
> I'll try to do the same (but was only able to trim one thing)...
>
>
>>
>> On 23/10/2019 6:46 am, Daniel D. Daugherty wrote:
>>> Okay, time to return to my uses of release_store() and load_acquire().
>>> Consider this bit of code:
>>>
>>> src/hotspot/share/runtime/synchronizer.cpp:
>>>
>>> L351: // Take an ObjectMonitor from the start of the specified list. Also
>>> L352: // decrements the specified counter. Returns NULL if none are available.
>>> L353: static ObjectMonitor* take_from_start_of_common(ObjectMonitor* volatile * list_p,
>>> L354:                                                 int volatile * count_p) {
>>> L355:   ObjectMonitor* next = NULL;
>>> L356:   ObjectMonitor* take = NULL;
>>> L357:   // Mark the list head to guard against A-B-A race:
>>> L358:   if (!mark_list_head(list_p, &take, &next)) {
>>> L359:     return NULL;  // None are available.
>>> L360:   }
>>> L361:   // Switch marked list head to next (which unmarks the list head, but
>>> L362:   // leaves take marked):
>>> L363:   OrderAccess::release_store(list_p, next);
>>> L364:   Atomic::dec(count_p);
>>> L365:   // Unmark take, but leave the next value for any lagging list
>>> L366:   // walkers. It will get cleaned up when take is prepended to
>>> L367:   // the in-use list:
>>> L368:   set_next(take, next);
>>> L369:   return take;
>>> L370: }
>>>
>>>
>>> On L358, we call mark_list_head() and if that call returns true, then
>>> the calling thread is the "owner" of the list head. My idea is that
>>> once my thread "owns" the list head, I can use release_store() as a
>>> smaller hammer than cmpxchg() to sync out the new value ('next') as
>>> the new list head.
>>>
>>> Should I be doing something other than release_store() here?
>>>
>>> My thinking is that release_store() pairs up with the load_acquire()
>>> of any reader thread that's trying to simply walk the list. And that
>>> release_store() also pairs up with any _other_ writer thread that's
>>> trying to mark the list head using:
>>>
>>> mark_list_head() -> mark_next() -> cmpxchg()
>>>
>>> Here's mark_list_head():
>>>
>>> L194: // Mark the next field in the list head ObjectMonitor. If marking was
>>> L195: // successful, then the mid and the unmarked next field are returned
>>> L196: // via parameter and true is returned. Otherwise false is returned.
>>> L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
>>> L198:                            ObjectMonitor** mid_p, ObjectMonitor** next_p) {
>>> L199:   while (true) {
>>> L200:     ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
>>> L201:     if (mid == NULL) {
>>> L202:       return false;  // The list is empty so nothing to mark.
>>> L203:     }
>>> L204:     if (mark_next(mid, next_p)) {
>>> L205:       if (OrderAccess::load_acquire(list_p) != mid) {
>>> L206:         // The list head changed so we have to retry.
>>> L207:         set_next(mid, *next_p);  // unmark mid
>>> L208:         continue;
>>> L209:       }
>>> L210:       // We marked next field to guard against races.
>>> L211:       *mid_p = mid;
>>> L212:       return true;
>>> L213:     }
>>> L214:   }
>>> L215: }
>>>
>>> which does a matching load_acquire() to get the current list head
>>> and calls mark_next() to try and mark it. It then calls load_acquire()
>>> again to verify that the list head hasn't changed while doing the
>>> mark_next().
>>>
>>> So here's mark_next():
>>>
>>> 161 // Mark the next field in an ObjectMonitor. If marking was successful,
>>> 162 // then the unmarked next field is returned via parameter and true is
>>> 163 // returned. Otherwise false is returned.
>>> 164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
>>> 165   // Get current next field without any marking value.
>>> 166   ObjectMonitor* next = (ObjectMonitor*)
>>> 167     ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
>>> 168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
>>> 169     return false;  // Could not mark the next field or it was already marked.
>>> 170   }
>>> 171   *next_p = next;
>>> 172   return true;
>>> 173 }
>>>
>>> We use load_acquire() on L166-7 to make sure that we have the latest
>>> value in the next_om field. That load_acquire() matches up with the
>>> release_store() done by another thread to unmark the next field (see
>>> the set_next() function). Or it matches with the cmpxchg() done by
>>> another to mark the next field. Yes, we could do a regular load when
>>> we know that the field is only changed by cmpxchg(), but it could
>>> have been changed by release_store() and my thinking is that we need
>>> a matching load_acquire().
>>>
>>> We have to use cmpxchg() here to try and set the mark on the next field.
>>> If successful, then we return the value of the unmarked next field via
>>> the 'next_p' param and we return true. However, just because we marked
>>> the next field in the ObjectMonitor doesn't mean that we marked the list
>>> head. That's what the second load_acquire() on L205 above verifies.
>>>
>>>
>>> Okay, that was pretty gory on the details of my thinking! I think it's
>>> pretty clear that I'm doing load_acquire()/release_store() pairs on
>>> the list head pointers or the next_om fields to make sure that I match
>>> loads and stores of _just_ the list head pointer or the next_om field
>>> in question. I'm not trying to affect other fields in the ObjectMonitor
>>> or at least I don't think I have to do that.
>>>
>>> Maybe I've misunderstood the load_acquire()/release_store() stuff (again)
>>> and this code is doing way too much! I would be happy to have my thinking
>>> about this load_acquire()/release_store() stuff corrected and the code
>>> simplified. Maybe in this sequence:
>>>
>>> mark_list_head() -> mark_next() -> cmpxchg()
>>>
>>> we don't need all the load_acquire() calls because the sequence only ends
>>> successfully with a cmpxchg() and that makes memory happy everywhere with
>>> just simple loads. Dunno. I definitely need more feedback here!
>>
>> Okay that was a massive deep dive :) but it is exactly what I was
>> referring to when I said:
>>
>> "My main comment on all the list management code pertains to the
>> difficulty in seeing which release_store is paired with which
>> load_acquire, and why."
>
> Yup. And that comment is exactly why I wrote the "massive deep dive"
> reply... so I _think_ I replied to the comment and thus proved the
> point that this stuff is difficult all at the same time. :-)
>
>
>> To understand the use of acquire/release/cmpxchg in detail you really
>> need to see all that code inlined/flattened so that you can see how
>> they actually arrange themselves.
>
> Agreed...
>
>
>> I'm immediately suspicious of a load_acquire that is just before a
>> cmpxchg on the same field because the cmpxchg will subsume any memory
>> effects of the load-acquire.
>
> Okay, I'm trying to figure out if this comment is meant for L166, L167
> and L168 of mark_next():
>
> 165   // Get current next field without any marking value.
> 166   ObjectMonitor* next = (ObjectMonitor*)
> 167     ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
> 168   if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
> 169     return false;  // Could not mark the next field or it was already marked.
> 170   }
> 171   *next_p = next;
> 172   return true;
>
> or if this is just a general comment or both... :-)
>
>
>> But the devil is in the detail.
>
> Agreed and with this project there are so many details. :-(
>
>
>> If a field can be set by release_store, and there are previous stores
>> related to that which must be visible, then a read of that field
>> can't be via a plain load (but either load_acquire or as part of
>> cmpxchg).
>
> As the comment on L166 says, the goal of L167 and L168 is to get the
> current next field without any marking value. Since the next field can be
> changed by either cmpxchg() or release_store() and we have no idea which
> one made the most recent update, we have to use a load_acquire() to get
> the latest value of the next field.
>
> As for the cmpxchg() on L168, the potential update operation has to be
> done with cmpxchg() in order to safely and exclusively allow only one
> thread to mark the ObjectMonitor's next field (or no threads if the next
> field is already marked).
>
> So... I think I have good reasons for a load_acquire() to be immediately
> followed by a cmpxchg(), in this case anyway.
>
>
> Of course, I'm assuming that this is the load_acquire(), cmpxchg() pair
> that you are suspicious of. It could be some other pair or it could be
> the pairs in general.
I took another look thru the code and I don't see an obvious
load_acquire(), cmpxchg() pair other than the one in mark_next()
so if you're happy with the above explanation, I think this
issue is done.
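To keep this checklist self-contained, here's a minimal standalone sketch of the marking idea in mark_next(), written with portable C++11 std::atomic instead of HotSpot's OrderAccess/Atomic wrappers. The Node type and the try_mark_next()/mark_ptr() names are illustrative only, not the actual HotSpot code; the point is the acquire load of the unmarked value followed by a CAS that only one thread can win:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ObjectMonitor; only the next field matters here.
struct Node {
  std::atomic<Node*> next_om{nullptr};
};

// Tag the low bit of a pointer to mark it (objects are at least 2-byte aligned).
static Node* mark_ptr(Node* p) {
  return reinterpret_cast<Node*>(reinterpret_cast<intptr_t>(p) | 0x1);
}
static Node* unmark_ptr(Node* p) {
  return reinterpret_cast<Node*>(reinterpret_cast<intptr_t>(p) & ~intptr_t(0x1));
}

// Sketch of the mark_next() pattern: load the current next value with acquire
// semantics and strip any mark, then try to CAS in the marked version. If
// another thread marked the field in the meantime, the observed value no
// longer matches the unmarked expectation and the CAS fails.
static bool try_mark_next(Node* om, Node** next_p) {
  Node* next = unmark_ptr(om->next_om.load(std::memory_order_acquire));
  Node* expected = next;
  if (!om->next_om.compare_exchange_strong(expected, mark_ptr(next),
                                           std::memory_order_acq_rel)) {
    return false;  // Lost the race, or the field was already marked.
  }
  *next_p = next;  // Return the unmarked successor to the caller.
  return true;
}
```

A second try_mark_next() on the same node fails until the mark is cleared, which is exactly the exclusion the real code relies on.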
>
>> But there are so many fields at play here that it is very difficult
>> to keep track of everything.
>>
>> You mention above:
>>
>> "My idea is that once my thread "owns" the list head, I can use
>> release_store() as a smaller hammer than cmpxchg() to sync out the
>> new value ('next') as the new list head."
>>
>> and I tend to agree that the second cmpxchg is certainly not needed
>> as there cannot be any contended update after you've marked the
>> "next" field.
>
> Not quite as strong of an agreement as I was hoping for, but I'll take
> it. :-)
>
> Is there a different memory update/sync operation I should be using for my
> "smaller hammer"? I don't think so, but I could be missing something...
>
> For example, I don't think I can switch from OrderAccess::release_store()
> to Atomic::store() and from OrderAccess::load_acquire() to Atomic::load()
> because I'll lose the {release, acquire} matching up component.
>
>
> I also need to clarify one thing. In my reply to your previous set of
> comments, I wrote:
>
>> I think it's
>> pretty clear that I'm doing load_acquire()/release_store() pairs on
>> the list head pointers or the next_om fields to make sure that I match
>> loads and stores of _just_ the list head pointer or the next_om field
>> in question. I'm not trying to affect other fields in the ObjectMonitor
>> or at least I don't think I have to do that.
>
> Now that I've had time to think about it, I have to retract that last
> sentence. When doing these lock free list manipulations, we often will
> change some other field in the ObjectMonitor as part of that
> manipulation.
>
> For example, as part of deflation, we will clear the _object field when
> we extract the ObjectMonitor from the containing in-use list and before
> we prepend it to the global free list. A release_store() is used to update
> the next field in the (cur_mid_in_use) ObjectMonitor that refers to the
> deflated ObjectMonitor. That release_store() is important for sync'ing
> out the clearing of the _object field because once the deflated
> ObjectMonitor is unlinked from an in-use list, that field is no longer
> updated by GC.
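Here's a hedged, portable sketch of that publication pattern (the Monitor type and the unlink_deflated()/walk_step() names are made up for illustration, not HotSpot code): the deflater clears the payload field with a plain store and then release-stores the new link, so any walker that acquire-loads the link and sees it is guaranteed to also see the cleared payload.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical simplified monitor: "object" is cleared during deflation and
// must be visible to any thread that observes the updated next link.
struct Monitor {
  std::atomic<void*> object{nullptr};
  std::atomic<Monitor*> next_om{nullptr};
};

// Writer (deflater): clear the payload with a plain/relaxed store, then
// publish by release-storing the predecessor's next pointer. The release
// ordering guarantees the cleared object field becomes visible no later
// than the new link does.
void unlink_deflated(Monitor* cur_mid_in_use, Monitor* deflated) {
  deflated->object.store(nullptr, std::memory_order_relaxed);
  Monitor* next = deflated->next_om.load(std::memory_order_acquire);
  cur_mid_in_use->next_om.store(next, std::memory_order_release);
}

// Reader (list walker): the acquire load pairs with the release store above,
// so seeing the new link implies seeing the cleared object field.
Monitor* walk_step(Monitor* cur) {
  return cur->next_om.load(std::memory_order_acquire);
}
```

This is the sense in which the release_store() on the next field is "important for sync'ing out" the earlier clearing of _object.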
>
>
>> In essence the marking is implementing a spin lock
>
> I like the "spin lock" analogy.
>
>
>> so once you have the "lock" you can use normal load/stores to make
>> updates
>
> Yup. I get it, e.g., we mark an ObjectMonitor so that we can deflate it
> and part of deflation is making some updates...
>
>
>> - but that assumes all updates occur under the "lock", that releasing
>> the "lock" also has the right barriers and that any reads sync
>> correctly with the writes. If any of that doesn't hold then
>> release_store and load_acquire will be needed rather than plain
>> stores and loads.
>
> I _think_ that all updates that we make as part of list manipulation are
> done after marking the ObjectMonitor's next field and before we unmark
> that same next field. So I _think_ the "lock" is properly held...
>
> Of course, it's always possible that I missed something, but that's what
> we have code reviewers for right? :-)
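For the record, here's how I read the "spin lock" analogy in portable C++11, assuming the same low-bit marking scheme as above (OM, om_lock() and om_unlock() are illustrative names only): acquiring the mark is lock(), plain stores happen while it is held, and the release store that unmarks the field is unlock() with the barrier that publishes those stores.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical node whose marked next field doubles as a per-node spin lock.
struct OM {
  std::atomic<intptr_t> next_om{0};
  int contentions{0};  // example field updated "under the lock"
};

// Acquire the "lock": spin until we CAS the unmarked next value to its
// marked form. The acquire ordering on success pairs with the release in
// om_unlock() below.
intptr_t om_lock(OM* om) {
  while (true) {
    intptr_t next = om->next_om.load(std::memory_order_acquire) & ~intptr_t(1);
    intptr_t expected = next;
    if (om->next_om.compare_exchange_weak(expected, next | 1,
                                          std::memory_order_acquire)) {
      return next;  // unmarked next value, held by the lock owner
    }
  }
}

// Release the "lock": a release store of the unmarked value publishes every
// plain store made while the mark was held.
void om_unlock(OM* om, intptr_t next) {
  om->next_om.store(next, std::memory_order_release);
}
```

As David notes, this only works if every update really does happen between om_lock() and om_unlock(); any field written outside the mark still needs its own release_store()/load_acquire() pairing.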
>
>
>> So again you'd need to look at all the flattened code paths to see
>> how this all hangs together.
>
> For my load_acquire() calls I have matched them up with the corresponding
> release_store() call. And in some cases the load_acquire() call site matches
> up with a call site that does release_store() on one branch and cmpxchg()
> on the other. Yes, complicated and mind numbing... Sorry about that...
>
> I've also looked at the release_store() calls and matched them up with a
> corresponding load_acquire().
>
> However, I've looked at this code so many different times that I might be
> blind to some aspects of it so another pair of eyes looking for missing
> matches is always welcome.
>
>
>>
>>>> src/hotspot/share/runtime/serviceThread.cpp
>>>>
>>>> I have some concerns that the serviceThread must now wakeup
>>>> periodically instead of only on demand. It has a lot of tasks to
>>>> check now and it is not obvious how complex or trivial those checks
>>>> are. We don't currently gather any statistics in this area so we
>>>> have no idea how long the serviceThread typically waits between
>>>> requests. The fact the serviceThread is woken on demand means that
>>>> the desire to be "checked at the same interval" seems exceedingly
>>>> unlikely - the serviceThread would have to be waiting for extended
>>>> periods of time for the timed-wait to have any real effect on the
>>>> frequency of the async monitor deflation.
>>>
>>> Periodically waking up the ServiceThread is done to match the safepoint
>>> cleanup period of GuaranteedSafepointInterval when doing safepoint based
>>> deflation. Without waking up periodically, we could go a long time before
>>> doing any deflation via the ServiceThread and that would definitely be
>>> an observable change in behavior relative to safepoint based deflation.
>>> In some of my test runs, I had seen us go for 8-27 seconds without doing
>>> any async monitor deflation. Of course, that meant that there was more
>>> to clean up when we did do it.
>>
>> How does that occur? If the ServiceThread is woken when there are
>> monitors to deflate then that implies that if it was not woken then
>> there were none to deflate. Or turning it around, if after waiting
>> for GuaranteedSafepointInterval ms the Service Thread finds there are
>> monitors to deflate why wasn't it directly notified of that?
>
> I think there's a bit of confusion here.
>
> We don't wake up the ServiceThread when there are ObjectMonitors to
> deflate. We wake up the ServiceThread to see if it can deflate any
> ObjectMonitors. ObjectMonitor idleness or suitability to be deflated is
> determined by the ServiceThread. ObjectMonitor idleness is not a property
> determined by another part of the system and the ServiceThread is woken
> to handle it.
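To make the wakeup scheme concrete, the on-demand-plus-timeout pattern being discussed can be sketched portably like this (ServiceLoop and its members are illustrative names, not the actual ServiceThread code): the thread is woken immediately when someone posts work, but also times out every GuaranteedSafepointInterval-ish period so the idle-monitor check still runs when nobody notifies it.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of a ServiceThread-style wait loop: notify_work() is
// the on-demand wakeup; wait_for_work() returns after either a notification
// or a periodic timeout, and the caller checks for deflatable monitors in
// both cases.
class ServiceLoop {
  std::mutex lock_;
  std::condition_variable cv_;
  bool work_pending_ = false;

 public:
  void notify_work() {
    std::lock_guard<std::mutex> g(lock_);
    work_pending_ = true;
    cv_.notify_one();
  }

  // Returns true if woken for explicit work, false on a periodic timeout.
  // Either way the caller then runs its idleness scan.
  bool wait_for_work(std::chrono::milliseconds interval) {
    std::unique_lock<std::mutex> g(lock_);
    bool notified = cv_.wait_for(g, interval, [this] { return work_pending_; });
    work_pending_ = false;
    return notified;
  }
};
```

The false return (timeout) is the case that matters here: it is what keeps deflation from going 8-27 seconds without a sweep when nothing explicitly wakes the thread.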
>
>
>>
>>> I've been thinking about having my own AsyncDeflateIdleMonitorsThread
>>> and moving all the async deflation code from ServiceThread there. So
>>> many things are being added to the ServiceThread that I do wonder
>>> whether it is overloaded, but (as you said) there's no way to know
>>> that (yet).
>>
>> Some stuff just got moved out of the ServiceThread to a new
>> NotificationThread (which is a non-hidden Java thread), so the
>> ServiceThread has a little more capacity (potentially) to handle
>> async monitor deflation. But yes if the ServiceThread becomes a
>> bottleneck it will defeat the purpose of offloading work from the
>> VMThread at safepoints.
>
> The other "benefit" of having a dedicated AsyncDeflateIdleMonitorsThread
> is that it would be easier for us to determine how much actual work is
> done by the whole ObjectMonitor deflation procedure. We can say XX% of
> the VM's execution time went to the AsyncDeflateIdleMonitorsThread.
>
> Also, the work moved out of the ServiceThread is not an "always on" thing
> so the help for async deflation is limited at best.
I'm not adding an AsyncDeflateIdleMonitorsThread at this time.
>
>
>>
>>> BTW, I think GuaranteedSafepointInterval might be "broken" at this point
>>> in that we don't guarantee a safepoint at that interval anymore. This has
>>> kind of slipped through the cracks...
>>
>> We can look at that separately.
>
> Agreed. Not directly related to the Async Monitor Deflation project
> except as the explanation for how I went so long without seeing an
> async monitor deflation... Without that cleanup safepoint happening
> every GuaranteedSafepointInterval (and without the wait() timeout value),
> we don't have ObjectSynchronizer::do_safepoint_work() periodically
> waking up the ServiceThread to do an async monitor deflation sweep...
>
>
>
>>
>>>
>>>> ---
>>>>
>>>> src/hotspot/share/runtime/thread.cpp
>>>>
>>>> No changes to this file.
>>>>
>>>> ---
>>>>
>>>> test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
>>>>
>>>> No changes to this file.
>>>
>>> Correct. I'm unsure how to deal with this type of thing. This project
>>> is managed as a series of patches. Earlier patches changed those files,
>>> but a later patch undid those changes. So depending on where you are
>>> in the patch queue, the logical 'files.list' contents are different.
>>> I'm currently using a 'files.list' that reflects every file touched by
>>> the current set of patches and that's why those two files are included
>>> with no changes.
>>>
>>> Suggestions?
>>
>> When you think you have the final version ready for review then you
>> should flatten out the patches (I don't know the right terminology)
>> so that the final webrev is accurate.
>
> Good point. I've been so focused on different reviewers wanting to come
> at this mass of code from different phases of the development process and
> that led me to keep all the various version patches intact. I think we're
> past the point where we are going to back up to, e.g., v2.05, and then
> do a completely different lock free monitor list as a new v2.06.
>
> Okay. When we roll to v2.08, v2.07 and predecessors will get flattened so
> that we have an easier review process.
I'll make sure things are flattened for the CR8 review.
>
>
>>
>>> Thanks for the partial review! I'm looking forward to resolving the
>>> queries above and your next set of review comments.
>>
>> I will try to look at the ObjectMonitor changes themselves this
>> afternoon. So many non-trivial reviews/discussions at the moment,
>> plus my own work :)
>
> Thanks for spending so much time and energy on this code.
>
> Dan
>
>
>>
>> Thanks,
>> David
>> -----
I decided not to add an AsyncDeflateIdleMonitorsThread at this time
and I don't see any other issue that requires changes.
Please let me know if I've missed anything.
Dan
>>
>>> Dan
>>>
>>>
>>>>
>>>> ---
>>>>
>>>>
>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>> On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> The Async Monitor Deflation project is reaching the end game. I have no
>>>>>> changes planned for the project at this time so all that is left is code
>>>>>> review and any changes that result from those reviews.
>>>>>>
>>>>>> Carsten and Roman! Time for you guys to chime in again on the
>>>>>> code reviews.
>>>>>>
>>>>>> I have attached the list of fixes from CR6 to CR7 instead of putting
>>>>>> it in the main body of this email.
>>>>>>
>>>>>> Main bug URL:
>>>>>>
>>>>>> JDK-8153224 Monitor deflation prolong safepoints
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8153224
>>>>>>
>>>>>> The project is currently baselined on jdk-14+19.
>>>>>>
>>>>>> Here's the full webrev URL for those folks that want to see all of the
>>>>>> current Async Monitor Deflation code in one go (v2.07 full):
>>>>>>
>>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full
>>>>>>
>>>>>>
>>>>>> Some folks might want to see just what has changed since the last review
>>>>>> cycle so here's a webrev for that (v2.07 inc):
>>>>>>
>>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/
>>>>>>
>>>>>>
>>>>>> The OpenJDK wiki has been updated to match the
>>>>>> CR7/v2.07/10-for-jdk14 changes:
>>>>>>
>>>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>>>>>
>>>>>>
>>>>>> The jdk-14+18 based v2.07 version of the patch has been thru Mach5
>>>>>> tier[1-8] testing on Oracle's usual set of platforms. It has also been
>>>>>> through my usual set of stress testing on Linux-X64, macOSX and
>>>>>> Solaris-X64 with the addition of Robbin's "MoCrazy 1024" test running
>>>>>> in parallel with the other tests in my lab.
>>>>>>
>>>>>> The jdk-14+19 based v2.07 version of the patch has been thru Mach5
>>>>>> tier[1-3] testing on Oracle's usual set of platforms. Mach5 tier[4-8]
>>>>>> are in process.
>>>>>>
>>>>>> I did another round of SPECjbb2015 testing in Oracle's Aurora
>>>>>> Performance lab using their tuned SPECjbb2015 Linux-X64 G1 configs:
>>>>>>
>>>>>> - "base" is jdk-14+18
>>>>>> - "v2.07" is the latest version and includes C2 inc_om_ref_count()
>>>>>>   support on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors
>>>>>>   option
>>>>>> - "off" is with -XX:-AsyncDeflateIdleMonitors specified
>>>>>> - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors specified
>>>>>>
>>>>>>       hbIR            hbIR
>>>>>> (max attempted)  (settled)  max-jOPS  critical-jOPS  runtime
>>>>>> ---------------  ---------  --------  -------------  -------
>>>>>>        34282.00   30635.90  28831.30       20969.20  3841.30  base
>>>>>>        34282.00   30973.00  29345.80       21025.20  3964.10  v2.07
>>>>>>        34282.00   31105.60  29174.30       21074.00  3931.30  v2.07_handshake
>>>>>>        34282.00   30789.70  27151.60       19839.10  3850.20  v2.07_off
>>>>>>
>>>>>> - The Aurora Perf comparison tool reports:
>>>>>>
>>>>>> Comparison              max-jOPS              critical-jOPS
>>>>>> ----------------------  --------------------  --------------------
>>>>>> base vs 2.07            +1.78% (s, p=0.000)   +0.27% (ns, p=0.790)
>>>>>> base vs 2.07_handshake  +1.19% (s, p=0.007)   +0.58% (ns, p=0.536)
>>>>>> base vs 2.07_off        -5.83% (ns, p=0.394)  -5.39% (ns, p=0.347)
>>>>>>
>>>>>> (s) - significant (ns) - not-significant
>>>>>>
>>>>>> - For historical comparison, the Aurora Perf comparison tool
>>>>>>   reported for v2.06 with a baseline of jdk-13+31:
>>>>>>
>>>>>>   Comparison        max-jOPS              critical-jOPS
>>>>>>   ----------------  --------------------  --------------------
>>>>>>   base vs 2.06      -0.32% (ns, p=0.345)  +0.71% (ns, p=0.646)
>>>>>>   base vs 2.06_off  +0.49% (ns, p=0.292)  -1.21% (ns, p=0.481)
>>>>>>
>>>>>> (s) - significant (ns) - not-significant
>>>>>>
>>>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>>>
>>>>>> Dan
>>>
>>> <trimmed older review invites>
>>>
>
>