RFR(L) 8153224 Monitor deflation prolong safepoints

Mon Apr 15 19:19:21 UTC 2019

Dan,

Thank you for delving into this 

> On Apr 9, 2019, at 11:53 AM, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
> 
> Hi Carsten,
> 
> Thanks for responding to Karen's code review comments.
> 
> Karen,
> 
> I have a query for you down at the end of my reply...
> 
> More below...
> 
> On 4/5/19 11:01 PM, Carsten Varming wrote:
>> Dear Karen,
>> 
>> Please see inline answers.
>> 
>> On Fri, Apr 5, 2019 at 4:59 PM Karen Kinnear <karen.kinnear at oracle.com <mailto:karen.kinnear at oracle.com>> wrote:
>> 
>> 4. In EnterI: if _owner == DEFLATER_MARKER there are guarantees that 0 < _count
>> with comments that caller ensured _count <= 0
>> In ReenterI: guarantee 0 <= _count, with comment not _count < 0
>> — Am I missing something subtle here or should they be the same guarantees?
>> 
>> In ::enter _count is incremented when the thread is trying to acquire the monitor and decremented after the monitor has been acquired. The 0 < _count assertion is between those two point in the code. A thread acquiring a monitor and then calling wait will increment _count and then decrement _count as part of acquiring the monitor, thus _count can be 0 by the time the thread calls wait and when ReenterI is called.
>  
> I had a similar answer and I'm planning to tweak the comments
> and the guarantees a bit in the next round of code review (CR1);
> please see my reply to Karen's CR for the proposed changes.
Thanks again.
> 
> 
>> 
>> 9. install_displaced_markword_in_object
>> What happens if the cas_set_mark fails?
>> I get that today this handles the race with enter and deflate_monitor_using_JT. If we remove
>> the call from enter, is the expectation that we’ve blocked all others who did not set is_marked themselves?
>> If we remove the call from enter would it make sense to ensure that the cas_set_mark succeeds here?
>> 
>> I designed my original patch such that no thread would ever wait for the the deflating thread to finish deflating a monitor. If you remove install_displaced_markword_in_object from enter, then the entering thread can end up busy waiting by continuously reading the monitor pointer from the object mark word and then realizing that the monitor is being deflated and it should retry by going back to reading the object mark word. This bad behavior is completely avoided by calling install_displaced_markword_in_object.
>  
> Here's the code in question:
> 
> src/hotspot/share/runtime/objectMonitor.cpp:
> 
>> bool ObjectMonitor::enter(TRAPS) {
> <snip>
>>   // Prevent deflation. See ObjectSynchronizer::deflate_monitor() and is_busy().
>>   // Ensure the object-monitor relationship remains stable while there's contention.
>>   const jint count = Atomic::add(1, &_count);
>>   if (count <= 0 && _owner == DEFLATER_MARKER) {
>>     // Async deflation in progress. Help deflater thread install
>>     // the mark word (in case deflater thread is slow).
>>     install_displaced_markword_in_object();
>>     Self->_Stalled = 0;
>>     return false;  // Caller should retry. Never mind about _count as this monitor has been deflated.
>>   }
> 
> Our thread (T-enter) observes that the ObjectMonitor is being deflated
> by T-deflate, calls install_displaced_markword_in_object() and returns
> false to the caller which causes a retry.
> 
> Restoring the header/dmw from the ObjectMonitor to the object's header
> here isn't needed for correctness so it could be dropped (and would
> simplify the code). Your counterpoint is if we drop the call, then
> T-enter could do retry after retry if T-deflate is slow to get to its
> install_displaced_markword_in_object() call.
> 
> If T-enter calls install_displaced_markword_in_object(), then T-enter
> will do a single retry because the object T-enter is trying to lock
> will no longer have an ObjectMonitor. Okay I finally grok it...
> 
> I think we need to clarify the comment a bit:
> 
> >   if (count <= 0 && _owner == DEFLATER_MARKER) {
> >     // Async deflation is in progress. Attempt to restore the
> >     // header/dmw to the object's header so that we only retry once
> >     // if the deflater thread happens to be slow.
> >     install_displaced_markword_in_object();
> 
> 
>> In my original patch no thread would ever wait for a deflating thread to finish. This property got lost in FastHashCode as that function evolved since I wrote my patch, but I think this property is worth preserving where possible. It might even be worth looking at FastHashCode to see if we can re-establish this property.
>  
> Async Monitor Deflation causes races with FastHashCode() when the target
> object has an existing ObjectMonitor. Here's the base code:
> 
>>  768   } else if (mark->has_monitor()) {
>>  769     monitor = mark->monitor();
>>  770     temp = monitor->header();
>>  771     assert(temp->is_neutral(), "invariant");
>>  772     hash = temp->hash();
>>  773     if (hash) {
>>  774       return hash;
>>  775     }
>>  776     // Skip to the following code to reduce code size
> 
> The 'monitor' fetched on L769 is unstable due to Async Monitor
> Deflation and can cause an incorrect hash value to be returned.
> The solution is to protect the ObjectMonitor*:

> 
>>  775   } else if (mark->has_monitor()) {
>>  776     ObjectMonitorHandle omh;
>>  777     if (!omh.save_om_ptr(obj, mark)) {
>>  778       // Lost a race with async deflation so try again.
>>  779       assert(AsyncDeflateIdleMonitors, "sanity check");
>>  780       goto Retry;
>>  781     }
>>  782     monitor = omh.om_ptr();
>>  783     temp = monitor->header();
>>  784     assert(temp->is_neutral(), "invariant: header=" INTPTR_FORMAT, p2i((address)temp));
>>  785     hash = temp->hash();
>>  786     if (hash != 0) {
>>  787       return hash;
>>  788     }
>>  789     // Skip to the following code to reduce code size
> 
> where L776-L782 handle the protection duty and possible retry. So we
> have to protect the ObjectMonitor*, but, like enter(), we could call
> install_displaced_markword_in_object() when we retry which would limit
> T-hash to a single retry.

> 
> ObjectSynchronizer::inflate() has a similar collision and retry issue:
> 
>> 1456     // CASE: inflated
>> 1457     if (mark->has_monitor()) {
>> 1458       if (!omh_p->save_om_ptr(object, mark)) {
>> 1459         // Lost a race with async deflation so try again.
>> 1460         assert(AsyncDeflateIdleMonitors, "sanity check");
>> 1461         continue;
>> 1462       }
> 
> In this situation, inflate() discovers that the object already has an
> ObjectMonitor; the object may not have had one when inflate() was
> called, but it has one now. That particular race predates this project.
> 
> In any case, inflate() wants to return a stable ObjectMonitor* in the
> ObjectMonitorHandle, but if save_om_ptr() returns false, then inflate()
> has to retry. The only reason for save_om_ptr() to return false is due
> to a collision with Async Monitor Deflation. Like enter, we could call
> install_displaced_markword_in_object() when we retry which would limit
> inflate() to a single retry.
> 
> 
> Okay, I've evolved from thinking we could simplify the code by dropping
> install_displaced_markword_in_object() to thinking that I understand
> what install_displaced_markword_in_object() brings to the party. And now
> I'm proposing that we add 2 more install_displaced_markword_in_object()
> calls to limit retries on two more code paths.
> 
> Karen, are you convinced that install_displaced_markword_in_object() is
> useful?

Dan - between Carsten’s and your explanation - I get that 
1) there is value in being able to make forward progress on the object itself sooner, once we know
that DEFLATE_MARKER is trying to deflate it
2) since an enter() can CAS an _owner after DEFLATER_MARKER set but before _count -MAX_JINT,
then anyone else waiting for deflation to finish could wait a while until they actually get a turn
3) If the other callers were to use install_dmw… then they could have shorter retry cycles (although
with the enter, and now potentially other equivalent callers - more recontenders).

So I see the benefit in others trying this approach as well - to free the object from the not-currently-used
inflated monitor and retry.

You may have already sent out a new webrev with that approach - I have not worked through in my
head how that changes the details, since there are now more players, with different gates.

thanks for walking us through this,
Karen
> Dan
> 
> 
> 
>> 
>> I hope this helps.
>> 
>> Best,
>> Carsten
>>  
>>> On Apr 5, 2019, at 12:10 PM, Daniel D. Daugherty <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>> wrote:
>>> 
>>> Filed:
>>> 
>>>     JDK-8222034 Thread-SMR functions should be updated to remove work around
>>>     https://bugs.openjdk.java.net/browse/JDK-8222034 <https://bugs.openjdk.java.net/browse/JDK-8222034>
>>> 
>>> Martin and Robbin, please check it out and make sure that I captured
>>> things correctly...
>>> 
>>> Dan
>>> 
>>> 
>>> 
>>> On 4/5/19 12:01 PM, Daniel D. Daugherty wrote:
>>>> On 4/5/19 8:37 AM, Doerr, Martin wrote:
>>>>> Hi everybody,
>>>>> 
>>>>>> I think was fixed with:
>>>>>> 8202080: Introduce ordering semantics for Atomic::add/inc and other RMW atomics
>>>>>> You should get a leading sync and trailing one with the default conservative
>>>>>> model and thus get proper memory ordering.
>>>>>> Martin, I'm I correct?
>>>>> Exactly. Thanks for pointing this out. PPC uses the strongest possible ordering semantics with memory_order_conservative (default parameter).
>>>>> I've seen that comment about PPC in "void ThreadsList::inc_nested_handle_cnt()". This function could get replaced.
>>>> 
>>>> Okay so we need a new bug to update these two Thread-SMR functions:
>>>> 
>>>> src/hotspot/share/runtime/threadSMR.cpp:
>>>> 
>>>> void ThreadsList::dec_nested_handle_cnt() {
>>>>   // The decrement needs to be MO_ACQ_REL. At the moment, the Atomic::dec
>>>>   // backend on PPC does not yet conform to these requirements. Therefore
>>>>   // the decrement is simulated with an Atomic::sub(1, &addr).
>>>>   // Without this MO_ACQ_REL Atomic::dec simulation, the nested SMR mechanism
>>>>   // is not generally safe to use.
>>>>   Atomic::sub(1, &_nested_handle_cnt);
>>>> }
>>>> 
>>>> void ThreadsList::inc_nested_handle_cnt() {
>>>>   // The increment needs to be MO_SEQ_CST. At the moment, the Atomic::inc
>>>>   // backend on PPC does not yet conform to these requirements. Therefore
>>>>   // the increment is simulated with a load phi; cas phi + 1; loop.
>>>>   // Without this MO_SEQ_CST Atomic::inc simulation, the nested SMR mechanism
>>>>   // is not generally safe to use.
>>>>   intx sample = OrderAccess::load_acquire(&_nested_handle_cnt);
>>>>   for (;;) {
>>>>     if (Atomic::cmpxchg(sample + 1, &_nested_handle_cnt, sample) == sample) {
>>>>       return;
>>>>     } else {
>>>>       sample = OrderAccess::load_acquire(&_nested_handle_cnt);
>>>>     }
>>>>   }
>>>> }
>>>> 
>>>> I'll file a new bug, loop in Robbin, Erik O and Martin, and make
>>>> sure we're all in agreement. Once we decide that Thread-SMR's
>>>> functions look like, I'll adapt my Async Monitor Deflation
>>>> functions...
>>>> 
>>>> Dan
>>>> 
>>>> 
>>>>> 
>>>>> Best regards,
>>>>> Martin
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Robbin Ehn <robbin.ehn at oracle.com <mailto:robbin.ehn at oracle.com>>
>>>>> Sent: Freitag, 5. April 2019 14:07
>>>>> To: daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>; hotspot-runtime-dev at openjdk.java.net <mailto:hotspot-runtime-dev at openjdk.java.net>; Carsten Varming <varming at gmail.com <mailto:varming at gmail.com>>; Roman Kennke <rkennke at redhat.com <mailto:rkennke at redhat.com>>; Doerr, Martin <martin.doerr at sap.com <mailto:martin.doerr at sap.com>>
>>>>> Subject: Re: RFR(L) 8153224 Monitor deflation prolong safepoints
>>>>> 
>>>>> Hi Dan,
>>>>> 
>>>>> (Martin there is question for you last in this email)
>>>>> 
>>>>> After first pass I did not find any real issues.
>>>>> Considering what you had to work with, it looks good!
>>>>> 
>>>>> #1
>>>>> There are some assert which are redundant (to me at least) like:
>>>>> src/hotspot/share/runtime/objectMonitor.cpp
>>>>> L445
>>>>>     if (!dmw->is_marked() && dmw->hash() == 0) {
>>>>>       // This dmw is neutral and has not yet started the restoration
>>>>>       // protocol so we mark a copy of the dmw to begin the protocol.
>>>>>       markOop marked_dmw = dmw->set_marked();
>>>>>       assert(marked_dmw->is_marked() && marked_dmw->hash() == 0,
>>>>>              "sanity_check: is_marked=%d, hash=" INTPTR_FORMAT,
>>>>>              marked_dmw->is_marked(), marked_dmw->hash());
>>>>> 
>>>>> That assert is basically a test that set_marked worked?
>>>>> 
>>>>> L505
>>>>>       if (Atomic::cmpxchg(Self, &_owner, DEFLATER_MARKER) == DEFLATER_MARKER) {
>>>>>         assert(_succ != Self, "invariant");
>>>>>         assert(_owner == Self, "invariant");
>>>>> 
>>>>> Assert on _owner checks that our cmpxchg is not broken?
>>>>> 
>>>>> I think it's easier to read the code if some on the most obvious asserts are
>>>>> removed. Maybe comments instead.
>>>>> 
>>>>> #2
>>>>> Not your doing but I think we should remove TRAPS/Thread * Self and use
>>>>> JavaThread* instead.
>>>>> E.g. so we can change:
>>>>> void ObjectMonitor::EnterI(TRAPS) {
>>>>>     Thread * const Self = THREAD;
>>>>>     assert(Self->is_Java_thread(), "invariant");
>>>>>     assert(((JavaThread *) Self)->thread_state() == _thread_blocked, "invariant");
>>>>> 
>>>>> to:
>>>>> 
>>>>> void ObjectMonitor::EnterI(JavaThread* Self) {
>>>>>     assert(Self->thread_state() == _thread_blocked, "invariant");
>>>>> 
>>>>> #3
>>>>> src/hotspot/share/runtime/objectMonitor.inline.hpp
>>>>>    164 inline void ObjectMonitor::inc_ref_count() {
>>>>>    165   // The increment needs to be MO_SEQ_CST. At the moment, the Atomic::inc
>>>>>    166   // backend on PPC does not yet conform to these requirements. Therefore
>>>>>    167   // the increment is simulated with a load phi; cas phi + 1; loop.
>>>>>    168   // Without this MO_SEQ_CST Atomic::inc simulation, AsyncDeflateIdleMonitors
>>>>>    169   // is not safe.
>>>>> 
>>>>> I think was fixed with:
>>>>> 8202080: Introduce ordering semantics for Atomic::add/inc and other RMW atomics
>>>>> You should get a leading sync and trailing one with the default conservative
>>>>> model and thus get proper memory ordering.
>>>>> Martin, I'm I correct?
>>>>> 
>>>>> Thanks, Robbin
>>>>> 
>>>>> On 3/24/19 2:57 PM, Daniel D. Daugherty wrote:
>>>>>> Greetings,
>>>>>> 
>>>>>> Welcome to the OpenJDK review thread for my port of Carsten's work on:
>>>>>> 
>>>>>>       JDK-8153224 Monitor deflation prolong safepoints
>>>>>>       https://bugs.openjdk.java.net/browse/JDK-8153224 <https://bugs.openjdk.java.net/browse/JDK-8153224>
>>>>>> 
>>>>>> Here's a link to the OpenJDK wiki that describes my port:
>>>>>> 
>>>>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation <https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation>
>>>>>> 
>>>>>> Here's the webrev URL:
>>>>>> 
>>>>>> http://cr.openjdk.java.net/~dcubed/8153224-webrev/3-for-jdk13/ <http://cr.openjdk.java.net/~dcubed/8153224-webrev/3-for-jdk13/>
>>>>>> 
>>>>>> Here's a link to Carsten's original webrev:
>>>>>> 
>>>>>> http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/ <http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/>
>>>>>> 
>>>>>> Earlier versions of this patch have been through several rounds of
>>>>>> preliminary review. Many thanks to Carsten, Coleen, Robbin, and
>>>>>> Roman for their preliminary code review comments. A very special
>>>>>> thanks to Robbin and Roman for building and testing the patch in
>>>>>> their own environments (including specJBB2015).
>>>>>> 
>>>>>> This version of the patch has been thru Mach5 tier[1-8] testing on
>>>>>> Oracle's usual set of platforms. Earlier versions have been run
>>>>>> through my stress kit on my Linux-X64 and Solaris-X64 servers
>>>>>> (product, fastdebug, slowdebug).Earlier versions have run Kitchensink
>>>>>> for 12 hours on MacOSX, Linux-X64 and Solaris-X64 (product, fastdebug
>>>>>> and slowdebug). Earlier versions have run my monitor inflation stress
>>>>>> tests for 12 hours on MacOSX, Linux-X64 and Solaris-X64 (product,
>>>>>> fastdebug and slowdebug).
>>>>>> 
>>>>>> All of the testing done on earlier versions will be redone on the
>>>>>> latest version of the patch.
>>>>>> 
>>>>>> Thanks, in advance, for any questions, comments or suggestions.
>>>>>> 
>>>>>> Dan
>>>>>> 
>>>>>> P.S.
>>>>>> One subtest in gc/g1/humongousObjects/TestHumongousClassLoader.java
>>>>>> is currently failing in -Xcomp mode on Win* only. I've been trying
>>>>>> to characterize/analyze this failure for more than a week now. At
>>>>>> this point I'm convinced that Async Monitor Deflation is aggravating
>>>>>> an existing bug. However, I plan to have a better handle on that
>>>>>> failure before these bits are pushed to the jdk/jdk repo.
>>>> 
>>>> 
>>> 
>> 
>