JEP Draft: Concurrent Monitor Deflation

Wed Jun 21 21:33:06 UTC 2017

Hello everybody,

I just wanted to add some performance numbers that I got when I tried
Carsten's code.

I tested with a benchmark that exaggerates monitor usage; it is part of
gc-bench [0].

The following lists some pause times without and with concurrent
deflation, as tested with Shenandoah:

-AsyncDeflateIdleMonitors:
[15,278s][info][gc,stats         ] Total Pauses (G)            =     4,18 s (a =   149239 us) (n =    28) (lvls, us =   107422,   128906,   132812,   156250,   309391)
[15,278s][info][gc,stats         ] Total Pauses (N)            =     1,32 s (a =    47289 us) (n =    28) (lvls, us =       23,       34,    48438,    57031,   228483)
[15,278s][info][gc,stats         ] Initial Mark Pauses (G)     =     1,13 s (a =   161446 us) (n =     7) (lvls, us =   130859,   130859,   134766,   140625,   309388)
[15,278s][info][gc,stats         ] Initial Mark Pauses (N)     =     0,62 s (a =    89278 us) (n =     7) (lvls, us =    63672,    63672,    65039,    68164,   228482)
[15,278s][info][gc,stats         ]     S: Synchronizer Roots   =     0,11 s (a =    15778 us) (n =     7) (lvls, us =    11523,    11523,    13086,    16406,    24092)

+AsyncDeflateIdleMonitors:
[14,056s][info][gc,stats         ] Total Pauses (G)            =     2,19 s (a =    78046 us) (n =    28) (lvls, us =    46484,    49805,    67773,    85156,   154664)
[14,056s][info][gc,stats         ] Total Pauses (N)            =     1,33 s (a =    47462 us) (n =    28) (lvls, us =       28,       31,    51758,    66406,    85552)
[14,056s][info][gc,stats         ] Initial Mark Pauses (G)     =     0,64 s (a =    90731 us) (n =     7) (lvls, us =    79492,    79492,    79688,    80273,   154659)
[14,056s][info][gc,stats         ] Initial Mark Pauses (N)     =     0,55 s (a =    78823 us) (n =     7) (lvls, us =    76172,    76172,    77734,    78320,    85552)
[14,056s][info][gc,stats         ]     S: Synchronizer Roots   =     0,17 s (a =    24402 us) (n =     7) (lvls, us =    23438,    23438,    24023,    24805,    25238)

I.e., it roughly halved the average total pause time (G), from 149 ms
to 78 ms. Interestingly, sync root scanning has degraded (though not by
much relative to total pause time). This is probably not surprising:
since monitors are no longer deflated before the GC does its work, the
GC now has to scan more monitors than before. It's still an overall net
win, and we do have some ideas for how to mitigate this too.
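
As a quick cross-check of the figures above, the relative changes can be worked out from the `a =` averages in the two logs. The class and method names below are only illustrative; nothing here comes from the prototype.

```java
final class PauseDelta {
    // Relative change from 'beforeUs' to 'afterUs' in percent.
    // Negative values mean the time went down.
    static double percentChange(double beforeUs, double afterUs) {
        return (afterUs - beforeUs) / beforeUs * 100.0;
    }
}
```

With the numbers from the logs, `percentChange(149239, 78046)` comes to about -47.7 for Total Pauses (G), and `percentChange(15778, 24402)` to about +54.7 for Synchronizer Roots. In absolute terms that is roughly 71 ms off the average pause versus 8.6 ms added to the average Synchronizer Roots phase.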

If nobody has any objections to this JEP, I propose that Carsten submit
it before the end of this week.

Roman

On 05.06.2017 at 19:29, Carsten Varming wrote:
> Dear runtime-devs
>
> Below is a proposal for a JEP on deflating idle monitors while the
> Java threads are running. I have attached the JEP in standard text
> format as well.
>
> Summary
> -------
>
> Java monitors are implemented in three different ways in the JVM, and
> the JVM automatically switches from one implementation to the next as
> needed. Biased locking installs a thread pointer in the Java object:
> we say that the object is biased towards a thread, and that thread is
> the only thread that may lock the object. Once a second thread
> attempts to lock a biased object, biased locking is no longer
> sufficient, and the JVM switches to basic locking for that object.
> Basic locking uses compare-and-swap (CAS) operations to ensure
> mutually exclusive access to an object. If a CAS fails due to
> contention, i.e., a second thread attempts to lock an object while
> another thread already holds the lock on that object, then the basic
> lock implementation is no longer sufficient and the JVM switches to a
> full-blown monitor. Unlike basic locking (and biased locking),
> monitors require storage in the native heap, and a pointer to the
> allocated storage is installed in the Java object. We say that the
> Java monitor gets inflated.
>
> As basic locking is preferred over the use of monitors, the JVM
> attempts to "deflate" idle monitors in stop-the-world (STW) pauses,
> when the Java threads are stopped at safepoints. It does so by
> traversing all monitors (or the subset of monitors currently "used",
> depending on the value of MonitorInUseLists) and deflating the
> monitors that are not in use. Monitor deflation currently happens in a
> STW phase, where it can be determined that the VM thread is the only
> thread accessing the monitors. However, some programs use many
> monitors, and this deflation phase can be a source of long STW pauses.
> This JEP explores ways to perform the deflation concurrently with the
> running Java threads.
>
>
> Goals
> -----
>
> Decrease pause times by performing monitor deflation while the Java
> threads are running.
>
>
> Provide a fully functional re-implementation of Java object monitor
> deflation to reduce JVM-induced pauses. It is a goal of this JEP to
> trim the duration of safepoint cleanup, where monitor deflation is
> currently handled.
>
>
> Non-Goals
> ---------
>
> It is not a goal for this JEP to remove or disable the Java object
> monitor deflation mechanism. It is not the goal to reimplement any
> other Java object monitor handling machinery. It is not the goal to
> optimize safepoint cleanups generally.
>
>
> Success Metrics
> ---------------
>
> The JEP is considered successful if safepoint cleanup costs are
> significantly reduced without ballooning up the outstanding Java
> object monitor population in target applications or negatively
> affecting end to end throughput.
>
>
> Motivation
> ----------
>
> In its current implementation, monitor deflation is performed during
> every STW pause, while all Java threads are waiting at a safepoint. We
> have seen safepoint cleanup stalls of up to 200 ms on monitor-heavy
> applications. SPECjbb2015 and Cassandra are known to cause excessive
> monitor inflation and deflation. This proposal aims to make monitor
> deflation happen concurrently with running Java threads, so that it
> does not depend on the Java threads being stopped at a safepoint.
> According to various measurements, this should result in significant
> pause time improvements.
>
>
> Description
> -----------
>
> Several improvements to the deflation of monitors have been proposed
> and prototyped (e.g., using parallel threads to process monitors
> faster, or deflating monitors while scanning for GC roots), but the
> real breakthrough would be to remove the need to deflate monitors
> during a safepoint at all.
>
>
> Initial work has been provided by Carsten Varming. A monitor uses a
> number of fields to determine the actions needed to acquire it. A
> monitor is free if _owner is NULL, and it is owned by a thread T if
> _owner is T or if _owner points to a basic lock object on T's stack.
> Furthermore, when T fails to acquire the lock immediately, i.e., when
> T observes contention, T atomically increments the _count field
> before putting itself on the queue to acquire the monitor when it
> becomes free in the future. The _count field is then decremented when
> T acquires the monitor. As a thread always increments _count before
> decrementing it, _count is always an integer between 0 and the total
> number of threads.
>
> We propose to allow _owner to be set to a special marker value D
> ((void*) -1 in the initial implementation), and to extend the range of
> _count to -2^31 + 1 to 2^31 - 1. If _owner is D and _count is
> negative, then the monitor is marked as deflatable, and any thread
> should attempt to deflate the monitor by installing the displaced mark
> word into the mark word of the Java object associated with the
> monitor.
>
> A thread trying to acquire the monitor should, after atomically
> incrementing _count, check whether _count was negative and then
> whether _owner is D. If both checks succeed, in that order, then the
> monitor is deflatable and the thread should attempt to deflate the
> monitor. If either check fails, then the thread should continue with
> the current locking protocol, with the slight modification that if
> _owner is D, it should behave as if _owner were NULL.
>
> The thread trying to deflate idle monitors will attempt to make a
> monitor deflatable by atomically installing D in _owner if _owner is
> NULL, checking that _waiters is 0, and atomically installing
> -2^31 + 1 in _count if _count is 0. It then checks that _owner is
> still D. If any check fails, it backs out by adding 2^31 - 1 back to
> _count, provided the preceding update of _count succeeded. There is
> no need to attempt to install NULL back in _owner, as the other
> threads treat D and NULL as equivalent. If the monitor is
> successfully marked as deflatable, then any thread can safely attempt
> to install the displaced mark word back in the Java object associated
> with the monitor. After the monitor has been successfully deflated,
> it will be ready for recycling after the next STW pause.
>
> Installing the displaced mark word back into the Java object involves
> a slight complication, as another thread might try to atomically
> install a hash code in the displaced mark word. This can only happen
> if the hash code in the displaced mark word is 0. We propose to
> atomically set the mark bit in the displaced mark word to signal to
> other threads that they should not attempt to install a hash code in
> the displaced mark word in the monitor. This new use of the displaced
> mark word in a monitor is safe because the monitor is being recycled,
> and thus, at the beginning of and after the next STW pause, no thread
> will care about the values in the recycled monitor until it is once
> again in use.
>
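
The deflater-side steps described above can be sketched roughly as follows. This is a minimal Java model, not the prototype's C++ code: the class name, the sentinel object standing in for D, and the method names are all made up for illustration, and the displacement is assumed to be -(2^31 - 1), matching the stated range of _count. The fields mirror `_owner`, `_count` and `_waiters`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class MonitorSketch {
    // Hypothetical stand-in for the marker D ((void*) -1 in the prototype).
    static final Object DEFLATER = new Object();
    // Assumed displacement of count: -(2^31 - 1).
    static final int DISPLACED = -Integer.MAX_VALUE;

    final AtomicReference<Object> owner = new AtomicReference<>(null);
    final AtomicInteger count = new AtomicInteger(0);
    volatile int waiters = 0;

    // Deflater side: returns true if the monitor ends up marked deflatable.
    boolean tryMarkDeflatable() {
        // Step 1: claim a free monitor by installing D in owner.
        if (!owner.compareAndSet(null, DEFLATER)) {
            return false;
        }
        // Step 2: nobody may be waiting on the monitor. D is left behind on
        // failure, since other threads treat D like NULL.
        if (waiters != 0) {
            return false;
        }
        // Step 3: make count very negative, but only if it is currently 0.
        if (!count.compareAndSet(0, DISPLACED)) {
            return false;
        }
        // Step 4: re-check owner. If a thread acquired the monitor in the
        // meantime, undo the displacement with an atomic add, because other
        // threads may have changed count since step 3.
        if (owner.get() != DEFLATER) {
            count.addAndGet(Integer.MAX_VALUE);
            return false;
        }
        return true;
    }

    // The deflatable signal as described: owner is D and count is negative.
    boolean isDeflatable() {
        return owner.get() == DEFLATER && count.get() < 0;
    }
}
```

Note that a failed attempt in step 2 or 3 deliberately leaves D in `owner`; in this sketch that is harmless only under the assumption that every acquiring thread treats D and NULL alike.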
> A prototype can be found at
> http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/
>
>
> To avoid endless inflation/deflation cycles in the prototype, monitor
> deflation is only attempted the second time a monitor is seen by the
> thread marking monitors deflatable. (This is the only thread marking
> monitors as deflatable; it might be the service thread, some
> GC-related thread, or even a dedicated thread.) If that thread sees a
> monitor in state New, it marks the monitor as Old and moves on; only a
> monitor already marked Old is considered for deflation. There is
> therefore little interaction between a thread inflating a lock to a
> monitor and the deflating thread: the inflating thread just has to
> make sure that the monitor is marked New and that this marker is
> published using appropriate barriers.
>
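
The New/Old aging described above could look roughly like this. It is only a sketch: the class, enum and method names are invented here, and an `AtomicReference` is used as a simple way to get the publication barriers the text asks for.

```java
import java.util.concurrent.atomic.AtomicReference;

final class AgingSketch {
    enum Age { NEW, OLD }

    // set() on an AtomicReference has volatile-write semantics, so the
    // marker is published to the deflater thread.
    final AtomicReference<Age> age = new AtomicReference<>(Age.NEW);

    // Called by a thread that inflates a lock into this monitor.
    void onInflate() {
        age.set(Age.NEW);
    }

    // Called by the single deflater thread on each pass over the monitors.
    // Returns true if the monitor is a candidate for deflation on this pass.
    boolean visit() {
        if (age.compareAndSet(Age.NEW, Age.OLD)) {
            return false; // first sighting: only age the monitor
        }
        return true;      // already Old: seen before, may try to deflate
    }
}
```

A freshly inflated monitor therefore survives at least one full pass of the deflater thread before deflation is even attempted.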
>
> An interesting race is the one between a thread acquiring the monitor
> and the deflating thread trying to deflate the monitor. If the monitor
> is free (_owner == NULL), then the deflating thread attempts to
> install D in _owner to let everyone else know that it will attempt to
> deflate the monitor. This causes other threads to use the slow path on
> this monitor, but otherwise general threads do not distinguish between
> _owner == D and _owner == NULL. The D marker is used to ensure that no
> other thread has acquired the monitor while the deflater thread reads
> _waiters and displaces _count by -2^31 + 1. If the deflater thread
> manages to install D in _owner, reads _waiters == 0, makes _count very
> negative (which it does only if _count == 0), and then finds that
> _owner is still D, then _waiters must still be 0, as other threads
> have to acquire the monitor to increase _waiters, and _count is still
> negative (_count was displaced by a very large amount). That is the
> signal to other threads that the monitor is deflatable. The actual
> deflation (installing the displaced mark word in the Java object) is
> an idempotent operation, and both the deflater thread and general
> threads will attempt to complete the deflation. If any of the
> conditions mentioned above is not met, then the monitor is in use. In
> that case we restore _count if needed. If the deflating thread managed
> to install D in _owner but failed to make _count negative, then the
> next thread writing to _owner (acquiring the lock) erases the D
> installed by the deflater thread.
>
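
Seen from the acquiring thread, the race above might be handled along these lines. Again this is a hedged Java model rather than the actual HotSpot slow path: `AcquireSketch`, `Outcome`, `slowPathEnter`, the sentinel for D, and the `objectMark` field standing in for the Java object's mark word are all assumptions made for the example, and queuing is reduced to a return value.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class AcquireSketch {
    // Hypothetical stand-in for the marker D.
    static final Object DEFLATER = new Object();

    enum Outcome { ACQUIRED, MUST_WAIT, DEFLATED_RETRY }

    final AtomicReference<Object> owner = new AtomicReference<>(null);
    final AtomicInteger count = new AtomicInteger(0);
    // Models the object's mark word: it points at this monitor while
    // inflated and holds the displaced mark once deflated.
    final AtomicReference<Object> objectMark = new AtomicReference<>(this);
    final Object displacedMark = "displaced-mark";

    Outcome slowPathEnter(Object self) {
        // Announce contention first, then inspect the previous value.
        int previous = count.getAndIncrement();
        if (previous < 0 && owner.get() == DEFLATER) {
            // Deflatable: help finish the deflation. The CAS makes this
            // idempotent, since only the first attempt changes the mark.
            objectMark.compareAndSet(this, displacedMark);
            return Outcome.DEFLATED_RETRY; // caller re-locks via the object
        }
        // Otherwise follow the usual protocol, treating D like NULL.
        Object seen = owner.get();
        if ((seen == null || seen == DEFLATER)
                && owner.compareAndSet(seen, self)) {
            count.decrementAndGet();
            return Outcome.ACQUIRED;
        }
        return Outcome.MUST_WAIT; // count stays raised while queued
    }
}
```

In this model, acquiring a monitor whose `owner` still holds a stale D overwrites the marker, which is the "erases the D" behavior mentioned at the end of the paragraph.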
>
> The above patch works only when not using thread-local monitor in-use
> lists, i.e., the global monitor arrays are used. However, we don't
> think it will be difficult to extend the scheme to monitor in-use lists:
>
>
> - One could move the in-use lists from the Java threads to a global
> list inside safepoints and let the async deflater thread use the
> global list as the source of candidates for deflation. That way,
> in-use lists are still maintained thread-locally outside STW pauses,
> and lists of in-use monitors are still maintained, reducing the cost
> of going through monitors for root scanning.
>
> - An alternative to thread-local in-use lists for optimizing root
> scanning is to pack the oops previously stored in monitors very
> closely, by storing each oop in an array and storing an array index in
> each monitor. Then one only has to traverse the tightly packed array
> of oops instead of all the monitors. The allocated monitors would have
> to be extremely underutilized for traversing the monitor in-use lists
> to be faster than traversing the tightly packed array.
>
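
To make the second option a bit more concrete, here is a small single-threaded Java model of a packed oop array with an index stored in each monitor. All names are hypothetical, and the swap-with-last removal used to keep the array dense is an assumption of this sketch, not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.List;

final class OopTableSketch {
    static final class Monitor {
        int oopIndex = -1; // slot in the packed array, -1 when deflated
    }

    // Dense array of object references; this is all root scanning visits.
    private final List<Object> oops = new ArrayList<>();
    // Back-pointers so a slot can be moved when another one is freed.
    private final List<Monitor> slotOwner = new ArrayList<>();

    Monitor inflate(Object obj) {
        Monitor m = new Monitor();
        m.oopIndex = oops.size();
        oops.add(obj);
        slotOwner.add(m);
        return m;
    }

    void deflate(Monitor m) {
        int i = m.oopIndex;
        int last = oops.size() - 1;
        // Move the last entry into the freed slot so no holes appear.
        Monitor moved = slotOwner.get(last);
        oops.set(i, oops.get(last));
        slotOwner.set(i, moved);
        moved.oopIndex = i;
        oops.remove(last);
        slotOwner.remove(last);
        m.oopIndex = -1;
    }

    Object objectOf(Monitor m) {
        return oops.get(m.oopIndex);
    }

    int rootCount() {
        return oops.size();
    }
}
```

The point of the layout is that `rootCount()` tracks the number of inflated monitors exactly, so scanning cost does not depend on how many monitor structures have been allocated.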
>
> Discussions:
>
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2017-May/023337.html
>
>
> Relevant bug entry:
>
> https://bugs.openjdk.java.net/browse/JDK-8153224
>
> Alternatives
> ------------
>
> There are several alternatives for this proposal.
>
>
> Disable monitor deflation completely. This would leave inflated
> monitors on internal data structures forever. Those inflated monitor
> lists are GC roots, and therefore the associated objects would never
> become unreachable and be collected by the garbage collector. This
> would eventually result in running out of memory (Java heap or
> native). Apart from that, we would still need to scan all those
> monitors for marking at a pause, which means it would actually degrade
> pause times instead of improving them.
>
>
> Deflate monitors incrementally during safepoint cleanup. Monitor
> inflation could still outpace incremental deflation. This would
> require some adaptive heuristics or similar to make sure deflation can
> keep up. Excessive inflation would still have to be addressed by
> longer or more frequent pauses, neither of which would change the
> situation fundamentally. Also, as above, it means we would retain
> monitors in GC roots longer than necessary and potentially degrade
> pause times instead of improving them.
>
>
> Store monitors in the Java heap. This would avoid treating monitors as
> GC roots in the first place. It would require teaching the GCs to
> check whether the mark word is a pointer into the Java heap, and
> potentially to copy the monitor object and update the pointer in the
> mark word. When should monitors be deflated? In such a scheme we
> propose to deflate monitors when a Java thread releases the lock,
> using a simpler approach than the one outlined in this proposal (since
> the thread owns the monitor, it just has to make _count very negative
> to signal that this monitor is being deflated); the safepoint counter
> could be used to prevent more than one deflation per safepoint. Moving
> monitors to the Java heap would also require the displaced mark word
> to be moved from the first word in monitors in order to get an object
> layout compatible with Java objects.
>
>
> Testing
> -------
>
> The code changes a heavily used VM code path, so regular testing
> covers the testing needs. Additional stress tests for object monitor
> inflation/deflation would need to be performed. The change is supposed
> to be architecture-independent, but minute differences between
> platforms could introduce subtle bugs. Putting all tests into the
> regular test directories would help to test it on all platforms.
>
>
> While we expect that throughput performance impact will be little to
> none, this can be verified on standard benchmarks. The safepoint time
> improvements need to be demonstrated on targeted and standard
> workloads to justify the performance improvements.
>
>
> Risks and Assumptions
> ---------------------
>
> *Correctness.* This proposal touches very sensitive and peculiar
> synchronization code. The change should be obviously correct and
> understandable to avoid surprises. If the implementation proves too
> complicated, then accepting the implementation risk would be ill-advised.
>
>
> *Exposure*. It may be the case that concurrent deflation penalizes
> some applications, or that it works incorrectly in some overlooked
> corner cases. To mitigate this, we would need a command-line option
> that restores the legacy behavior.
>
>
>
> Dependencies
> ------------
>
> There are no dependencies for this JEP.
>