JEP Draft: Concurrent Monitor Deflation

Tue Jun 13 12:25:44 UTC 2017

Hi,

I think the suggested approach has a good ROI and the prototype code looks reasonable.
On the downside, this complicates the monitor code even further, but I still think it is a good idea.

We have started an internal discussion on how to address the need for worker threads in the runtime.
That will probably show up as a JEP candidate after the summer.

Thanks for writing this up and for the prototype code!

/Robbin

On 06/05/2017 07:29 PM, Carsten Varming wrote:
> Dear runtime-devs
> 
> Below is a proposal for a JEP on deflating idle monitors while the Java threads are running. I have attached the JEP in standard text format as well.
> 
> Summary
> 
> -------
> 
> 
> Java monitors are implemented in three different ways in the JVM, and the JVM automatically switches from one implementation to the next as needed. Biased locking installs a
> thread pointer in the Java object: we say that the object is biased towards a thread, and that thread is the only thread that may lock the object. Once a second thread
> attempts to lock a biased object, biased locking is no longer sufficient, and the JVM switches to basic locking for that object. Basic locking uses compare-and-swap (CAS)
> operations to ensure mutual exclusion on an object. If a CAS fails due to contention, i.e., a second thread attempts to lock an object while another thread already holds
> the lock of that object, then the basic lock implementation is no longer sufficient and the JVM switches to a full-blown monitor. Unlike the basic lock implementation (and
> biased locking), monitors require storage in the native heap, and a pointer to the allocated storage is installed in the Java object. We say the Java monitor gets inflated.
> As the basic locking implementation is preferred over the use of monitors, the JVM attempts to "deflate" idle monitors in stop-the-world (STW) pauses, when the Java threads
> are stopped at safepoints, by traversing all monitors (or the subset of monitors currently "used", depending on the value of MonitorInUseLists) and "deflating" the monitors
> not in use. Monitor deflation currently happens in an STW phase, where it can be determined that the VM thread is the only thread accessing the monitors. However, some
> programs use many monitors, and this deflation phase can be a source of long STW pauses. This JEP explores ways to perform the deflation concurrently with the running Java threads.
> 
> 
> Goals
> 
> -----
> 
> 
> Decrease pause times by performing monitor deflation while the Java threads are running.
> 
> 
> Provide a fully functional re-implementation of Java object monitor deflation to reduce JVM-induced pauses. It is a goal of this JEP to trim the duration of safepoint cleanup,
> where monitor deflation is currently handled.
> 
> 
> Non-Goals
> 
> ---------
> 
> 
> It is not a goal for this JEP to remove or disable the Java object monitor deflation mechanism. It is not the goal to reimplement any other Java object monitor handling 
> machinery. It is not the goal to optimize safepoint cleanups generally.
> 
> 
> Success Metrics
> 
> ---------------
> 
> 
> The JEP is considered successful if safepoint cleanup costs are significantly reduced without ballooning the outstanding Java object monitor population in target
> applications or negatively affecting end-to-end throughput.
> 
> 
> Motivation
> 
> ----------
> 
> 
> In its current implementation, monitor deflation is performed during every STW pause, while all Java threads are waiting at a safepoint. We have seen safepoint cleanup
> stalls of up to 200 ms on monitor-heavy applications. SPECjbb2015 and Cassandra are known to cause excessive monitor inflation and deflation. This proposal aims to make monitor
> deflation happen concurrently with running Java threads and not depend on the Java threads being stopped at a safepoint. According to various measurements, this should
> result in significant pause time improvements.
> 
> 
> Description
> 
> -----------
> 
> 
> Several improvements to the deflation of monitors have been proposed and prototyped (e.g., using parallel threads to process monitors faster, deflating monitors while scanning
> for GC roots, etc.), but the real breakthrough would be to remove the need to deflate monitors during a safepoint at all.
> 
> 
> Initial work has been provided by Carsten Varming. A monitor uses a number of fields to determine the actions needed to acquire it. A monitor is free if _owner is NULL, and
> it is owned by a thread T if _owner is T or if _owner points to a basic lock object on T's stack. Furthermore, when T fails to acquire the lock immediately, i.e., when T
> observes contention, T atomically increments the _count field before putting itself on the queue to acquire the monitor when it becomes free. The _count field is then
> decremented when T acquires the monitor. As a thread always increments _count before decrementing it, _count is always an integer between 0 and the total number of threads.
> 
> We propose to allow _owner to be set to a special marker value D ((void*) -1 in the initial implementation), and to extend the range of _count to -2^31 + 1 through 2^31 - 1.
> If _owner is D and _count is negative, then the monitor is marked as deflatable, and any thread should attempt to deflate the monitor by installing the displaced mark
> word into the mark word of the Java object associated with the monitor. A thread trying to acquire the monitor should, after atomically incrementing _count, check whether
> _count was negative and then whether _owner is D. If both checks succeed in that order, then the monitor is deflatable and the thread should attempt to deflate it. If
> either check fails, then the thread should continue with the current locking protocol, with the slight modification that if _owner is D, it should behave as if _owner is NULL.
> 
> The thread trying to deflate idle monitors attempts to make a monitor deflatable by atomically installing D in _owner if _owner is NULL, checking that _waiters is 0,
> and atomically installing -2^31 + 1 in _count if _count is 0. It then checks that _owner is still D. If any check fails, it backs out by adding 2^31 - 1 back to _count if
> the preceding update of _count succeeded. There is no need to install NULL back in _owner, as the other threads treat D and NULL as equivalent. If the monitor
> is successfully marked as deflatable, then any thread can safely attempt to install the displaced mark word back in the Java object associated with the monitor. After the
> monitor has been successfully deflated, it will be ready for recycling after the next STW pause.
> 
> Installing the displaced mark word back into the Java object involves a slight complication, as another thread might try to atomically install a hash code in the displaced
> mark word. This can only happen if the hash code in the displaced mark word is 0. We propose to atomically set the mark bit in the displaced mark word to signal to other
> threads that they should not attempt to install a hash code in the displaced mark word in the monitor. This new use of the displaced mark word in a monitor is safe, as the
> monitor is being recycled, and thus from the beginning of the next STW pause onward no thread will care about the values in the recycled monitor until it is once again in use.
> 
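
As a reading aid, the marking protocol described above can be sketched in Java. This is a minimal model under stated assumptions, not the HotSpot code: the class `MonitorSketch`, its method names, and the use of a sentinel object for D are hypothetical. Only _owner, _count, _waiters, the marker D, and the displacement value -2^31 + 1 come from the proposal.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the monitor fields named in the proposal.
final class MonitorSketch {
    // Stand-in for the marker D ((void*) -1 in the prototype).
    static final Object D = new Object();
    // -2^31 + 1: the value installed in _count to flag "deflatable".
    static final int DISPLACED = -Integer.MAX_VALUE;

    final AtomicReference<Object> owner = new AtomicReference<>(null); // _owner
    final AtomicInteger count = new AtomicInteger(0);                  // _count
    volatile int waiters = 0;                                          // _waiters

    // Deflater side: returns true if the monitor is now marked deflatable.
    // On failure D may stay in _owner, since other threads treat it as NULL.
    boolean tryMarkDeflatable() {
        Object o = owner.get();
        if (o != null && o != D) return false;                        // owned by a thread
        if (o == null && !owner.compareAndSet(null, D)) return false; // lost a race
        if (waiters != 0) return false;                               // threads are waiting
        if (!count.compareAndSet(0, DISPLACED)) return false;         // contenders are queued
        if (owner.get() != D) {                                       // someone acquired it
            count.addAndGet(Integer.MAX_VALUE);                       // undo the displacement
            return false;
        }
        return true;
    }

    // Contended acquirer: increment _count first, then test for the
    // deflatable state (_count was negative, then _owner is D).
    boolean enterSeesDeflatable() {
        int before = count.getAndIncrement();
        return before < 0 && owner.get() == D;
    }

    // Normal acquisition treats D exactly like NULL.
    boolean tryAcquire(Object thread) {
        return owner.compareAndSet(null, thread) || owner.compareAndSet(D, thread);
    }
}
```

A caller that gets true from `enterSeesDeflatable()` would go on to help install the displaced mark word and then retry locking the object; that step is not modeled here.
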
> A prototype can be found at http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/
> 
> 
> To avoid endless inflation / deflation cycles in the prototype, monitor deflation is only attempted the second time a monitor is seen by the thread marking monitors
> deflatable. If that thread (the only thread marking monitors as deflatable; it might be the service thread, some GC-related thread, or even a dedicated thread) sees a monitor in
> state New, then it marks the monitor as Old and moves on. There is therefore little interaction between a thread inflating a lock to a monitor and the deflating thread:
> the inflating thread just has to make sure the monitor is marked New and that this marker is published using appropriate barriers.
> 
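
The New/Old aging rule above can be illustrated with a small Java sketch. The names `AgingSketch`, `Monitor`, `Age`, and `scan` are hypothetical. A `volatile` field stands in for "published using appropriate barriers", which is an assumption about how the prototype publishes the marker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two-pass aging rule used to avoid thrashing.
final class AgingSketch {
    enum Age { NEW, OLD }

    static final class Monitor {
        // Set to NEW by the inflating thread; volatile models the publication barrier.
        volatile Age age = Age.NEW;
        final String name;
        Monitor(String name) { this.name = name; }
    }

    // One pass of the single marking thread: a monitor seen for the first time
    // is aged to OLD and skipped; only monitors already OLD become candidates.
    static List<Monitor> scan(List<Monitor> monitors) {
        List<Monitor> candidates = new ArrayList<>();
        for (Monitor m : monitors) {
            if (m.age == Age.NEW) {
                m.age = Age.OLD;
            } else {
                candidates.add(m);
            }
        }
        return candidates;
    }
}
```

With this rule, a monitor inflated just before a scan survives at least until the following scan, so a lock that is contended in bursts is not deflated and re-inflated on every pass.
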
> 
> An interesting race is between a thread acquiring the monitor and the deflating thread trying to deflate the monitor. If the monitor is free (_owner == NULL), then the
> deflating thread attempts to install D in _owner to let everyone else know that it will attempt to deflate the monitor. This causes other threads to use the slow path on
> this monitor, but otherwise general threads do not distinguish between _owner == D and _owner == NULL. The D marker is used to ensure that no other thread has
> acquired the monitor while the deflater thread reads _waiters and displaces _count by -2^31 + 1. If the deflater thread manages to install D in _owner, read _waiters == 0, and
> make _count very negative if _count == 0, and _owner is still D, then _waiters must still be 0, as other threads have to acquire the monitor to increase _waiters, and _count is still
> negative (_count was displaced by a very large amount). That is the signal to other threads that the monitor is deflatable. The actual deflation (installing the displaced
> mark word in the Java object) is an idempotent operation, and both the deflater thread and general threads will attempt to complete the deflation. If any of the conditions
> mentioned above is not met, then the monitor is in use. In that case we restore _count if needed. If the deflating thread managed to install D in _owner but failed to make
> _count negative, then the next thread writing to _owner (acquiring the lock) erases the D installed by the deflater thread.
> 
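
To make the race concrete, the following Java sketch replays one interleaving step by step on a single thread. It is only an illustration: `RaceSketch`, `replay`, and the string "T" standing in for the acquiring thread are hypothetical, and _waiters is assumed to be 0 throughout. It shows why the deflater must re-check _owner after displacing _count: an uncontended acquirer overwrites D without ever touching _count.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical single-threaded replay of the deflater/acquirer race.
final class RaceSketch {
    static final Object D = new Object();            // stand-in for the marker D
    static final int DISPLACED = -Integer.MAX_VALUE; // -2^31 + 1

    final AtomicReference<Object> owner = new AtomicReference<>(null); // _owner
    final AtomicInteger count = new AtomicInteger(0);                  // _count

    // Returns true if the monitor ends up marked deflatable.
    boolean replay(boolean acquirerWins) {
        // Deflater, step 1: announce the attempt, _owner NULL -> D.
        if (!owner.compareAndSet(null, D)) return false;

        // Interleaved acquirer: takes the lock without contention, D -> T.
        // It sees no contention, so it never increments _count.
        if (acquirerWins) owner.compareAndSet(D, "T");

        // Deflater, step 2: displace _count, 0 -> -2^31 + 1.
        if (!count.compareAndSet(0, DISPLACED)) return false;

        // Deflater, step 3: re-check _owner; back out if the lock was taken.
        if (owner.get() != D) {
            count.addAndGet(Integer.MAX_VALUE); // restore _count
            return false;
        }
        return true;
    }
}
```

In the losing interleaving the monitor ends with _owner == T and _count == 0, exactly as if the deflater had never run.
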
> 
> The above patch works only when not using thread-local monitor in-use lists, i.e., the global monitor arrays are used. However, we don't think it will be difficult to 
> extend the scheme to monitor in-use lists:
> 
> 
> - One could move the in-use lists from the Java threads to a global list inside safepoints and let the async deflater thread use the global list as the source of candidates for
> deflation. That way in-use lists are still maintained thread-locally outside STW pauses, and lists of monitors in use are still maintained, reducing the cost of going
> through monitors for root scanning.
> 
> - An alternative to thread-local in-use lists for optimizing root scanning is to pack the oops previously stored in monitors very closely by storing each oop in an array and
> storing an array index in each monitor. Then you only have to traverse the tightly packed array of oops instead of all the monitors. The allocated monitors would have to be
> extremely underutilized for traversing the monitor in-use lists to be faster than traversing the tightly packed array.
> 
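
The packed-array alternative in the second bullet could look roughly like the Java sketch below. All names (`PackedOopsSketch`, `inflate`, `deflate`, `scanRoots`) are hypothetical, and the swap-with-last removal used to keep the array dense is an assumption; the proposal does not say how slots would be reclaimed. Concurrency is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical dense side table: monitors hold an index, not the oop itself.
final class PackedOopsSketch {
    static final class Monitor { int oopIndex = -1; }

    private final List<Object> oops = new ArrayList<>();    // tightly packed roots
    private final List<Monitor> owners = new ArrayList<>(); // monitor owning each slot

    // Inflation appends the object reference and records its slot in the monitor.
    void inflate(Monitor m, Object obj) {
        m.oopIndex = oops.size();
        oops.add(obj);
        owners.add(m);
    }

    // Deflation moves the last slot into the freed one so the array stays dense.
    void deflate(Monitor m) {
        int i = m.oopIndex;
        int last = oops.size() - 1;
        oops.set(i, oops.get(last));
        owners.set(i, owners.get(last));
        owners.get(i).oopIndex = i;
        oops.remove(last);
        owners.remove(last);
        m.oopIndex = -1;
    }

    // Root scanning walks only the packed array, never the monitors.
    void scanRoots(Consumer<Object> visitor) { oops.forEach(visitor); }

    int size() { return oops.size(); }
}
```

The cost of a root scan then depends on the number of inflated monitors rather than on the number of allocated ones.
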
> 
> Discussions:
> 
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2017-May/023337.html
> 
> 
> Relevant bug entry:
> 
> https://bugs.openjdk.java.net/browse/JDK-8153224
> 
> Alternatives
> 
> ------------
> 
> 
> There are several alternatives for this proposal.
> 
> 
> Disable monitor deflation completely. This would leave inflated monitors on internal data structures forever. Those inflated monitor lists are GC roots, and therefore such
> objects would never become unreachable and be collected by the garbage collector. This will eventually result in running out of memory (Java heap or native). Apart from that, we
> would still need to scan all those monitors for marking at a pause, which means it would actually degrade pause times instead of improving them.
> 
> 
> Deflate monitors incrementally during safepoint cleanup. Monitor inflation could still outpace incremental deflation. This would require some adaptive heuristics or similar
> to make sure deflation can keep up. Excessive inflation would still have to be addressed by longer or more frequent pauses, neither of which would change the situation fundamentally.
> Also, as above, it means we would retain monitors in GC roots longer than necessary and potentially degrade pause times instead of improving them.
> 
> 
> Store monitors in the Java heap. This would avoid treating monitors as GC roots in the first place. It would require teaching the GCs to check whether the mark word is a pointer
> into the Java heap, and potentially to copy the monitor object and update the pointer in the mark word. When should monitors be deflated? In such a scheme we propose to
> deflate monitors when a Java thread releases the lock, using a simpler approach than outlined in this proposal (since the thread owns the monitor, it just has to make _count
> very negative to signal that this monitor is being deflated). The safepoint counter could be used to prevent more than one deflation per safepoint. Moving monitors to the
> Java heap would also require the displaced mark word to be moved from the first word in monitors in order to get an object layout compatible with Java objects.
> 
> 
> Testing
> 
> -------
> 
> 
> The code changes a heavily used VM code path, so regular testing covers most of the testing needs. Additional stress tests for object monitor inflation/deflation would need
> to be performed. The change is supposed to be architecture-independent, but minute differences between platforms could introduce subtle bugs. Putting all tests into the
> regular test directories would help to test it on all platforms.
> 
> 
> While we expect the impact on throughput to be little to none, this can be verified on standard benchmarks. The safepoint time improvements need to be
> demonstrated on targeted and standard workloads to justify the change.
> 
> 
> Risks and Assumptions
> 
> ---------------------
> 
> 
> *Correctness.* This proposal touches very sensitive and peculiar synchronization code. The change should be obviously correct and understandable to avoid surprises. If the 
> implementation proves too complicated, then accepting the implementation risk would be ill-advised.
> 
> 
> *Exposure*. It may be the case that concurrent deflation penalizes some applications, or that it works incorrectly in some overlooked corner cases. To mitigate this, we would need
> a command-line option that restores the legacy behavior.
> 
> 
> 
> Dependencies
> 
> ------------
> 
> 
> There are no dependencies for this JEP.
>