JEP Draft: Concurrent Monitor Deflation

Mon Jun 5 17:29:11 UTC 2017

Dear runtime-devs

Below is a proposal for a JEP on deflating idle monitors while the Java
threads are running. I have attached the JEP in standard text format as
well.

Summary

-------

Java monitors are implemented in three different ways in the JVM, and the
JVM automatically switches from one implementation to the next as needed.
Biased locking installs a thread pointer in the Java object; we say that
the object is biased towards a thread, and that thread is the only thread
that may lock the object. Once a second thread attempts to lock a biased
object, biased locking is no longer sufficient. The JVM then switches to
basic locking for that object. Basic locking uses compare-and-swap (CAS)
operations to ensure mutually exclusive access to an object. If a CAS fails
due to contention, i.e., a second thread attempts to lock an object while
another thread already holds the lock of that object, then the basic lock
implementation is no longer sufficient and the JVM switches to a full-blown
monitor. Unlike the basic lock implementation (and biased locking),
monitors require storage in the native heap, and a pointer to the allocated
storage is installed in the Java object. We say the Java monitor gets
inflated.

As the basic locking implementation is preferred over the use of monitors,
the JVM attempts to "deflate" idle monitors in stop-the-world (STW) pauses,
when the Java threads are stopped at safepoints, by traversing all monitors
(or the subset of monitors currently "used", depending on the value of
MonitorInUseLists) and deflating those not in use. Monitor deflation
currently happens in an STW phase, where it is guaranteed that the VM
thread is the only thread accessing the monitors. However, some programs
use many monitors, and this deflation phase can be a source of long STW
pauses. This JEP explores ways to perform the deflation concurrently with
the running Java threads.
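
As a rough illustration of the three implementations, the sketch below
classifies a mark word by its low bits. The bit patterns assume the classic
HotSpot mark word layout (01 unlocked, 101 biased, 00 basic-locked, 10
inflated, 11 marked); the enum and function names are hypothetical and this
is not the JVM's actual code.

```cpp
#include <cstdint>

// Hypothetical names; bit patterns assume the classic HotSpot mark word layout.
enum class LockState { Unlocked, Biased, BasicLocked, Inflated, Marked };

// Classify a mark word by its low tag bits.
LockState classify(std::uintptr_t mark) {
    // Biased locking is recognised by the low three bits (pattern 101).
    if ((mark & 0x7) == 0x5) return LockState::Biased;
    switch (mark & 0x3) {
        case 0x0: return LockState::BasicLocked; // pointer to a lock record on a thread stack
        case 0x1: return LockState::Unlocked;    // plain header (hash code, age)
        case 0x2: return LockState::Inflated;    // pointer to a native-heap monitor
        default:  return LockState::Marked;      // 11: the mark bits
    }
}
```

Deflation, in these terms, replaces an Inflated mark word with the
displaced (Unlocked) mark word that the monitor has been holding.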

Goals

-----

Decrease pause times by performing monitor deflation while the Java threads
are running.

Provide a fully-functional re-implementation of Java object monitor
deflation to improve JVM-induced pauses. It is a goal for this JEP to trim
safepoint cleanup duration, where monitor deflation is currently handled.

Non-Goals

---------

It is not a goal for this JEP to remove or disable the Java object monitor
deflation mechanism. It is not the goal to reimplement any other Java
object monitor handling machinery. It is not the goal to optimize safepoint
cleanups generally.

Success Metrics

---------------

The JEP is considered successful if safepoint cleanup costs are
significantly reduced without ballooning up the outstanding Java object
monitor population in target applications or negatively affecting end to
end throughput.

Motivation

----------

In its current implementation, monitor deflation is performed during every
STW pause, while all Java threads are waiting at a safepoint. We have seen
safepoint cleanup stalls of up to 200ms on monitor-heavy applications.
SPECjbb2015 and Cassandra are known to cause excessive monitor inflation
and deflation. This proposal aims to make monitor deflation happen
concurrently with running Java threads and not depend on the Java threads
being stopped at a safepoint. According to various measurements, this
should result in significant pause time improvements.

Description

-----------

Several improvements to the deflation of monitors have been proposed and
prototyped (e.g., using parallel threads to process monitors faster,
deflating monitors while scanning for GC roots, etc.), but the real
breakthrough would be to remove the need to deflate monitors during a
safepoint at all.

Initial work has been provided by Carsten Varming.

A monitor uses a number of fields to determine the actions needed to
acquire it. A monitor is free if _owner is NULL, and it is owned by a
thread T if _owner is T or if _owner points to a basic lock object on T's
stack. Furthermore, when T fails to immediately acquire the lock, i.e.,
when T observes contention, T atomically increments the _count field
before putting itself on the queue to acquire the monitor when it becomes
free. The _count field is then decremented when T acquires the monitor. As
a thread always increments _count before decrementing it, _count is always
an integer between 0 and the total number of threads.

We propose to allow _owner to be set to a special marker value D
((void*) -1 in the initial implementation), and to extend the range of
_count to -2^31 + 1 through 2^31 - 1. If _owner is D and _count is
negative, then the monitor is marked as deflatable and any thread should
attempt to deflate the monitor by installing the displaced mark word into
the mark word of the Java object associated with the monitor.

A thread trying to acquire the monitor should, after atomically
incrementing _count, check whether _count was negative and then whether
_owner is D. If both checks succeed, in that order, then the monitor is
deflatable and the thread should attempt to deflate the monitor. If either
check fails, then the thread should continue the current locking protocol,
with the slight modification that if _owner is D, it should behave as if
_owner is NULL.

The thread trying to deflate idle monitors attempts to make a monitor
deflatable by atomically installing D in _owner if _owner is NULL, checking
that _waiters is 0, and atomically installing -2^31 + 1 in _count if _count
is 0. It then checks that _owner is still D. If any check fails, it backs
out by adding 2^31 - 1 to _count, provided the earlier displacement of
_count succeeded. There is no need to install NULL back in _owner, as the
other threads treat D and NULL as equivalent. If the monitor is
successfully marked as deflatable, then any thread can safely attempt to
install the displaced mark word back in the Java object associated with the
monitor. After the monitor has been successfully deflated, it will be ready
for recycling after the next STW pause.

Installing the displaced mark word back into the Java object raises a
slight complication, as another thread might try to atomically install a
hash code in the displaced mark word. This can only happen if the hash code
in the displaced mark word is 0. We propose to atomically set the mark bit
in the displaced mark word to signal to other threads that they should not
attempt to install a hash code in the displaced mark word in the monitor.
This new use of the displaced mark word in a monitor is safe because the
monitor is being recycled: from the next STW pause onward, no thread will
care about the values in the recycled monitor until it is once again in
use.
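
The deflating thread's side of this protocol can be sketched as follows.
This is a minimal illustration under stated assumptions, not the prototype's
code: the Monitor struct, the DEFLATER and DISPLACEMENT constants, and
try_mark_deflatable are hypothetical names, and std::atomic stands in for
HotSpot's own atomic primitives.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

// Hypothetical, simplified monitor holding only the fields named in the text.
struct Monitor {
    std::atomic<void*> owner{nullptr};      // _owner
    std::atomic<std::int32_t> count{0};     // _count
    std::atomic<std::int32_t> waiters{0};   // _waiters
};

// Marker D: (void*) -1, as in the initial implementation.
static void* const DEFLATER =
    reinterpret_cast<void*>(static_cast<std::intptr_t>(-1));

// 2^31 - 1; _count is displaced from 0 to -(2^31 - 1) = -2^31 + 1.
constexpr std::int32_t DISPLACEMENT =
    std::numeric_limits<std::int32_t>::max();

// Returns true if the monitor is now marked deflatable.
bool try_mark_deflatable(Monitor& m) {
    // Step 1: install D in _owner, but only if the monitor is free.
    void* expected_owner = nullptr;
    if (!m.owner.compare_exchange_strong(expected_owner, DEFLATER)) return false;

    // Step 2: no thread may be waiting. D stays behind on failure, which is
    // harmless because other threads treat D like NULL.
    if (m.waiters.load() != 0) return false;

    // Step 3: displace _count from 0 to a very negative value.
    std::int32_t expected_count = 0;
    if (!m.count.compare_exchange_strong(expected_count, -DISPLACEMENT)) return false;

    // Step 4: confirm that nobody acquired the monitor in the meantime.
    if (m.owner.load() != DEFLATER) {
        m.count.fetch_add(DISPLACEMENT);    // back out the displacement
        return false;
    }
    return true;
}
```

Only the displacement of _count needs to be undone on failure; the sketch
never writes NULL back into _owner, matching the observation above that D
and NULL are equivalent to every other thread.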

A prototype can be found at
http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/

To avoid endless inflation / deflation cycles in the prototype, monitor
deflation is only attempted the second time a monitor is seen by the thread
marking monitors deflatable. If that thread (the only thread marking
monitors as deflatable; it might be the service thread, some GC-related
thread, or even a dedicated thread) sees a monitor in state New, then it
marks the monitor as Old and moves on. So there is little interaction
between a thread inflating a lock to a monitor and the deflating thread:
the inflating thread just has to make sure the monitor is marked New and
that this marker is published using appropriate barriers.
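
The New/Old aging step could look roughly like the sketch below. The type
and function names are hypothetical, and the acquire/release orderings are
only one plausible reading of "appropriate barriers".

```cpp
#include <atomic>

// Hypothetical age marker kept alongside each monitor.
enum class Age { New, Old };

struct AgedMonitor {
    std::atomic<Age> age{Age::New};
};

// Called by the inflating thread: mark the monitor New and publish the marker.
void on_inflate(AgedMonitor& m) {
    m.age.store(Age::New, std::memory_order_release);
}

// Called by the single deflating thread on each pass over the monitors.
// First sighting: flip New to Old and skip. Later sightings: candidate.
bool should_attempt_deflation(AgedMonitor& m) {
    Age expected = Age::New;
    if (m.age.compare_exchange_strong(expected, Age::Old,
                                      std::memory_order_acq_rel)) {
        return false;   // seen for the first time, only aged
    }
    return true;        // already Old, deflation may be attempted
}
```

Using a CAS for the New-to-Old transition is a choice made for this sketch;
it keeps a concurrent re-marking as New by an inflating thread from being
lost.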

An interesting race is between a thread acquiring the monitor and the
deflating thread trying to deflate the monitor. If the monitor is free
(_owner == NULL), then the deflating thread attempts to install D in
_owner to let everyone else know that it will attempt to deflate the
monitor. This causes other threads to use the slow path on this monitor,
but otherwise general threads do not distinguish between _owner == D and
_owner == NULL. The D marker is used to ensure that no other thread has
acquired the monitor while the deflater reads _waiters and displaces
_count by -2^31 + 1. If the deflater thread manages to install D in _owner,
read _waiters == 0, and make _count very negative if _count == 0, and
_owner is still D, then _waiters must still be 0, as other threads have to
acquire the monitor to increase _waiters, and _count is still negative
(_count was displaced by a very large amount). That is the signal to other
threads that the monitor is deflatable. The actual deflation (installing
the displaced mark word in the Java object) is an idempotent operation, and
both the deflater thread and general threads will attempt to complete the
deflation. If any of the conditions mentioned above is not met, then the
monitor is in use. In that case we restore _count if needed. If the
deflating thread managed to install D in _owner but failed to make _count
negative, then the next thread writing to _owner (acquiring the lock)
erases the D installed by the deflater thread.
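
From the point of view of a general thread, the race can be sketched as
below. Again this is an illustration under assumptions: Monitor, DEFLATER,
EnterAction, on_contended_enter and try_acquire are hypothetical names, and
what a thread does with _count after helping to deflate is left out because
the text does not specify it.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical, simplified monitor holding only the fields used here.
struct Monitor {
    std::atomic<void*> owner{nullptr};      // _owner
    std::atomic<std::int32_t> count{0};     // _count
};

// Marker D: (void*) -1, as in the initial implementation.
static void* const DEFLATER =
    reinterpret_cast<void*>(static_cast<std::intptr_t>(-1));

enum class EnterAction { HelpDeflate, ContinueLocking };

// Slow-path entry: bump _count, then test the old _count and _owner, in that order.
EnterAction on_contended_enter(Monitor& m) {
    std::int32_t previous = m.count.fetch_add(1);
    if (previous < 0 && m.owner.load() == DEFLATER) {
        return EnterAction::HelpDeflate;    // monitor is deflatable
    }
    return EnterAction::ContinueLocking;    // normal protocol applies
}

// In the normal protocol, D in _owner is treated exactly like NULL.
// A successful CAS also erases any stale D left by the deflater.
bool try_acquire(Monitor& m, void* self) {
    void* seen = m.owner.load();
    if (seen != nullptr && seen != DEFLATER) return false;  // really owned
    return m.owner.compare_exchange_strong(seen, self);
}
```

Because the displacement is so large, a handful of concurrent increments
cannot bring a displaced _count back to zero or above, which is why the
"was negative" test stays reliable while the monitor is deflatable.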

The above patch works only when not using thread-local monitor in-use
lists, i.e., the global monitor arrays are used. However, we don't think it
will be difficult to extend the scheme to monitor in-use lists:

- One could move the in-use lists from the Java threads to a global list
inside safepoints and let the async deflater thread use the global list as
potential candidates for deflation. That way in-use lists are still
maintained thread-locally outside STW pauses, and lists of monitors in use
are still maintained, reducing the cost of going through monitors for root
scanning.

- An alternative to thread-local in-use lists to optimize root scanning is
to pack the oops previously stored in monitors very closely by storing each
oop in an array and storing an array index in each monitor. Then you only
have to traverse the tightly packed array of oops instead of all the
monitors. The allocated monitors would have to be extremely underutilized
for traversing the monitor in-use lists to be faster than traversing the
tightly packed array.
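
The second option can be pictured with the following sketch. All names
(Oop, IndexedMonitor, OopTable) are hypothetical, a plain pointer stands in
for a Java object reference, and slot recycling on deflation is left out.

```cpp
#include <cstddef>
#include <vector>

using Oop = void*;  // stand-in for a Java object reference

// The monitor no longer holds the oop itself, only its slot in the table.
struct IndexedMonitor {
    std::size_t oop_index;
};

class OopTable {
public:
    // Store the object reference densely and hand the slot index to the monitor.
    IndexedMonitor attach(Oop obj) {
        oops_.push_back(obj);
        return IndexedMonitor{oops_.size() - 1};
    }

    // Look up the object associated with a monitor.
    Oop object_of(const IndexedMonitor& m) const { return oops_[m.oop_index]; }

    // Root scanning walks only the packed array, never the monitors.
    template <typename Visitor>
    void visit_roots(Visitor visit) {
        for (Oop& o : oops_) visit(o);
    }

private:
    std::vector<Oop> oops_;
};
```

The visitor receives each slot by reference so that a moving collector
could update the stored reference in place.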

Discussions:

http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2017-May/023337.html

Relevant bug entry:

https://bugs.openjdk.java.net/browse/JDK-8153224

Alternatives

------------

There are several alternatives for this proposal.

Disable monitor deflation completely. This would leave inflated monitors on
internal data structures forever. Those inflated monitor lists are GC
roots, and therefore the associated objects would never become unreachable
and be collected by the garbage collector. This will eventually result in
running out of memory (Java heap or native). Apart from that, we would
still need to scan all those monitors for marking at a pause, which means
it would actually degrade pause times instead of improving them.

Deflate monitors incrementally during safepoint cleanup. Monitor inflation
could still outpace incremental deflation. This would require some adaptive
heuristics or similar to make sure deflation can keep up. Excessive
inflation would still have to be addressed by longer or more frequent
pauses, neither of which would change the situation fundamentally. Also, as
above, it means we would retain monitors in GC roots longer than necessary
and potentially degrade pause times instead of improving them.

Store monitors in the Java heap. This would avoid treating monitors as GC
roots in the first place. It would require teaching the GCs to check
whether the mark word is a pointer into the Java heap, and potentially to
copy the monitor object and update the pointer in the mark word. When
should monitors be deflated? In such a scheme we propose to deflate
monitors when a Java thread releases the lock, using a simpler approach
than outlined in this proposal (since the thread owns the monitor, it just
has to make _count very negative to signal that this monitor is being
deflated). The safepoint counter could be used to prevent more than one
deflation per safepoint. Moving monitors to the Java heap would also
require the displaced mark word to be moved from the first word in monitors
in order to get an object layout compatible with Java objects.

Testing

-------

The change touches a heavily used VM code path, so regular testing already
exercises it. Additional stress tests for object monitor
inflation/deflation would need to be performed. The change is supposed to
be architecture-independent, but minute differences between platforms could
introduce subtle bugs. Putting all tests into the regular test directories
would help to test it on all platforms.

While we expect that throughput performance impact will be little to none,
this can be verified on standard benchmarks. The safepoint time
improvements need to be demonstrated on targeted and standard workloads to
justify the performance improvements.

Risks and Assumptions

---------------------

*Correctness.* This proposal touches very sensitive and peculiar
synchronization code. The change should be obviously correct and
understandable to avoid surprises. If the implementation proves too
complicated, then accepting the implementation risk would be ill-advised.

*Exposure.* It may be the case that concurrent deflation penalizes some
applications, or that it works incorrectly in some overlooked corner cases.
To mitigate this, we would need a command-line option that restores the
legacy behavior.

Dependencies

------------

There are no dependencies for this JEP.