RFR(XL): 8203469: Faster safepoints
Robbin Ehn
robbin.ehn at oracle.com
Tue Jan 22 15:39:26 UTC 2019
Hi all, here is v01 and v02.
v01 contains update after comments from list:
http://cr.openjdk.java.net/~rehn/8203469/v01/
http://cr.openjdk.java.net/~rehn/8203469/v01/inc/
v02 contains a bug fix, explained below:
http://cr.openjdk.java.net/~rehn/8203469/v02/
http://cr.openjdk.java.net/~rehn/8203469/v02/inc/
Patricio had some good questions about try_stable_load_state.
In previous internal versions I have done the stable load by loading thread
state before and after safepoint id. For some reason I changed during a
refactoring to the reverse, which is incorrect. Consider the following:
JavaThread: state / safepoint id / poll |VMThread: global state / safepoint
counter / WaitBarrier
########################################|################################
_thread_in_native / 0 / disarmed | _not_synchronized / 0 / disarmed
| _not_synchronized / 0 / armed(1)
| _not_synchronized / 1 / armed(1)
| _synchronizing / 1 / armed(1)
_thread_in_native / 0 / armed |
| <LOAD JavaThread safepoint id:0>
| <LOAD JavaThread thread state
id:_thread_in_native>
| <LOAD JavaThread safepoint id:0>
| _synchonized / 1 / armed(1)
<JavaThread transistion to VM> |
_thread_in_native_trans / 0 / armed |
<LOAD safepoint counter(1)> |
<JavaThread goes off-proc> |
| _not_synchonized / 1 / armed(1)
| _not_synchonized / 2 / armed(1)
_thread_in_native_trans / 0 / disarmed |
| _not_synchonized / 2 / disarmed
Next safepoint starts:
| _not_synchronized / 2 / armed(3)
| _not_synchronized / 3 / armed(3)
| _synchronizing / 3 / armed(3)
_thread_in_native_trans / 0 / armed |
| <LOAD JavaThread safepoint id:0>
<JavaThread goes on-proc> |
<STORE loaded safepoint counter(1)> |
_thread_in_native_trans / 1 / armed |
_thread_blocked / 1 / armed |
<WaitBarrier not armed for 1> |
| <LOAD JavaThread thread state
id:_thread_blocked>
_thread_in_native_trans / 1 / armed |
_thread_in_native_trans / 0 / armed |
| <LOAD JavaThread safepoint id:0>
A false positive is read.
When do it the correct the safe matrix looks like:
State load 1 | Safepoint id | State load 2 | Result
##################|##############|##################|#######
any | !0/current | any | treat all as unsafe
any | any | !state1 | treat all as unsafe
any | 0/current | state1 | suspend flag is safe
thread_in_native | 0/current | thread_in_native | safe
thread_in_blocked | 0/current | thread_in_blocked| safe
!thread_in_blocked
&&
!thread_in_native | 0/current | state1 | unsafe
The case with blocked/0/blocked I added this comment for:
755 // To handle the thread_blocked state on the backedge of the WaitBarrier
from
756 // previous safepoint and reading the resetted
(0/InactiveSafepointCounter) we
757 // re-read state after we read thread safepoint id. The JavaThread
changes it
758 // state before resetting, the second read will either see a different
thread
759 // state making this an unsafe state or it can see blocked again.
760 // When we see blocked twice with a 0 safepoint id, either:
761 // - It is normally blocked, e.g. on Mutex, TBIVM.
762 // - It was in SS:block(), looped around to SS:block() and is blocked on
the WaitBarrier.
763 // - It was in SS:block() but now on a Mutex.
764 // Either case safe.
I hope above explains why loading state before and after safepoint id is
sufficient.
Passes, with flying colors, t1-5, stress test, KS 24h stress.
Thanks, Robbin
On 1/15/19 11:39 AM, Robbin Ehn wrote:
> Hi all, please review.
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>
> Thanks to Dan for pre-reviewing a lot!
>
> Background:
> ZGC often does very short safepoint operations. For a perspective, in a
> specJBB2015 run, G1 can have young collection stops lasting about 170 ms. While
> in the same setup ZGC does 0.2ms to 1.5 ms operations depending on which
> operation it is. The time it takes to stop and start the JavaThreads is relative
> very large to a ZGC safepoint. With an operation that just takes 0.2ms the
> overhead of stopping and starting JavaThreads is several times the operation.
>
> High-level functionality change:
> Serializing the starting over Threads_lock takes time.
> - Don't wait on Threads_lock use the WaitBarrier.
> Serializing the stopping over Safepoint_lock takes time.
> - Let threads stop in parallel, remove Safepoint_lock.
>
> Details:
> JavaThreads have 2 abstract logical states: unsafe or safe.
> - Safe means the JavaThread will not touch Java heap or VM internal structures
> without doing a transition and block before doing so.
> - The safe states are:
> - When polls armed: _thread_in_native and _thread_blocked.
> - When Threads_lock is held: externally suspended flag is set.
> - VM Thread have polls armed and holds the Threads_lock during a
> safepoint.
> - Unsafe means that either Java heap or VM internal structures can be accessed
> by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
> - All combination that are not safe are unsafe.
>
> We cannot start a safepoint until all unsafe threads have transitioned to a safe
> state. To make them safe, we arm polls in compiled code and make sure any
> transition to another unsafe state will be blocked. JavaThreads which are unsafe
> with state _thread_in_Java may transition to _thread_in_native without being
> blocked, since it just became a safe thread and we can proceed. Any safe thread
> may try to transition at any time to an unsafe state, thus coming into the
> safepoint blocking code at any moment, e.g., after the safepoint is over, or
> even at the beginning of next safepoint.
>
> The VMThread cannot tolerate false positives from the JavaThread thread state
> because that would mean starting the safepoint without all JavaThreads being
> safe. The two locks (Threads_lock and Safepoint_lock) make sure we never observe
> false positives from the safepoint blocking code, if we remove them, how do we
> handle false positives?
>
> By first publishing which barrier tag (safepoint counter) we will call
> WaitBarrier.wait() with as the threads safepoint id and then change the state to
> _thread_blocked, the VMThread can ignore JavaThreads by doing a stable load of
> the state. A stable load of the thread state is successful if the thread
> safepoint id is the same both before and after the load of the state and
> safepoint id is current or InactiveSafepointCounter. If the stable load fails,
> the thread is considered safepoint unsafe. It's no longer enough that thread is
> have state _thread_blocked it must also have correct safepoint id before and
> after we read the state.
>
> Performance:
> The result of faster safepoints is that the average CPU time for JavaThreads
> between safepoints is higher, thus increasing the allocation rate. The thread
> that stops first waits shorter time until it gets started. Even the thread that
> stops last also have shorter stop since we start them faster. If your
> application is using a concurrent GC it may need re-tunning since each java
> worker thread have an increased CPU time/allocation rate. Often this means max
> performance is achieved using slightly less java worker threads than before.
> Also the increase allocation rate means shorter time between GC safepoints.
> - If you are using a non-concurrent GC, you should see improved latency and
> throughput.
> - After re-tunning with a concurrent GC throughput should be equal or better but
> with better latency. But bear in mind this is a latency patch, not a
> throughput one.
> With current code a java thread is not to guarantee to run between safepoint (in
> theory a java thread can be starved indefinitely), since the VM thread may
> re-grab the Threads_locks before it woke up from previous safepoint. If the
> GC/VM don't respect MMU (minimum mutator utilization) or if your machine is very
> over-provisioned this can happen.
> The current schema thus re-safepoint quickly if the java threads have not
> started yet at the cost of latency. Since the new code uses the WaitBarrier with
> the safepoint counter, all threads must roll forward to next safepoint by
> getting at least some CPU time between two safepoints. Meaning MMU violations
> are more obvious.
>
> Some examples on numbers:
> - On a 16 strand machine synchronization and un-synchronization/starting is at
> least 3x faster (in non-trivial test). Synchronization ~600 -> ~100us and
> starting ~400->~100us.
> (Semaphore path is a bit slower than futex in the WaitBarrier on Linux).
> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
> synchronization time on 16 strands and ~5% score increase. In this case the GC
> op is 1ms, so we reduce the overhead of synchronization from 100% to 10%.
> - specJBB2015 ParGC ~9% increase in critical-jops.
>
> Thanks, Robbin
More information about the hotspot-dev
mailing list