RFR(XL): 8203469: Faster safepoints
Patricio Chilano
patricio.chilano.mateo at oracle.com
Fri Feb 8 15:58:47 UTC 2019
Hi Robbin,
Version v06_2 looks good to me. One minor comment:
--- a/src/hotspot/share/runtime/interfaceSupport.inline.hpp
+++ b/src/hotspot/share/runtime/interfaceSupport.inline.hpp
@@ -314,9 +314,6 @@
// Once we are blocked vm expects stack to be walkable
thread->frame_anchor()->make_walkable(thread);
- thread->set_thread_state((JavaThreadState)(_thread_in_vm + 1));
- InterfaceSupport::serialize_thread_state_with_handler(thread);
-
thread->set_thread_state(_thread_blocked);
Since we are not calling SS::block() anymore in the TBIVMWDC constructor,
we can remove setting the thread state to the temporary
_thread_in_vm_trans and also the fence after that.
Thanks,
Patricio
On 2/7/19 11:05 AM, Robbin Ehn wrote:
> Hi all, here is the promised re-base (v06) on
> 8210832: Remove sneaky locking in class Monitor.
>
> v06_1 is just a straight re-base.
>
> Full:
> http://cr.openjdk.java.net/~rehn/8203469/v06_1/full/
> Inc:
> http://cr.openjdk.java.net/~rehn/8203469/v06_1/inc/
>
> Passes stress test and t1-5.
>
> But there is a 'better' way.
> Before I added the more graceful "_vm_wait->wait();" semaphore in the
> while (_waiting_to_block > 0) { loop, it was just a busy spin using the
> same back-off as the rolling-forward loop. It turns out that we almost
> never spin here at all: by the time all Java threads are stopped, the
> callbacks are usually already done. So the addition of the semaphore has
> no impact on our benchmarks and it is mostly unused. This is because
> most threads are in Java, which we must spin-wait for anyway, since
> they can elide into native without doing a callback. My proposed
> re-base removes the callbacks completely and lets the VM thread do all
> thread accounting. All that a stopping thread needs to do is write its
> state and safepoint id; everything else is handled by the VM thread. We
> trade 2 atomics + a local store per thread against 2 stores per thread
> done by the VM thread. This makes it possible for a thread in VM to
> transition into blocked WITHOUT a safepoint poll: just set
> _thread_blocked and promise to do a safepoint poll when leaving that
> state.
>
> v06_2
> Full:
> http://cr.openjdk.java.net/~rehn/8203469/v06_2/full/
> Inc against v05:
> http://cr.openjdk.java.net/~rehn/8203469/v06_2/inc/
> Inc against v06_1:
> http://cr.openjdk.java.net/~rehn/8203469/v06_2/rebase_inc/
>
> v06_2 simplifies and removes ~200 LOC with the same performance.
> If there is a case where a thread in VM takes a long time, that already
> hurts latency and should be fixed regardless of v06_1 vs v06_2. So I
> see no reason why we should not push v06_2.
>
> Passes stress test and t1-5.
>
> Thanks, Robbin
>
>
> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>> Hi all, please review.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>
>> Thanks to Dan for pre-reviewing a lot!
>>
>> Background:
>> ZGC often does very short safepoint operations. For perspective, in a
>> specJBB2015 run, G1 can have young collection stops lasting about 170
>> ms, while in the same setup ZGC does 0.2 ms to 1.5 ms operations,
>> depending on which operation it is. The time it takes to stop and
>> start the JavaThreads is very large relative to a ZGC safepoint. With
>> an operation that takes just 0.2 ms, the overhead of stopping and
>> starting the JavaThreads is several times the operation itself.
>>
>> High-level functionality change:
>> Serializing the starting over the Threads_lock takes time.
>> - Don't wait on the Threads_lock; use the WaitBarrier.
>> Serializing the stopping over the Safepoint_lock takes time.
>> - Let threads stop in parallel; remove the Safepoint_lock.
>>
>> Details:
>> JavaThreads have 2 abstract logical states: unsafe or safe.
>> - Safe means the JavaThread will not touch the Java heap or VM
>>   internal structures without doing a transition and blocking before
>>   doing so.
>>   - The safe states are:
>>     - When polls are armed: _thread_in_native and _thread_blocked.
>>     - When the Threads_lock is held: the externally suspended flag is
>>       set.
>>   - The VM thread has polls armed and holds the Threads_lock during a
>>     safepoint.
>> - Unsafe means that either the Java heap or VM internal structures can
>>   be accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>   - All combinations that are not safe are unsafe.
>>
>> We cannot start a safepoint until all unsafe threads have transitioned
>> to a safe state. To make them safe, we arm polls in compiled code and
>> make sure any transition to another unsafe state will be blocked.
>> JavaThreads which are unsafe with state _thread_in_Java may transition
>> to _thread_in_native without being blocked, since the thread just
>> became safe and we can proceed. Any safe thread may try to transition
>> at any time to an unsafe state, thus coming into the safepoint-blocking
>> code at any moment, e.g., after the safepoint is over, or even at the
>> beginning of the next safepoint.
>>
>> The VMThread cannot tolerate false positives from the JavaThread
>> state, because that would mean starting the safepoint without all
>> JavaThreads being safe. The two locks (Threads_lock and
>> Safepoint_lock) make sure we never observe false positives from the
>> safepoint-blocking code; if we remove them, how do we handle false
>> positives?
>>
>> By first publishing which barrier tag (safepoint counter) the thread
>> will call WaitBarrier.wait() with, as the thread's safepoint id, and
>> then changing the state to _thread_blocked, the VMThread can ignore a
>> JavaThread after doing a stable load of its state. A stable load of
>> the thread state is successful if the thread's safepoint id is the
>> same both before and after the load of the state, and the safepoint id
>> is the current one or InactiveSafepointCounter. If the stable load
>> fails, the thread is considered safepoint unsafe. It is no longer
>> enough that a thread has state _thread_blocked; it must also have the
>> correct safepoint id before and after we read the state.
>>
>> Performance:
>> The result of faster safepoints is that the average CPU time for
>> JavaThreads between safepoints is higher, thus increasing the
>> allocation rate. The thread that stops first waits a shorter time
>> until it gets started. Even the thread that stops last has a shorter
>> stop, since we start them faster. If your application is using a
>> concurrent GC it may need re-tuning, since each Java worker thread has
>> an increased CPU time/allocation rate. Often this means max
>> performance is achieved using slightly fewer Java worker threads than
>> before. Also, the increased allocation rate means a shorter time
>> between GC safepoints.
>> - If you are using a non-concurrent GC, you should see improved
>> latency and throughput.
>> - After re-tuning with a concurrent GC, throughput should be equal or
>> better, but with better latency. But bear in mind this is a latency
>> patch, not a throughput one.
>> With the current code a Java thread is not guaranteed to run between
>> safepoints (in theory a Java thread can be starved indefinitely),
>> since the VM thread may re-grab the Threads_lock before the Java
>> thread has woken up from the previous safepoint. If the GC/VM does not
>> respect MMU (minimum mutator utilization), or if your machine is very
>> over-provisioned, this can happen. The current scheme thus
>> re-safepoints quickly if the Java threads have not started yet, at the
>> cost of latency. Since the new code uses the WaitBarrier with the
>> safepoint counter, all threads must roll forward to the next safepoint
>> by getting at least some CPU time between two safepoints, meaning MMU
>> violations are more obvious.
>>
>> Some example numbers:
>> - On a 16-strand machine, synchronization and
>> un-synchronization/starting is at least 3x faster (in a non-trivial
>> test): synchronization ~600 -> ~100 us and starting ~400 -> ~100 us.
>> (The semaphore path is a bit slower than futex in the WaitBarrier on
>> Linux.)
>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>> synchronization time on 16 strands and a ~5% score increase. In this
>> case the GC op is 1 ms, so we reduce the overhead of synchronization
>> from 100% to 10%.
>> - specJBB2015 ParGC: ~9% increase in critical-jops.
>>
>> Thanks, Robbin