RFR(XL): 8203469: Faster safepoints
Patricio Chilano
patricio.chilano.mateo at oracle.com
Fri Feb 8 16:46:55 UTC 2019
Correction about the fence: I think we actually need it, to prevent the
change of state to _thread_blocked from floating above
frame_anchor()->make_walkable. So it should be:
--- a/src/hotspot/share/runtime/interfaceSupport.inline.hpp
+++ b/src/hotspot/share/runtime/interfaceSupport.inline.hpp
@@ -314,8 +314,7 @@
// Once we are blocked vm expects stack to be walkable
thread->frame_anchor()->make_walkable(thread);
- thread->set_thread_state((JavaThreadState)(_thread_in_vm + 1));
- InterfaceSupport::serialize_thread_state_with_handler(thread);
+ OrderAccess::storestore();
thread->set_thread_state(_thread_blocked);
Otherwise, if we keep
"InterfaceSupport::serialize_thread_state_with_handler(thread);", maybe
we should also change the comment "// Make sure new state is seen by VM
thread".
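
To illustrate the ordering requirement, here is a minimal standalone sketch
in portable C++. It is not HotSpot code: ToyThread, its fields and the
observer function are made-up stand-ins, and std::atomic_thread_fence with
memory_order_release is assumed here as a portable substitute for
OrderAccess::storestore() (a release fence is at least as strong as a
StoreStore barrier).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for the HotSpot types; not the real declarations.
enum ThreadState { in_vm, blocked };

struct ToyThread {
  std::atomic<bool> walkable{false};      // models "frame anchor is walkable"
  std::atomic<ThreadState> state{in_vm};  // models the JavaThread state field

  // Mirrors the proposed sequence: make_walkable, storestore, set state.
  void transition_to_blocked() {
    assert(state.load(std::memory_order_relaxed) == in_vm);
    walkable.store(true, std::memory_order_relaxed);
    // Release fence: the walkable store above cannot be reordered after
    // the state store below.
    std::atomic_thread_fence(std::memory_order_release);
    state.store(blocked, std::memory_order_relaxed);
  }
};

// Observer side (the VM thread in the real code): once it reads `blocked`,
// the acquire fence guarantees it also sees the walkable stack.
inline bool observer_sees_walkable_if_blocked(const ToyThread& t) {
  if (t.state.load(std::memory_order_relaxed) != blocked) {
    return true;  // thread not blocked yet, nothing to check
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return t.walkable.load(std::memory_order_relaxed);
}
```

Without the fence between the two relaxed stores, an observer could in
principle see `blocked` while the stack is not yet marked walkable, which is
the reordering the correction above is meant to rule out.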
Thanks,
Patricio
On 2/8/19 10:58 AM, Patricio Chilano wrote:
> Hi Robbin,
>
> Version v06_2 looks good to me. One minor comment:
>
> --- a/src/hotspot/share/runtime/interfaceSupport.inline.hpp
> +++ b/src/hotspot/share/runtime/interfaceSupport.inline.hpp
> @@ -314,9 +314,6 @@
> // Once we are blocked vm expects stack to be walkable
> thread->frame_anchor()->make_walkable(thread);
>
> - thread->set_thread_state((JavaThreadState)(_thread_in_vm + 1));
> - InterfaceSupport::serialize_thread_state_with_handler(thread);
> -
> thread->set_thread_state(_thread_blocked);
>
> Since we are not calling SS::block() anymore in the TBIVMWDC
> constructor we can remove setting the thread state to the temporary
> _thread_in_vm_trans and also the fence after that.
>
>
> Thanks,
> Patricio
>
> On 2/7/19 11:05 AM, Robbin Ehn wrote:
>> Hi all, here is the promised re-base (v06) on
>> 8210832: Remove sneaky locking in class Monitor.
>>
>> v06_1 is just a straight re-base.
>>
>> Full:
>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/full/
>> Inc:
>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/inc/
>>
>> Passes stress test and t1-5.
>>
>> But there is a 'better' way.
>> Before I added the more graceful "_vm_wait->wait();" semaphore in the
>> while (_waiting_to_block > 0) { loop, it was just a busy spin using the
>> same back-off as the rolling forward loop. It turns out that we almost
>> never spin here at all: when all Java threads are stopped, the callbacks
>> are often already done. So the addition of the semaphore has no impact on
>> our benchmarks and is mostly unused. This is because most threads are in
>> Java, which we need to spin-wait on since they can transition into native
>> without doing a callback. My proposed re-base removes the callbacks
>> completely and lets the vm thread do all thread accounting. All that the
>> stopping threads need to do is write state and safepoint id; everything
>> else is handled by the vm thread. We trade 2 atomics + a local store per
>> thread against 2 stores per thread done by the vm thread.
>> This makes it possible for a thread in vm to transition into blocked
>> WITHOUT a safepoint poll: just set _thread_blocked and promise to do a
>> safepoint poll when leaving that state.
>>
>> v06_2
>> Full:
>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/full/
>> Inc against v05:
>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/inc/
>> Inc against v06_1:
>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/rebase_inc/
>>
>> v06_2 simplifies and removes ~200 LOC with the same performance.
>> If there is a case with a thread in vm taking a long time, it will already
>> hurt latency and thus should be fixed regardless of v06_1 vs v06_2. So I
>> see no reason why we should not push v06_2.
>>
>> Passes stress test and t1-5.
>>
>> Thanks, Robbin
>>
>>
>> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>>> Hi all, please review.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>>
>>> Thanks to Dan for pre-reviewing a lot!
>>>
>>> Background:
>>> ZGC often does very short safepoint operations. For perspective, in a
>>> SPECjbb2015 run, G1 can have young collection stops lasting about 170 ms,
>>> while in the same setup ZGC does 0.2 ms to 1.5 ms operations, depending on
>>> which operation it is. The time it takes to stop and start the JavaThreads
>>> is very large relative to a ZGC safepoint. With an operation that takes
>>> just 0.2 ms, the overhead of stopping and starting JavaThreads is several
>>> times the operation.
>>>
>>> High-level functionality change:
>>> Serializing the starting over Threads_lock takes time.
>>> - Don't wait on Threads_lock; use the WaitBarrier.
>>> Serializing the stopping over Safepoint_lock takes time.
>>> - Let threads stop in parallel, remove Safepoint_lock.
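
For readers unfamiliar with the idea, a tagged wait barrier can be sketched
roughly as follows. This is a toy model built on std::mutex and
std::condition_variable, not HotSpot's WaitBarrier: the class name
ToyWaitBarrier, the reserved tag 0 and the demo function are assumptions made
for illustration, and this version still serializes wakeups on its internal
mutex, which a futex-based implementation would avoid.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Toy tagged barrier (hypothetical; not HotSpot's WaitBarrier).
// Tag 0 is reserved to mean "disarmed", so callers must use non-zero tags.
class ToyWaitBarrier {
  std::mutex mu_;
  std::condition_variable cv_;
  int armed_tag_ = 0;

 public:
  // Arm the barrier: waiters presenting this tag block until disarm().
  void arm(int tag) {
    assert(tag != 0);
    std::lock_guard<std::mutex> guard(mu_);
    armed_tag_ = tag;
  }

  // Disarm the barrier and notify all waiters.
  void disarm() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      armed_tag_ = 0;
    }
    cv_.notify_all();
  }

  // Block only while the barrier is armed with exactly this tag;
  // a stale tag returns immediately.
  void wait(int tag) {
    assert(tag != 0);
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return armed_tag_ != tag; });
  }
};

// Demo: threads wait on tag 1 and are released by a single disarm().
// Threads that arrive after disarm() pass straight through.
inline int release_all_demo(int n_threads) {
  ToyWaitBarrier barrier;
  std::atomic<int> resumed{0};
  barrier.arm(1);
  std::vector<std::thread> threads;
  for (int i = 0; i < n_threads; ++i) {
    threads.emplace_back([&] {
      barrier.wait(1);
      resumed.fetch_add(1);
    });
  }
  barrier.disarm();
  for (auto& t : threads) t.join();
  return resumed.load();
}
```

The point of the tag is that a waiter never blocks on a barrier armed for a
different safepoint, so the waker does not need to hold a lock across the
whole stop/start cycle.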
>>>
>>> Details:
>>> JavaThreads have 2 abstract logical states: unsafe or safe.
>>> - Safe means the JavaThread will not touch the Java heap or VM internal
>>>   structures without doing a transition and blocking before doing so.
>>>   - The safe states are:
>>>     - When polls are armed: _thread_in_native and _thread_blocked.
>>>     - When Threads_lock is held: externally suspended flag is set.
>>>     - The VM Thread has polls armed and holds the Threads_lock during a
>>>       safepoint.
>>> - Unsafe means that either the Java heap or VM internal structures can
>>>   be accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>>   - All combinations that are not safe are unsafe.
>>>
>>> We cannot start a safepoint until all unsafe threads have transitioned to
>>> a safe state. To make them safe, we arm polls in compiled code and make
>>> sure any transition to another unsafe state will be blocked. JavaThreads
>>> which are unsafe with state _thread_in_Java may transition to
>>> _thread_in_native without being blocked, since the thread just became safe
>>> and we can proceed. Any safe thread may try to transition to an unsafe
>>> state at any time, thus coming into the safepoint blocking code at any
>>> moment, e.g., after the safepoint is over, or even at the beginning of the
>>> next safepoint.
>>>
>>> The VMThread cannot tolerate false positives from the JavaThread thread
>>> state, because that would mean starting the safepoint without all
>>> JavaThreads being safe. The two locks (Threads_lock and Safepoint_lock)
>>> make sure we never observe false positives from the safepoint blocking
>>> code. If we remove them, how do we handle false positives?
>>>
>>> By first publishing the barrier tag (safepoint counter) that we will call
>>> WaitBarrier.wait() with as the thread's safepoint id, and only then
>>> changing the state to _thread_blocked, the VMThread can ignore JavaThreads
>>> by doing a stable load of the state. A stable load of the thread state is
>>> successful if the thread's safepoint id is the same both before and after
>>> the load of the state, and the safepoint id is current or
>>> InactiveSafepointCounter. If the stable load fails, the thread is
>>> considered safepoint unsafe. It is no longer enough that a thread has
>>> state _thread_blocked; it must also have the correct safepoint id before
>>> and after we read the state.
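
The stable load described above can be sketched as follows. This is a
simplified standalone model, not the code in the webrev: ToyThread, block_for,
stable_load_state and is_safe are hypothetical names, the value 0 for
InactiveSafepointCounter is an assumption, and default (sequentially
consistent) atomics are used instead of HotSpot's own ordering primitives.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical names and values; only the protocol follows the description.
constexpr uint64_t InactiveSafepointCounter = 0;
enum ThreadState { in_java, in_vm, in_native, blocked };

struct ToyThread {
  std::atomic<uint64_t> safepoint_id{InactiveSafepointCounter};
  std::atomic<ThreadState> state{in_java};

  // Stopping thread: publish the barrier tag first, then the state.
  void block_for(uint64_t tag) {
    assert(tag != InactiveSafepointCounter);
    safepoint_id.store(tag);
    state.store(blocked);
  }
};

// VM-thread side: returns true and fills `out` only if the load was stable,
// i.e. the id did not change around the state load and is current or inactive.
inline bool stable_load_state(const ToyThread& t, uint64_t current,
                              ThreadState& out) {
  const uint64_t before = t.safepoint_id.load();
  const ThreadState s = t.state.load();
  const uint64_t after = t.safepoint_id.load();
  const bool id_ok =
      (before == current) || (before == InactiveSafepointCounter);
  if (before != after || !id_ok) {
    return false;
  }
  out = s;
  return true;
}

// A failed stable load is treated as "unsafe", never as "safe".
inline bool is_safe(const ToyThread& t, uint64_t current) {
  ThreadState s;
  if (!stable_load_state(t, current, s)) {
    return false;
  }
  return s == blocked || s == in_native;
}
```

In this model a thread that blocked for safepoint 5 is reported safe when the
current counter is 5, but not when the VM thread is already working on
safepoint 6 and still sees the stale id.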
>>>
>>> Performance:
>>> The result of faster safepoints is that the average CPU time for
>>> JavaThreads between safepoints is higher, thus increasing the allocation
>>> rate. The thread that stops first waits a shorter time until it gets
>>> started. Even the thread that stops last has a shorter stop, since we
>>> start them faster. If your application is using a concurrent GC it may
>>> need re-tuning, since each Java worker thread has an increased CPU
>>> time/allocation rate. Often this means max performance is achieved using
>>> slightly fewer Java worker threads than before. Also, the increased
>>> allocation rate means a shorter time between GC safepoints.
>>> - If you are using a non-concurrent GC, you should see improved latency
>>>   and throughput.
>>> - After re-tuning with a concurrent GC, throughput should be equal or
>>>   better, but with better latency. But bear in mind this is a latency
>>>   patch, not a throughput one.
>>> With the current code a Java thread is not guaranteed to run between
>>> safepoints (in theory a Java thread can be starved indefinitely), since
>>> the VM thread may re-grab the Threads_lock before the thread has woken up
>>> from the previous safepoint. If the GC/VM doesn't respect MMU (minimum
>>> mutator utilization) or if your machine is very over-provisioned, this
>>> can happen.
>>> The current scheme thus re-safepoints quickly if the Java threads have
>>> not started yet, at the cost of latency. Since the new code uses the
>>> WaitBarrier with the safepoint counter, all threads must roll forward to
>>> the next safepoint by getting at least some CPU time between two
>>> safepoints, meaning MMU violations are more obvious.
>>>
>>> Some example numbers:
>>> - On a 16 strand machine, synchronization and un-synchronization/starting
>>>   are at least 3x faster (in a non-trivial test): synchronization
>>>   ~600 us -> ~100 us and starting ~400 us -> ~100 us.
>>>   (The semaphore path is a bit slower than futex in the WaitBarrier on
>>>   Linux.)
>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>>   synchronization time on 16 strands and a ~5% score increase. In this
>>>   case the GC op is 1 ms, so we reduce the overhead of synchronization
>>>   from 100% to 10%.
>>> - SPECjbb2015 ParGC: ~9% increase in critical-jOPS.
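
The "100% to 10%" figure in the SPECjvm2008 item is simply the ratio of
synchronization time to operation time. A trivial helper (hypothetical, not
part of the patch) makes the arithmetic explicit:

```cpp
#include <cassert>
#include <cstdint>

// Synchronization overhead as a percentage of the operation itself.
// Times are in microseconds; integer math keeps the example exact.
inline uint64_t sync_overhead_percent(uint64_t sync_us, uint64_t op_us) {
  assert(op_us > 0);
  return (sync_us * 100) / op_us;
}
```

With the numbers quoted above, sync_overhead_percent(1000, 1000) gives 100
and sync_overhead_percent(100, 1000) gives 10.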
>>>
>>> Thanks, Robbin
>