RFR(XL): 8203469: Faster safepoints
Robbin Ehn
robbin.ehn at oracle.com
Mon Feb 11 13:13:06 UTC 2019
Hi Patricio,
Fixed and added a comment.
Thanks, Robbin
On 2/8/19 5:46 PM, Patricio Chilano wrote:
> Correction about the fence: I think we actually need it to avoid the change
> of state to _thread_blocked floating above frame_anchor()->make_walkable().
> So it should be:
>
> --- a/src/hotspot/share/runtime/interfaceSupport.inline.hpp
> +++ b/src/hotspot/share/runtime/interfaceSupport.inline.hpp
> @@ -314,8 +314,7 @@
> // Once we are blocked vm expects stack to be walkable
> thread->frame_anchor()->make_walkable(thread);
>
> - thread->set_thread_state((JavaThreadState)(_thread_in_vm + 1));
> - InterfaceSupport::serialize_thread_state_with_handler(thread);
> + OrderAccess::storestore();
>
> thread->set_thread_state(_thread_blocked);
>
>
> Otherwise, if we keep
> "InterfaceSupport::serialize_thread_state_with_handler(thread);", maybe we
> should also change the comment "// Make sure new state is seen by VM thread".
>
>
> Thanks,
> Patricio
>
> On 2/8/19 10:58 AM, Patricio Chilano wrote:
>> Hi Robbin,
>>
>> Version v06_2 looks good to me. One minor comment:
>>
>> --- a/src/hotspot/share/runtime/interfaceSupport.inline.hpp
>> +++ b/src/hotspot/share/runtime/interfaceSupport.inline.hpp
>> @@ -314,9 +314,6 @@
>> // Once we are blocked vm expects stack to be walkable
>> thread->frame_anchor()->make_walkable(thread);
>>
>> - thread->set_thread_state((JavaThreadState)(_thread_in_vm + 1));
>> - InterfaceSupport::serialize_thread_state_with_handler(thread);
>> -
>> thread->set_thread_state(_thread_blocked);
>>
>> Since we are not calling SS::block() anymore in the TBIVMWDC constructor,
>> we can remove setting the thread state to the temporary _thread_in_vm_trans
>> and also the fence after that.
>>
>>
>> Thanks,
>> Patricio
>>
>> On 2/7/19 11:05 AM, Robbin Ehn wrote:
>>> Hi all, here is the promised re-base (v06) on
>>> 8210832: Remove sneaky locking in class Monitor.
>>>
>>> v06_1 is just a straight re-base.
>>>
>>> Full:
>>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/full/
>>> Inc:
>>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/inc/
>>>
>>> Passes stress test and t1-5.
>>>
>>> But there is a 'better' way.
>>> Before I added the more graceful "_vm_wait->wait();" semaphore in the
>>> "while (_waiting_to_block > 0) {" loop, it was just a busy spin using the
>>> same back-off as the roll-forward loop. It turns out that we almost never
>>> spin here at all: by the time all Java threads are stopped, the callbacks
>>> are often already done. So the addition of the semaphore has no impact on
>>> our benchmarks and is mostly unused. This is because most threads are in
>>> Java, where we need to spin-wait anyway since they can elide into native
>>> without doing a callback. My proposed re-base removes the callbacks
>>> completely and lets the VM thread do all thread accounting. All that the
>>> stopping threads need to do is write their state and safepoint id;
>>> everything else is handled by the VM thread. We trade 2 atomics + a local
>>> store per thread against 2 stores per thread done by the VM thread. This
>>> makes it possible for a thread in vm to transition into blocked WITHOUT a
>>> safepoint poll: just set _thread_blocked and promise to do a safepoint
>>> poll when leaving that state.
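The thread-side publish described above can be sketched roughly like this (a hypothetical model, not the actual patch; names and the InactiveSafepointCounter value are illustrative): the stopping thread first publishes the safepoint id it will block on, then flips its state.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative model of the per-thread safepoint data; not HotSpot code.
enum State : uint8_t { THREAD_IN_VM = 0, THREAD_BLOCKED = 1 };

struct ThreadSFData {
  std::atomic<uint64_t> safepoint_id{0};   // 0 models InactiveSafepointCounter
  std::atomic<uint8_t>  state{THREAD_IN_VM};
};

void block_at_safepoint(ThreadSFData& t, uint64_t current_safepoint_id) {
  // 1. Publish which barrier tag (safepoint counter) we will wait on.
  t.safepoint_id.store(current_safepoint_id, std::memory_order_release);
  // 2. Only then publish the safe state.
  t.state.store(THREAD_BLOCKED, std::memory_order_release);
  // ... WaitBarrier.wait(current_safepoint_id) would follow here ...
}
```

Publishing the id before the state is what lets the VM thread later validate the pair with a stable load.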
>>>
>>> v06_2
>>> Full:
>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/full/
>>> Inc against v05:
>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/inc/
>>> Inc against v06_1:
>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/rebase_inc/
>>>
>>> v06_2 simplifies and removes ~200 LOC with the same performance.
>>> If there is a case where a thread in vm takes a long time, it already
>>> screws up latency and thus should be fixed regardless of v06_1 vs v06_2.
>>> So I see no reason why we should not push v06_2.
>>>
>>> Passes stress test and t1-5.
>>>
>>> Thanks, Robbin
>>>
>>>
>>> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>>>> Hi all, please review.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>>>
>>>> Thanks to Dan for pre-reviewing a lot!
>>>>
>>>> Background:
>>>> ZGC often does very short safepoint operations. For perspective, in a
>>>> specJBB2015 run, G1 can have young collection stops lasting about 170 ms,
>>>> while in the same setup ZGC does 0.2 ms to 1.5 ms operations depending on
>>>> which operation it is. The time it takes to stop and start the
>>>> JavaThreads is very large relative to a ZGC safepoint. With an operation
>>>> that takes just 0.2 ms, the overhead of stopping and starting the
>>>> JavaThreads is several times the cost of the operation itself.
>>>>
>>>> High-level functionality change:
>>>> Serializing the starting over the Threads_lock takes time.
>>>> - Don't wait on the Threads_lock; use the WaitBarrier.
>>>> Serializing the stopping over the Safepoint_lock takes time.
>>>> - Let threads stop in parallel; remove the Safepoint_lock.
>>>>
>>>> Details:
>>>> JavaThreads have 2 abstract logical states: unsafe or safe.
>>>> - Safe means the JavaThread will not touch the Java heap or VM internal
>>>> structures without doing a transition and blocking before doing so.
>>>> - The safe states are:
>>>> - When polls are armed: _thread_in_native and _thread_blocked.
>>>> - When the Threads_lock is held: the externally suspended flag is set.
>>>> - The VMThread has polls armed and holds the Threads_lock during a
>>>> safepoint.
>>>> - Unsafe means that either the Java heap or VM internal structures can be
>>>> accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>>> - All combinations that are not safe are unsafe.
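The classification above could be condensed into a predicate roughly like the following. This is a hypothetical sketch: the enum values are simplified and the externally-suspended/Threads_lock case is left out.

```cpp
// Hypothetical sketch of the safe/unsafe classification; the real HotSpot
// state values and the externally-suspended case are more involved.
enum JavaThreadState {
  _thread_in_Java,
  _thread_in_vm,
  _thread_in_native,
  _thread_blocked
};

// With polls armed, only _thread_in_native and _thread_blocked are safe;
// every other combination is unsafe.
bool is_safepoint_safe(JavaThreadState s, bool polls_armed) {
  return polls_armed && (s == _thread_in_native || s == _thread_blocked);
}
```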
>>>>
>>>> We cannot start a safepoint until all unsafe threads have transitioned to
>>>> a safe state. To make them safe, we arm polls in compiled code and make
>>>> sure any transition to another unsafe state will be blocked. A JavaThread
>>>> which is unsafe with state _thread_in_Java may transition to
>>>> _thread_in_native without being blocked, since it thereby becomes a safe
>>>> thread and we can proceed. Any safe thread may try to transition at any
>>>> time to an unsafe state, thus coming into the safepoint blocking code at
>>>> any moment, e.g., after the safepoint is over, or even at the beginning
>>>> of the next safepoint.
>>>>
>>>> The VMThread cannot tolerate false positives from the JavaThread thread
>>>> state, because that would mean starting the safepoint without all
>>>> JavaThreads being safe. The two locks (Threads_lock and Safepoint_lock)
>>>> make sure we never observe false positives from the safepoint blocking
>>>> code. If we remove them, how do we handle false positives?
>>>>
>>>> By first publishing which barrier tag (safepoint counter) we will call
>>>> WaitBarrier.wait() with as the thread's safepoint id, and only then
>>>> changing the state to _thread_blocked, the VMThread can ignore
>>>> JavaThreads by doing a stable load of the state. A stable load of the
>>>> thread state is successful if the thread's safepoint id is the same both
>>>> before and after the load of the state, and the safepoint id is the
>>>> current one or InactiveSafepointCounter. If the stable load fails, the
>>>> thread is considered safepoint unsafe. It's no longer enough that a
>>>> thread has state _thread_blocked; it must also have the correct safepoint
>>>> id both before and after we read the state.
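The VM-thread side of that "stable load" might look roughly like this sketch (hypothetical names mirroring the thread-side publish, not the actual HotSpot code): the state read only counts if the published safepoint id is valid and unchanged around it.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative model; not HotSpot code. 0 models InactiveSafepointCounter
// and state value 1 models _thread_blocked.
constexpr uint64_t kInactiveSafepointCounter = 0;

struct SFState {
  std::atomic<uint64_t> safepoint_id{kInactiveSafepointCounter};
  std::atomic<uint8_t>  state{0};
};

bool is_stably_blocked(const SFState& t, uint64_t current_id) {
  uint64_t id_before = t.safepoint_id.load(std::memory_order_acquire);
  if (id_before != current_id && id_before != kInactiveSafepointCounter)
    return false;                          // wrong tag: treat as unsafe
  uint8_t  s        = t.state.load(std::memory_order_acquire);
  uint64_t id_after = t.safepoint_id.load(std::memory_order_acquire);
  // Safe only if blocked AND the id did not change around the state load.
  return s == 1 && id_before == id_after;
}
```

A false negative here is harmless (the VMThread just retries later); the double-check rules out the false positives the two locks used to prevent.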
>>>>
>>>> Performance:
>>>> The result of faster safepoints is that the average CPU time for
>>>> JavaThreads between safepoints is higher, thus increasing the allocation
>>>> rate. The thread that stops first waits a shorter time until it gets
>>>> started. Even the thread that stops last has a shorter stop, since we
>>>> start threads faster. If your application is using a concurrent GC it may
>>>> need re-tuning, since each Java worker thread has an increased CPU
>>>> time/allocation rate. Often this means max performance is achieved using
>>>> slightly fewer Java worker threads than before. Also, the increased
>>>> allocation rate means shorter time between GC safepoints.
>>>> - If you are using a non-concurrent GC, you should see improved latency
>>>> and throughput.
>>>> - After re-tuning with a concurrent GC, throughput should be equal or
>>>> better, with better latency. But bear in mind this is a latency patch,
>>>> not a throughput one.
>>>> With the current code a Java thread is not guaranteed to run between
>>>> safepoints (in theory a Java thread can be starved indefinitely), since
>>>> the VMThread may re-grab the Threads_lock before the Java thread has
>>>> woken up from the previous safepoint. This can happen if the GC/VM
>>>> doesn't respect MMU (minimum mutator utilization) or if your machine is
>>>> very over-provisioned.
>>>> The current scheme thus re-safepoints quickly if the Java threads have
>>>> not started yet, at the cost of latency. Since the new code uses the
>>>> WaitBarrier with the safepoint counter, all threads must roll forward to
>>>> the next safepoint, getting at least some CPU time between two
>>>> safepoints. This means MMU violations are more obvious.
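A toy model of a barrier keyed on the safepoint counter, to illustrate why late arrivals cannot be stranded (assumed shape only; the real WaitBarrier uses futex/semaphore fast paths): a waiter blocks only while the armed tag still equals the counter it arrived with, so a thread showing up after disarm, or for a different safepoint, returns immediately.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Toy mutex/condvar model of a tag-keyed barrier; the real WaitBarrier is
// futex- or semaphore-based. An armed tag of 0 means "disarmed".
class WaitBarrierSketch {
  std::mutex mu_;
  std::condition_variable cv_;
  uint64_t armed_tag_ = 0;

 public:
  void arm(uint64_t safepoint_counter) {
    std::lock_guard<std::mutex> g(mu_);
    armed_tag_ = safepoint_counter;
  }

  void disarm_and_wake() {
    { std::lock_guard<std::mutex> g(mu_); armed_tag_ = 0; }
    cv_.notify_all();                      // release all current waiters
  }

  // Blocks only while the armed tag matches the tag we arrived with; a
  // stale or mismatched tag returns immediately.
  void wait(uint64_t safepoint_counter) {
    std::unique_lock<std::mutex> g(mu_);
    cv_.wait(g, [&] { return armed_tag_ != safepoint_counter; });
  }
};
```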
>>>>
>>>> Some example numbers:
>>>> - On a 16-strand machine, synchronization and un-synchronization/starting
>>>> are at least 3x faster (in a non-trivial test): synchronization
>>>> ~600 us -> ~100 us and starting ~400 us -> ~100 us.
>>>> (The semaphore path is a bit slower than futex in the WaitBarrier on
>>>> Linux.)
>>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>>> synchronization time on 16 strands and a ~5% score increase. In this
>>>> case the GC op is 1 ms, so we reduce the overhead of synchronization
>>>> from 100% to 10%.
>>>> - specJBB2015 ParGC: ~9% increase in critical-jops.
>>>>
>>>> Thanks, Robbin
>>
>