RFR(XL): 8203469: Faster safepoints
Robbin Ehn
robbin.ehn at oracle.com
Fri Feb 15 09:07:52 UTC 2019
Hi Karen,
On 2/14/19 11:11 PM, Karen Kinnear wrote:
> Robbin,
>
> Went over V06_2_u1 and it looks good to me too. This is a major improvement! Many thanks!
> Thank you for adding so many assertions and comments.
> I don’t need to see a new webrev
>
> Minor comments
>
> interfaceSupport.inline.hpp
> 319: “enought” -> “enough”
>
> Safepoint.cpp line 760
> “should have already have” -> “should already have”
Fixed!
>
> thank you so much!
> Karen
Thanks, Robbin
>
>> On Feb 12, 2019, at 7:38 AM, David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Robbin,
>>
>> I've gone through v06_2_u1 one more time and overall I think things generally look good.
>>
>> One or two nits on naming but nothing worth haggling over :)
>>
>> Thanks,
>> David
>>
>> On 12/02/2019 7:28 am, David Holmes wrote:
>>> On 12/02/2019 6:14 am, Robbin Ehn wrote:
>>>> Hi all,
>>>>
>>>> Update of v2:
>>>> Full:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2_u1/full/
>>>> (open.changeset still two patches, e.g. if you look at interfaceSupport.inline.hpp it's patched twice)
>>> Simplified version:
>>> http://cr.openjdk.java.net/~dholmes/8203469/webrev.v06_2_u1/
>>> David
>>> -----
>>>> Inc:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2_u1/inc/
>>>>
>>>> Passes several hours more stress testing and t1-5, KS 24H stress still running.
>>>>
>>>> I did update alternative one also with Dan's feedback, and it also still passes stress tests and t1-5.
>>>> I'll leave that unpublished since we are focusing on this version where we can get some simplifications.
>>>>
>>>> Thanks, Robbin
>>>>
>>>> On 2019-02-07 17:05, Robbin Ehn wrote:
>>>>> Hi all, here is the promised re-base (v06) on
>>>>> 8210832: Remove sneaky locking in class Monitor.
>>>>>
>>>>> v06_1 is just a straight re-base.
>>>>>
>>>>> Full:
>>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/full/
>>>>> Inc:
>>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/inc/
>>>>>
>>>>> Passes stress test and t1-5.
>>>>>
>>>>> But there is a 'better' way.
>>>>> Before I added the more graceful "_vm_wait->wait();" semaphore in the while
>>>>> (_waiting_to_block > 0) { loop, it was just a busy spin using the same
>>>>> back-off as the rolling forward loop. It turns out that we mostly never spin
>>>>> here at all; when all Java threads are stopped, the callbacks are often already done.
>>>>> So the addition of the semaphore has no impact on our benchmarks and is mostly
>>>>> unused. This is because most threads are in Java, which we need to spin-wait on,
>>>>> since they can elide into native without doing a callback. My proposed re-base
>>>>> removes the callbacks completely and lets the VM thread do all thread
>>>>> accounting. All the stopping threads need to do is write their state and
>>>>> safepoint id; everything else is handled by the VM thread. We trade 2 atomics +
>>>>> a local store per thread against 2 stores per thread done by the VM thread.
>>>>> This makes it possible for a thread in vm to transition into blocked WITHOUT a
>>>>> safepoint poll. It just sets _thread_blocked and promises to do a safepoint poll
>>>>> when leaving that state.
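As a rough C++ sketch of the thread-side publication described above (names and layout are illustrative only, not the actual HotSpot code): the stopping thread needs just two ordered plain stores, id first, then state.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch only -- not the actual HotSpot sources. A stopping
// JavaThread publishes the safepoint id it will wait on, then stores the
// blocked state. Two ordered plain stores on the thread side replace the
// old atomic callback accounting.
enum SketchThreadState : int {
  _sketch_in_Java, _sketch_in_vm, _sketch_in_native, _sketch_blocked
};

struct JavaThreadSketch {
  std::atomic<uint64_t> _safepoint_id{0};  // barrier tag the thread will wait on
  std::atomic<int>      _state{_sketch_in_Java};

  // On the path to blocking: publish the id first, then the state, so the
  // VM thread's id/state/id reads can detect a racing transition.
  void block_for_safepoint(uint64_t current_safepoint_id) {
    _safepoint_id.store(current_safepoint_id, std::memory_order_release);
    _state.store(_sketch_blocked, std::memory_order_release);
    // ...then WaitBarrier.wait(current_safepoint_id); a safepoint poll is
    // promised when leaving the blocked state.
  }
};
```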
>>>>>
>>>>> v06_2
>>>>> Full:
>>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/full/
>>>>> Inc against v05:
>>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/inc/
>>>>> Inc against v06_1:
>>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/rebase_inc/
>>>>>
>>>>> v06_2 simplifies and removes ~200 LOC with the same performance.
>>>>> If there is a case with a thread in vm taking a long time, it will already
>>>>> screw up latency and thus should be fixed regardless of v06_1 vs v06_2. So I
>>>>> see no reason why we should not push v06_2.
>>>>>
>>>>> Passes stress test and t1-5.
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>
>>>>> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>>>>>> Hi all, please review.
>>>>>>
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>>>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>>>>>
>>>>>> Thanks to Dan for pre-reviewing a lot!
>>>>>>
>>>>>> Background:
>>>>>> ZGC often does very short safepoint operations. For perspective, in a
>>>>>> specJBB2015 run, G1 can have young collection stops lasting about 170 ms, while
>>>>>> in the same setup ZGC does 0.2 ms to 1.5 ms operations depending on which
>>>>>> operation it is. The time it takes to stop and start the JavaThreads is very
>>>>>> large relative to a ZGC safepoint. With an operation that takes just 0.2 ms, the
>>>>>> overhead of stopping and starting JavaThreads is several times the cost of the
>>>>>> operation itself.
>>>>>>
>>>>>> High-level functionality change:
>>>>>> Serializing the starting over the Threads_lock takes time.
>>>>>> - Don't wait on the Threads_lock; use the WaitBarrier.
>>>>>> Serializing the stopping over the Safepoint_lock takes time.
>>>>>> - Let threads stop in parallel; remove the Safepoint_lock.
>>>>>>
>>>>>> Details:
>>>>>> JavaThreads have 2 abstract logical states: unsafe or safe.
>>>>>> - Safe means the JavaThread will not touch the Java heap or VM internal structures
>>>>>> without doing a transition and blocking before doing so.
>>>>>> - The safe states are:
>>>>>> - When polls are armed: _thread_in_native and _thread_blocked.
>>>>>> - When the Threads_lock is held: the externally suspended flag is set.
>>>>>> - The VMThread has polls armed and holds the Threads_lock during a
>>>>>> safepoint.
>>>>>> - Unsafe means that either the Java heap or VM internal structures can be accessed
>>>>>> by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>>>>> - All combinations that are not safe are unsafe.
>>>>>>
>>>>>> We cannot start a safepoint until all unsafe threads have transitioned to a safe
>>>>>> state. To make them safe, we arm polls in compiled code and make sure any
>>>>>> transition to another unsafe state will be blocked. JavaThreads which are unsafe
>>>>>> with state _thread_in_Java may transition to _thread_in_native without being
>>>>>> blocked, since the thread just became safe and we can proceed. Any safe thread
>>>>>> may try to transition at any time to an unsafe state, thus coming into the
>>>>>> safepoint blocking code at any moment, e.g., after the safepoint is over, or
>>>>>> even at the beginning of the next safepoint.
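The taxonomy above can be condensed into a tiny predicate (illustrative and simplified; the real check also covers suspension while the Threads_lock is held): with polls armed, only the native and blocked states count as safepoint safe.

```cpp
// Illustrative only, derived from the description above.
enum SafepointSketchState { in_Java, in_vm, in_native, blocked };

// With polls armed, a thread in native or blocked cannot re-enter the Java
// heap or VM structures without transitioning (and hitting the armed poll),
// so these states are safe; everything else is unsafe.
static bool is_safepoint_safe(SafepointSketchState s) {
  return s == in_native || s == blocked;
}
```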
>>>>>>
>>>>>> The VMThread cannot tolerate false positives from the JavaThread thread state,
>>>>>> because that would mean starting the safepoint without all JavaThreads being
>>>>>> safe. The two locks (Threads_lock and Safepoint_lock) make sure we never observe
>>>>>> false positives from the safepoint blocking code. If we remove them, how do we
>>>>>> handle false positives?
>>>>>>
>>>>>> By first publishing which barrier tag (safepoint counter) we will call
>>>>>> WaitBarrier.wait() with as the thread's safepoint id, and then changing the state
>>>>>> to _thread_blocked, the VMThread can ignore JavaThreads by doing a stable load of
>>>>>> the state. A stable load of the thread state is successful if the thread's
>>>>>> safepoint id is the same both before and after the load of the state, and the
>>>>>> safepoint id is current or InactiveSafepointCounter. If the stable load fails,
>>>>>> the thread is considered safepoint unsafe. It's no longer enough that a thread
>>>>>> has state _thread_blocked; it must also have the correct safepoint id before and
>>>>>> after we read the state.
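A rough sketch of that stable load on the VMThread side (names are made up for illustration; the real code differs): read the published id, then the state, then the id again, and trust a safe-looking state only if the id was unchanged and is either the current counter or InactiveSafepointCounter.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch of the stable-load check described above.
constexpr uint64_t InactiveSafepointCounter = 0;
constexpr int      kBlockedState = 4;  // stand-in for _thread_blocked

struct PolledThread {
  std::atomic<uint64_t> safepoint_id{InactiveSafepointCounter};
  std::atomic<int>      state{0};
};

// id/state/id read sequence: if the id changed, or belongs to another
// safepoint, the thread may be racing a transition and is treated as unsafe.
static bool stable_load_is_safe(const PolledThread& t, uint64_t current_id) {
  uint64_t id_before = t.safepoint_id.load(std::memory_order_acquire);
  int      s         = t.state.load(std::memory_order_acquire);
  uint64_t id_after  = t.safepoint_id.load(std::memory_order_acquire);
  bool id_stable = id_before == id_after &&
                   (id_before == current_id ||
                    id_before == InactiveSafepointCounter);
  return id_stable && s == kBlockedState;
}
```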
>>>>>>
>>>>>> Performance:
>>>>>> The result of faster safepoints is that the average CPU time for JavaThreads
>>>>>> between safepoints is higher, thus increasing the allocation rate. The thread
>>>>>> that stops first waits a shorter time until it gets started. Even the thread that
>>>>>> stops last has a shorter stop, since we start threads faster. If your
>>>>>> application is using a concurrent GC it may need re-tuning, since each Java
>>>>>> worker thread has an increased CPU time/allocation rate. Often this means max
>>>>>> performance is achieved using slightly fewer Java worker threads than before.
>>>>>> Also, the increased allocation rate means a shorter time between GC safepoints.
>>>>>> - If you are using a non-concurrent GC, you should see improved latency and
>>>>>> throughput.
>>>>>> - After re-tuning with a concurrent GC, throughput should be equal or better, but
>>>>>> with better latency. But bear in mind this is a latency patch, not a
>>>>>> throughput one.
>>>>>> With the current code a Java thread is not guaranteed to run between safepoints
>>>>>> (in theory a Java thread can be starved indefinitely), since the VM thread may
>>>>>> re-grab the Threads_lock before the Java thread has woken up from the previous
>>>>>> safepoint. If the GC/VM don't respect the MMU (minimum mutator utilization) or if
>>>>>> your machine is very over-provisioned, this can happen.
>>>>>> The current scheme thus re-safepoints quickly if the Java threads have not
>>>>>> started yet, at the cost of latency. Since the new code uses the WaitBarrier with
>>>>>> the safepoint counter, all threads must roll forward to the next safepoint by
>>>>>> getting at least some CPU time between two safepoints. Meaning MMU violations
>>>>>> are more obvious.
>>>>>>
>>>>>> Some example numbers:
>>>>>> - On a 16 strand machine, synchronization and un-synchronization/starting is at
>>>>>> least 3x faster (in a non-trivial test): synchronization ~600 us -> ~100 us and
>>>>>> starting ~400 us -> ~100 us.
>>>>>> (The semaphore path is a bit slower than the futex path in the WaitBarrier on Linux.)
>>>>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>>>>> synchronization time on 16 strands and a ~5% score increase. In this case the GC
>>>>>> op is 1 ms, so we reduce the overhead of synchronization from 100% to 10%.
>>>>>> - specJBB2015 ParGC ~9% increase in critical-jops.
>>>>>>
>>>>>> Thanks, Robbin
>