RFR(XL): 8203469: Faster safepoints

Daniel D. Daugherty daniel.daugherty at oracle.com
Fri Feb 8 21:27:50 UTC 2019


On 2/8/19 3:17 PM, Patricio Chilano wrote:
> Hi Dan,
>
> On 2/8/19 3:01 PM, Daniel D. Daugherty wrote:
>> On 2/8/19 2:33 PM, Daniel D. Daugherty wrote:
>>> On 2/7/19 11:05 AM, Robbin Ehn wrote:
>>>> Hi all, here is the promised re-base (v06) on
>>>> 8210832: Remove sneaky locking in class Monitor.
>>>>
>>>> v06_1 is just a straight re-base.
>>>>
>>>> Full:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/full/
>>>> Inc:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_1/inc/
>>>
>>> Something is wrong with this incremental webrev. I was expecting an
>>> incremental webrev relative to the v05 version, but that's not
>>> what I see in src/hotspot/share/runtime/safepoint.cpp. In that file
>>> I'm seeing changes relative to the baseline, e.g., deletion of the
>>> Safepoint_lock, etc.
>>
>> So I dropped back to the full patch for v05 and tried to compare that
>> to the v06_1/full patch above using jfilemerge. That's not working for
>> me either.
>>
>> Next I'm going to just look at the v06_1/full webrev and see if that
>> makes sense.
> The incremental webrev for v06_1 also contains the conflict 
> resolutions from the failed hg merge between 8210832 and this change. 
> I noticed that yesterday too with safepoint.hpp and asked Robbin. Hope 
> that makes more sense.

I read that yesterday, I think... It makes more sense, but I still found
the 'inc' webrev too difficult/jarring to review...

Dan


>
>
> Patricio
>> Dan
>>
>>
>>>
>>> Dan
>>>
>>>
>>>
>>>>
>>>> Passes stress test and t1-5.
>>>>
>>>> But there is a 'better' way.
>>>> Before I added the more graceful "_vm_wait->wait();" semaphore in the
>>>> while (_waiting_to_block > 0) { loop, it was just a busy spin using
>>>> the same back-off as the rolling-forward loop. It turns out that we
>>>> almost never spin here at all: by the time all Java threads have
>>>> stopped, the callbacks are often already done. So the addition of the
>>>> semaphore has no impact on our benchmarks and is mostly unused. This
>>>> is because most threads are in Java, which we need to spin-wait on
>>>> anyway, since they can elide into native without doing a callback. My
>>>> proposed re-base removes the callbacks completely and lets the VM
>>>> thread do all thread accounting. All a stopping thread needs to do is
>>>> write its state and safepoint id; everything else is handled by the VM
>>>> thread. We trade 2 atomics + a local store per thread against 2 stores
>>>> per thread done by the VM thread. This makes it possible for a thread
>>>> in vm to transition into blocked WITHOUT a safepoint poll: just set
>>>> _thread_blocked and promise to do a safepoint poll when leaving that
>>>> state.
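[Editor's note: the division of labor described above can be sketched as follows. This is a minimal single-threaded illustration, not HotSpot's actual code; all names (`ThreadSketch`, `stop_for_safepoint`, `count_unsafe`) are hypothetical, and a single flag stands in for the real state-plus-safepoint-id pair.]

```cpp
#include <atomic>
#include <vector>

// Sketch of "let the VM thread do all thread accounting": a stopping
// JavaThread only publishes that it has reached a safe state; the
// VMThread alone counts how many threads are still unsafe, instead of
// each thread atomically decrementing a shared _waiting_to_block counter.
struct ThreadSketch {
  std::atomic<bool> at_safe_state{false};  // stand-in for state + safepoint id
};

// JavaThread side: one plain store, no atomic read-modify-write, no lock.
void stop_for_safepoint(ThreadSketch& t) {
  t.at_safe_state.store(true, std::memory_order_release);
}

// VMThread side: it alone does the accounting, re-scanning the thread
// list until every thread has published a safe state.
int count_unsafe(const std::vector<ThreadSketch*>& threads) {
  int unsafe = 0;
  for (const ThreadSketch* t : threads) {
    if (!t->at_safe_state.load(std::memory_order_acquire)) unsafe++;
  }
  return unsafe;
}
```

The point of the trade is that the atomic decrement (and its contention) moves off the stopping threads entirely; the VM thread pays with extra loads, which it does anyway while spin-waiting.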
>>>>
>>>> v06_2
>>>> Full:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/full/
>>>> Inc against v05:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/inc/
>>>> Inc against v06_1:
>>>> http://cr.openjdk.java.net/~rehn/8203469/v06_2/rebase_inc/
>>>>
>>>> v06_2 simplifies the code and removes ~200 LOC with the same
>>>> performance. If there is a case where a thread in vm takes a long
>>>> time, it will already screw up latency and thus should be fixed
>>>> regardless of v06_1 vs v06_2. So I see no reason why we should not
>>>> push v06_2.
>>>>
>>>> Passes stress test and t1-5.
>>>>
>>>> Thanks, Robbin
>>>>
>>>>
>>>> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>>>>> Hi all, please review.
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>>>>
>>>>> Thanks to Dan for pre-reviewing a lot!
>>>>>
>>>>> Background:
>>>>> ZGC often does very short safepoint operations. For perspective, in
>>>>> a specJBB2015 run, G1 can have young collection stops lasting about
>>>>> 170 ms, while in the same setup ZGC does 0.2 ms to 1.5 ms operations,
>>>>> depending on which operation it is. The time it takes to stop and
>>>>> start the JavaThreads is very large relative to a ZGC safepoint. With
>>>>> an operation that takes just 0.2 ms, the overhead of stopping and
>>>>> starting the JavaThreads is several times the cost of the operation
>>>>> itself.
>>>>>
>>>>> High-level functionality change:
>>>>> Serializing the starting over the Threads_lock takes time.
>>>>> - Don't wait on the Threads_lock; use the WaitBarrier.
>>>>> Serializing the stopping over the Safepoint_lock takes time.
>>>>> - Let threads stop in parallel; remove the Safepoint_lock.
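[Editor's note: a minimal sketch of the WaitBarrier idea: the VMThread arms the barrier with the safepoint counter as a tag, JavaThreads wait on that tag, and disarming releases all waiters in parallel with no lock serializing the wakeup. The class and method names here are illustrative, and a spin loop stands in for the futex/semaphore primitives the real WaitBarrier uses.]

```cpp
#include <atomic>
#include <cstdint>

// Illustrative stand-in for HotSpot's WaitBarrier.
class WaitBarrierSketch {
  std::atomic<int64_t> _armed_tag{0};
public:
  // VMThread: arm with the current safepoint counter before stopping threads.
  void arm(int64_t tag)    { _armed_tag.store(tag, std::memory_order_release); }
  // VMThread: disarm at the end of the safepoint; all waiters proceed at once.
  void disarm()            { _armed_tag.store(0, std::memory_order_release); }
  bool is_armed_with(int64_t tag) const {
    return _armed_tag.load(std::memory_order_acquire) == tag;
  }
  // JavaThread: wait until the barrier is no longer armed with this tag.
  // Returns immediately if that tag's safepoint is already over, so a
  // late-arriving thread cannot block on a stale safepoint.
  void wait(int64_t tag) const {
    while (is_armed_with(tag)) { /* spin; the real code blocks instead */ }
  }
};
```

Because the tag is the safepoint counter, a thread can never be woken by, or wait on, the wrong safepoint: a mismatched tag simply falls through.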
>>>>>
>>>>> Details:
>>>>> JavaThreads have 2 abstract logical states: unsafe or safe.
>>>>> - Safe means the JavaThread will not touch the Java heap or VM
>>>>>   internal structures without doing a transition and blocking before
>>>>>   doing so.
>>>>>         - The safe states are:
>>>>>                 - When polls are armed: _thread_in_native and
>>>>>                   _thread_blocked.
>>>>>                 - When the Threads_lock is held: the externally
>>>>>                   suspended flag is set.
>>>>>         - The VMThread has polls armed and holds the Threads_lock
>>>>>           during a safepoint.
>>>>> - Unsafe means that either the Java heap or VM internal structures
>>>>>   can be accessed by the JavaThread, e.g., _thread_in_Java,
>>>>>   _thread_in_vm.
>>>>>         - All combinations that are not safe are unsafe.
>>>>>
>>>>> We cannot start a safepoint until all unsafe threads have
>>>>> transitioned to a safe state. To make them safe, we arm polls in
>>>>> compiled code and make sure any transition to another unsafe state
>>>>> will be blocked. A JavaThread which is unsafe with state
>>>>> _thread_in_Java may transition to _thread_in_native without being
>>>>> blocked, since it has just become a safe thread and we can proceed.
>>>>> Any safe thread may try to transition at any time to an unsafe state,
>>>>> thus coming into the safepoint blocking code at any moment, e.g.,
>>>>> after the safepoint is over, or even at the beginning of the next
>>>>> safepoint.
>>>>>
>>>>> The VMThread cannot tolerate false positives from the JavaThread
>>>>> thread state, because that would mean starting the safepoint without
>>>>> all JavaThreads being safe. The two locks (Threads_lock and
>>>>> Safepoint_lock) make sure we never observe false positives from the
>>>>> safepoint blocking code; if we remove them, how do we handle false
>>>>> positives?
>>>>>
>>>>> By first publishing which barrier tag (safepoint counter) we will
>>>>> call WaitBarrier.wait() with as the thread's safepoint id, and only
>>>>> then changing the state to _thread_blocked, the VMThread can safely
>>>>> disregard a JavaThread by doing a stable load of its state. A stable
>>>>> load of the thread state is successful if the thread's safepoint id
>>>>> is the same both before and after the load of the state, and the
>>>>> safepoint id is either the current one or InactiveSafepointCounter.
>>>>> If the stable load fails, the thread is considered safepoint unsafe.
>>>>> It is no longer enough that a thread has state _thread_blocked; it
>>>>> must also have the correct safepoint id both before and after we read
>>>>> the state.
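[Editor's note: a sketch of the publish order and the stable load described above. The struct and function names are hypothetical, the states are reduced to the ones that matter here, and InactiveSafepointCounter's value is an assumption for this illustration; only the id-before / load-state / id-after shape follows the text.]

```cpp
#include <atomic>
#include <cstdint>

enum StateSketch { _in_Java, _in_native, _blocked };

const int64_t InactiveSafepointCounter = 0;  // illustrative value

struct JavaThreadSketch {
  std::atomic<int64_t> safepoint_id{InactiveSafepointCounter};
  std::atomic<int>     state{_in_Java};
};

// JavaThread side: publish the barrier tag (safepoint counter) FIRST,
// then the state. This ordering is what makes the stable load sound.
void block_at_safepoint(JavaThreadSketch& t, int64_t current_id) {
  t.safepoint_id.store(current_id, std::memory_order_release);
  t.state.store(_blocked, std::memory_order_release);
  // ... the thread would now call WaitBarrier.wait(current_id) ...
}

// VMThread side: the state only counts if the safepoint id reads the
// same before and after the state load, and is either the current id or
// InactiveSafepointCounter. Any failure means "treat as unsafe".
bool is_stably_safe(const JavaThreadSketch& t, int64_t current_id) {
  int64_t id_before = t.safepoint_id.load(std::memory_order_acquire);
  int     st        = t.state.load(std::memory_order_acquire);
  int64_t id_after  = t.safepoint_id.load(std::memory_order_acquire);
  if (id_before != id_after) return false;             // caught mid-transition
  if (id_before != current_id &&
      id_before != InactiveSafepointCounter) return false;  // stale tag
  return st == _blocked || st == _in_native;
}
```

A false positive would require the VMThread to read _blocked together with a matching id from a thread that is actually unsafe; the publish order rules that out, so the worst case is a false negative, which just means one more scan.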
>>>>>
>>>>> Performance:
>>>>> The result of faster safepoints is that the average CPU time for
>>>>> JavaThreads between safepoints is higher, thus increasing the
>>>>> allocation rate. The thread that stops first waits a shorter time
>>>>> until it gets started. Even the thread that stops last has a shorter
>>>>> stop, since we start the threads faster. If your application is using
>>>>> a concurrent GC it may need re-tuning, since each Java worker thread
>>>>> has an increased CPU time/allocation rate. Often this means max
>>>>> performance is achieved using slightly fewer Java worker threads than
>>>>> before. Also, the increased allocation rate means a shorter time
>>>>> between GC safepoints.
>>>>> - If you are using a non-concurrent GC, you should see improved
>>>>>   latency and throughput.
>>>>> - After re-tuning with a concurrent GC, throughput should be equal or
>>>>>   better, but with better latency. Bear in mind, though, that this is
>>>>>   a latency patch, not a throughput one.
>>>>> With the current code a Java thread is not guaranteed to run between
>>>>> safepoints (in theory a Java thread can be starved indefinitely),
>>>>> since the VM thread may re-grab the Threads_lock before the Java
>>>>> thread has woken up from the previous safepoint. If the GC/VM does
>>>>> not respect the MMU (minimum mutator utilization) or if your machine
>>>>> is very over-provisioned, this can happen.
>>>>> The current scheme thus re-safepoints quickly if the Java threads
>>>>> have not started yet, at the cost of latency. Since the new code uses
>>>>> the WaitBarrier with the safepoint counter, all threads must roll
>>>>> forward to the next safepoint by getting at least some CPU time
>>>>> between two safepoints, meaning MMU violations are more obvious.
>>>>>
>>>>> Some examples on numbers:
>>>>> - On a 16-strand machine, synchronization and
>>>>>   un-synchronization/starting is at least 3x faster (in a non-trivial
>>>>>   test): synchronization ~600 us -> ~100 us and starting
>>>>>   ~400 us -> ~100 us.
>>>>>   (The semaphore path is a bit slower than the futex in the
>>>>>   WaitBarrier on Linux.)
>>>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>>>>   synchronization time on 16 strands and a ~5% score increase. In
>>>>>   this case the GC op is 1 ms, so we reduce the overhead of
>>>>>   synchronization from 100% to 10%.
>>>>> - specJBB2015 ParGC: ~9% increase in critical-jops.
>>>>>
>>>>> Thanks, Robbin
>>>
>>
>



More information about the hotspot-dev mailing list