RFR(XL): 8203469: Faster safepoints

Robbin Ehn robbin.ehn at oracle.com
Thu Jan 24 11:40:51 UTC 2019


Hi Karen,

On 1/23/19 10:34 PM, Karen Kinnear wrote:
> This looks really good. Delighted with performance and cleaner logic.

Thanks!

> 
> Couple of minor questions/comments:
> 
> 1. SafepointMechanism.inline.hpp
>    added an OrderAccess::loadload() in block_if_requested_local_poll()
>    do you also need one in block_if_requested() ?

Yes, thanks.
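For readers following along: the point of the loadload here can be sketched in standard C++. This is an illustrative sketch only, not HotSpot code; `poll_word` and `safepoint_state` are hypothetical names, and an acquire fence stands in for OrderAccess::loadload().

```cpp
#include <atomic>

// Illustrative sketch only (not HotSpot code): a loadload barrier keeps the
// second load from being reordered before the first. OrderAccess::loadload()
// is HotSpot's fence; in standard C++ an acquire fence between two relaxed
// loads gives the same load-load ordering guarantee.
std::atomic<int> poll_word{0};       // hypothetical poll word
std::atomic<int> safepoint_state{0}; // hypothetical state read after the poll

int read_poll_then_state() {
  int p = poll_word.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_acquire); // loadload barrier
  int s = safepoint_state.load(std::memory_order_relaxed);
  return p * 100 + s; // the state load cannot move before the poll load
}
```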

> 
> 2. Tested on ARM? Stress test the OrderAccess
>     Thank you for comments on OrderAccess lines - will help in future

AndrewH was going to test it; I have not heard from him yet.

> 
> 3. minor safepoint.cpp 749: resetted -> reset

Fixed.

> 
> 4. While you are in there
> Thank you for cleaning up CMS comments
> safepoint.hpp line 58 _synchronized // All Java threads are stopped at a 
> safepoint. Only VM thread in running
>     -> All Java threads are running in native, blocked in OS or stopped at safepoint
>      What other threads can run besides the VM thread at this point?
>      e.g. safepoint cleanup threads
>      e.g. any GC threads that can run during a safepoint?

Updated.

> 
> 5. Would it make sense to split the safepoint_safe and try_stable_load_state
> into code that works during a safepoint and separate logic that works not
> at a safepoint, for the InactiveSafepoint state?

The primary user of safepoint_safe() is handshakes. It has a second use-case in
an assert in JFR; if the previous usage was correct, it should still be.
That piece of JFR code should only be run inside a safepoint/handshake.
It's not used by the safepointing code at all. It only works when asking about a
thread with the poll armed, thus only handshakes and safepoints should ask this
_after_ arming.
(IMHO the JFR assert should be changed.)

v04 to RFR mail coming.

Thanks, Robbin

> 
> thanks,
> Karen
> 
>> On Jan 23, 2019, at 8:33 AM, Robbin Ehn <robbin.ehn at oracle.com 
>> <mailto:robbin.ehn at oracle.com>> wrote:
>>
>> Hi all, here is v03.
>>
>> It's contains the update from comments and:
>> I noticed safepoint.hpp contained wrong/not needed inline keywords on methods.
>> Those methods are either inline by default, because they are defined in the
>> declaration (header), or they are defined in the same cpp unit as their callers
>> and thus can be inlined anyway.
>>
>> http://cr.openjdk.java.net/~rehn/8203469/v03/inc/
>> http://cr.openjdk.java.net/~rehn/8203469/v03/
>>
>> Passes t1.
>>
>> Thanks, Robbin
>>
>> On 2019-01-15 11:39, Robbin Ehn wrote:
>>> Hi all, please review.
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>> Thanks to Dan for pre-reviewing a lot!
>>> Background:
>>> ZGC often does very short safepoint operations. For perspective, in a
>>> specJBB2015 run, G1 can have young collection stops lasting about 170 ms, while
>>> in the same setup ZGC does 0.2 ms to 1.5 ms operations, depending on which
>>> operation it is. The time it takes to stop and start the JavaThreads is very
>>> large relative to a ZGC safepoint. With an operation that takes just 0.2 ms, the
>>> overhead of stopping and starting JavaThreads is several times the operation itself.
>>> High-level functionality change:
>>> Serializing the starting over the Threads_lock takes time.
>>> - Don't wait on the Threads_lock; use the WaitBarrier.
>>> Serializing the stopping over the Safepoint_lock takes time.
>>> - Let threads stop in parallel; remove the Safepoint_lock.
>>> Details:
>>> JavaThreads have 2 abstract logical states: unsafe or safe.
>>> - Safe means the JavaThread will not touch the Java heap or VM internal structures
>>>   without doing a transition and blocking before doing so.
>>>         - The safe states are:
>>>                 - When polls are armed: _thread_in_native and _thread_blocked.
>>>                 - When the Threads_lock is held: the externally suspended flag is set.
>>>         - The VMThread has polls armed and holds the Threads_lock during a
>>>           safepoint.
>>> - Unsafe means that either the Java heap or VM internal structures can be accessed
>>>   by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>>         - All combinations that are not safe are unsafe.
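The classification above can be sketched as a predicate. Names and types here are illustrative, not HotSpot's:

```cpp
// Hypothetical sketch of the safe/unsafe classification described above.
enum ThreadState { _thread_in_Java, _thread_in_vm, _thread_in_native, _thread_blocked };

bool is_safepoint_safe(ThreadState state, bool polls_armed, bool externally_suspended) {
  // Safe when polls are armed: _thread_in_native and _thread_blocked.
  if (polls_armed && (state == _thread_in_native || state == _thread_blocked)) {
    return true;
  }
  // Safe while the Threads_lock is held: externally suspended flag is set.
  if (externally_suspended) {
    return true;
  }
  // Every other combination is unsafe (e.g. _thread_in_Java, _thread_in_vm).
  return false;
}
```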
>>> We cannot start a safepoint until all unsafe threads have transitioned to a safe
>>> state. To make them safe, we arm polls in compiled code and make sure any
>>> transition to another unsafe state will be blocked. A JavaThread which is unsafe
>>> with state _thread_in_Java may transition to _thread_in_native without being
>>> blocked, since it just became a safe thread and we can proceed. Any safe thread
>>> may try to transition at any time to an unsafe state, thus coming into the
>>> safepoint blocking code at any moment, e.g., after the safepoint is over, or
>>> even at the beginning of the next safepoint.
>>> The VMThread cannot tolerate false positives from the JavaThread thread state
>>> because that would mean starting the safepoint without all JavaThreads being
>>> safe. The two locks (Threads_lock and Safepoint_lock) make sure we never observe
>>> false positives from the safepoint blocking code. If we remove them, how do we
>>> handle false positives?
>>> By first publishing which barrier tag (safepoint counter) the thread will call
>>> WaitBarrier.wait() with as its safepoint id, and only then changing its state to
>>> _thread_blocked, the VMThread can ignore JavaThreads by doing a stable load of
>>> the state. A stable load of the thread state is successful if the thread's
>>> safepoint id is the same both before and after the load of the state, and the
>>> safepoint id is the current one or the InactiveSafepointCounter. If the stable
>>> load fails, the thread is considered safepoint unsafe. It's no longer enough
>>> that the thread has state _thread_blocked; it must also have the correct
>>> safepoint id before and after we read the state.
>>> Performance:
>>> The result of faster safepoints is that the average CPU time for JavaThreads
>>> between safepoints is higher, thus increasing the allocation rate. The thread
>>> that stops first waits a shorter time until it gets started. Even the thread
>>> that stops last has a shorter stop, since we start the threads faster. If your
>>> application is using a concurrent GC it may need re-tuning, since each Java
>>> worker thread has an increased CPU time/allocation rate. Often this means max
>>> performance is achieved using slightly fewer Java worker threads than before.
>>> Also, the increased allocation rate means shorter time between GC safepoints.
>>> - If you are using a non-concurrent GC, you should see improved latency and
>>>   throughput.
>>> - After re-tuning with a concurrent GC, throughput should be equal or better,
>>>   but with better latency. Bear in mind that this is a latency patch, not a
>>>   throughput one.
>>> With the current code a Java thread is not guaranteed to run between safepoints
>>> (in theory a Java thread can be starved indefinitely), since the VM thread may
>>> re-grab the Threads_lock before the Java thread has woken up from the previous
>>> safepoint. If the GC/VM doesn't respect the MMU (minimum mutator utilization)
>>> or if your machine is very over-provisioned, this can happen.
>>> The current scheme thus re-safepoints quickly if the Java threads have not
>>> started yet, at the cost of latency. Since the new code uses the WaitBarrier
>>> with the safepoint counter, all threads must roll forward to the next safepoint
>>> by getting at least some CPU time between two safepoints, meaning MMU violations
>>> are more obvious.
>>> Some example numbers:
>>> - On a 16-strand machine, synchronization and un-synchronization/starting is at
>>>   least 3x faster (in a non-trivial test): synchronization ~600 us -> ~100 us
>>>   and starting ~400 us -> ~100 us.
>>>   (The semaphore path is a bit slower than the futex path in the WaitBarrier on Linux.)
>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>>   synchronization time on 16 strands and a ~5% score increase. In this case the
>>>   GC op is 1 ms, so we reduce the overhead of synchronization from 100% to 10%.
>>> - specJBB2015 with ParGC: ~9% increase in critical-jops.
>>> Thanks, Robbin
> 

