RFR(XL): 8203469: Faster safepoints

Wed Jan 23 09:42:21 UTC 2019

Hi Robbin,

Thanks for all the work on this! This is looking really good.

I have one concern and that is the fact the WaitBarrier only uses an int 
as a tag, but the _safepoint_counter is a uint64_t. Seems to me that 
once the _safepoint_counter rolls over to need 33-bits then casting it 
to an int for the tag is going to give the dis-allowed zero value. ??

Specific comments below.

Oh and can you add all the high-level description in your initial RFR 
email to the bug report please. Thanks.

On 23/01/2019 1:39 am, Robbin Ehn wrote:
> Hi all, here is v01 and v02.
> 
> v01 contains update after comments from list:
> http://cr.openjdk.java.net/~rehn/8203469/v01/
> http://cr.openjdk.java.net/~rehn/8203469/v01/inc/
> 
> v02 contains a bug fix, explained below:
> http://cr.openjdk.java.net/~rehn/8203469/v02/
> http://cr.openjdk.java.net/~rehn/8203469/v02/inc/

Minor comments:

src/hotspot/share/runtime/safepoint.cpp

check_thread_safepoint_state needs a better name - what is it checking 
the state for?

---

  368   Thread* myThread = Thread::current();
  369   assert(myThread->is_VM_thread(), "Only VM thread may execute a 
safepoint");

  558   DEBUG_ONLY(Thread* myThread = Thread::current();)
  559   assert(myThread->is_VM_thread(), "Only VM thread can execute a 
safepoint");

You don't need the myThread local variables.

That's it! :)  (Thanks to Dan for tackling the updates to the commentary 
;-) ).

Thanks,
David
-----

> Patricio had some good questions about try_stable_load_state.
> In previous internal versions I have done the stable load by loading 
> thread state before and after safepoint id. For some reason I changed 
> during a
> refactoring to the reverse, which is incorrect. Consider the following:
> 
> JavaThread: state / safepoint id / poll |VMThread: global state / 
> safepoint counter / WaitBarrier
> ########################################|################################
> _thread_in_native       / 0 / disarmed  | _not_synchronized / 0 / disarmed
>                                          | _not_synchronized / 0 / armed(1)
>                                          | _not_synchronized / 1 / armed(1)
>                                          | _synchronizing    / 1 / armed(1)
> _thread_in_native       / 0 / armed     |
>                                          | <LOAD JavaThread safepoint id:0>
>                                          | <LOAD JavaThread thread state 
> id:_thread_in_native>
>                                          | <LOAD JavaThread safepoint id:0>
>                                          | _synchonized      / 1 / armed(1)
> <JavaThread transistion to VM>          |
> _thread_in_native_trans / 0 / armed     |
> <LOAD safepoint counter(1)>             |
> <JavaThread goes off-proc>              |
>                                          | _not_synchonized  / 1 / armed(1)
>                                          | _not_synchonized  / 2 / armed(1)
> _thread_in_native_trans / 0 / disarmed  |
>                                          | _not_synchonized  / 2 / disarmed
> Next safepoint starts:
>                                          | _not_synchronized / 2 / armed(3)
>                                          | _not_synchronized / 3 / armed(3)
>                                          | _synchronizing    / 3 / armed(3)
> _thread_in_native_trans / 0 / armed     |
>                                          | <LOAD JavaThread safepoint id:0>
> <JavaThread goes on-proc>               |
> <STORE loaded safepoint counter(1)>     |
> _thread_in_native_trans / 1 / armed     |
> _thread_blocked         / 1 / armed     |
> <WaitBarrier not armed for 1>           |
>                                          | <LOAD JavaThread thread state 
> id:_thread_blocked>
> _thread_in_native_trans / 1 / armed     |
> _thread_in_native_trans / 0 / armed     |
>                                          | <LOAD JavaThread safepoint id:0>
> 
> A false positive is read.
> 
> When do it the correct the safe matrix looks like:
> State load 1      | Safepoint id | State load 2     | Result
> ##################|##############|##################|#######
> any               | !0/current   | any              | treat all as unsafe
> any               | any          | !state1          | treat all as unsafe
> any               | 0/current    | state1           | suspend flag is safe
> thread_in_native  | 0/current    | thread_in_native | safe
> thread_in_blocked | 0/current    | thread_in_blocked| safe
> !thread_in_blocked
> &&
> !thread_in_native | 0/current    | state1           | unsafe
> 
> The case with blocked/0/blocked I added this comment for:
> 
>   755   // To handle the thread_blocked state on the backedge of the 
> WaitBarrier from
>   756   // previous safepoint and reading the resetted 
> (0/InactiveSafepointCounter) we
>   757   // re-read state after we read thread safepoint id. The 
> JavaThread changes it
>   758   // state before resetting, the second read will either see a 
> different thread
>   759   // state making this an unsafe state or it can see blocked again.
>   760   // When we see blocked twice with a 0 safepoint id, either:
>   761   // - It is normally blocked, e.g. on Mutex, TBIVM.
>   762   // - It was in SS:block(), looped around to SS:block() and is 
> blocked on the WaitBarrier.
>   763   // - It was in SS:block() but now on a Mutex.
>   764   // Either case safe.
> 
> I hope above explains why loading state before and after safepoint id is
> sufficient.
> 
> Passes, with flying colors, t1-5, stress test, KS 24h stress.
> 
> Thanks, Robbin
> 
> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>> Hi all, please review.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>
>> Thanks to Dan for pre-reviewing a lot!
>>
>> Background:
>> ZGC often does very short safepoint operations. For a perspective, in a
>> specJBB2015 run, G1 can have young collection stops lasting about 170 
>> ms. While
>> in the same setup ZGC does 0.2ms to 1.5 ms operations depending on which
>> operation it is. The time it takes to stop and start the JavaThreads 
>> is relative
>> very large to a ZGC safepoint. With an operation that just takes 0.2ms 
>> the
>> overhead of stopping and starting JavaThreads is several times the 
>> operation.
>>
>> High-level functionality change:
>> Serializing the starting over Threads_lock takes time.
>> - Don't wait on Threads_lock use the WaitBarrier.
>> Serializing the stopping over Safepoint_lock takes time.
>> - Let threads stop in parallel, remove Safepoint_lock.
>>
>> Details:
>> JavaThreads have 2 abstract logical states: unsafe or safe.
>> - Safe means the JavaThread will not touch Java heap or VM internal 
>> structures
>>    without doing a transition and block before doing so.
>>          - The safe states are:
>>                  - When polls armed: _thread_in_native and 
>> _thread_blocked.
>>                  - When Threads_lock is held: externally suspended 
>> flag is set.
>>          - VM Thread have polls armed and holds the Threads_lock during a
>>            safepoint.
>> - Unsafe means that either Java heap or VM internal structures can be 
>> accessed
>>    by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>          - All combination that are not safe are unsafe.
>>
>> We cannot start a safepoint until all unsafe threads have transitioned 
>> to a safe
>> state. To make them safe, we arm polls in compiled code and make sure any
>> transition to another unsafe state will be blocked. JavaThreads which 
>> are unsafe
>> with state _thread_in_Java may transition to _thread_in_native without 
>> being
>> blocked, since it just became a safe thread and we can proceed. Any 
>> safe thread
>> may try to transition at any time to an unsafe state, thus coming into 
>> the
>> safepoint blocking code at any moment, e.g., after the safepoint is 
>> over, or
>> even at the beginning of next safepoint.
>>
>> The VMThread cannot tolerate false positives from the JavaThread 
>> thread state
>> because that would mean starting the safepoint without all JavaThreads 
>> being
>> safe. The two locks (Threads_lock and Safepoint_lock) make sure we 
>> never observe
>> false positives from the safepoint blocking code, if we remove them, 
>> how do we
>> handle false positives?
>>
>> By first publishing which barrier tag (safepoint counter) we will call
>> WaitBarrier.wait() with as the threads safepoint id and then change 
>> the state to
>> _thread_blocked, the VMThread can ignore JavaThreads by doing a stable 
>> load of
>> the state. A stable load of the thread state is successful if the thread
>> safepoint id is the same both before and after the load of the state and
>> safepoint id is current or InactiveSafepointCounter. If the stable 
>> load fails,
>> the thread is considered safepoint unsafe. It's no longer enough that 
>> thread is
>> have state _thread_blocked it must also have correct safepoint id 
>> before and
>> after we read the state.
>>
>> Performance:
>> The result of faster safepoints is that the average CPU time for 
>> JavaThreads
>> between safepoints is higher, thus increasing the allocation rate. The 
>> thread
>> that stops first waits shorter time until it gets started. Even the 
>> thread that
>> stops last also have shorter stop since we start them faster. If your
>> application is using a concurrent GC it may need re-tunning since each 
>> java
>> worker thread have an increased CPU time/allocation rate. Often this 
>> means max
>> performance is achieved using slightly less java worker threads than 
>> before.
>> Also the increase allocation rate means shorter time between GC 
>> safepoints.
>> - If you are using a non-concurrent GC, you should see improved 
>> latency and
>>    throughput.
>> - After re-tunning with a concurrent GC throughput should be equal or 
>> better but
>>    with better latency. But bear in mind this is a latency patch, not a
>>    throughput one.
>> With current code a java thread is not to guarantee to run between 
>> safepoint (in
>> theory a java thread can be starved indefinitely), since the VM thread 
>> may
>> re-grab the Threads_locks before it woke up from previous safepoint. 
>> If the
>> GC/VM don't respect MMU (minimum mutator utilization) or if your 
>> machine is very
>> over-provisioned this can happen.
>> The current schema thus re-safepoint quickly if the java threads have not
>> started yet at the cost of latency. Since the new code uses the 
>> WaitBarrier with
>> the safepoint counter, all threads must roll forward to next safepoint by
>> getting at least some CPU time between two safepoints. Meaning MMU 
>> violations
>> are more obvious.
>>
>> Some examples on numbers:
>> - On a 16 strand machine synchronization and 
>> un-synchronization/starting is at
>>    least 3x faster (in non-trivial test). Synchronization ~600 -> 
>> ~100us and
>>    starting ~400->~100us.
>>    (Semaphore path is a bit slower than futex in the WaitBarrier on 
>> Linux).
>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>    synchronization time on 16 strands and ~5% score increase. In this 
>> case the GC
>>    op is 1ms, so we reduce the overhead of synchronization from 100% 
>> to 10%.
>> - specJBB2015 ParGC ~9% increase in critical-jops.
>>
>> Thanks, Robbin