RFR(XL): 8203469: Faster safepoints

Robbin Ehn robbin.ehn at oracle.com
Wed Jan 23 11:29:00 UTC 2019


Hi David,

On 2019-01-23 10:42, David Holmes wrote:
> Hi Robbin,
> 
> Thanks for all the work on this! This is looking really good.

Thanks!

> 
> I have one concern and that is the fact the WaitBarrier only uses an int as a 
> tag, but the _safepoint_counter is a uint64_t. Seems to me that once the 
> _safepoint_counter rolls over to need 33-bits then casting it to an int for the 
> tag is going to give the dis-allowed zero value. ??

I had the same concern, but since a safepoint only happens during odd counters, we 
only arm the WaitBarrier with odd numbers.
(even == no safepoint, odd == active safepoint)
There is an assert at ~L520 checking that the safepoint counter was odd during 
the safepoint.
Roll-over was manually tested.
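To illustrate (a hypothetical helper, not the patch's actual code): because the barrier is only armed with odd counter values, truncating the 64-bit counter to an int preserves the low bit and so can never produce the disallowed zero tag, even at roll-over:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: the WaitBarrier is only ever armed with odd counter values
// (even == no safepoint, odd == active safepoint). Truncation to int
// preserves the low bit, so an odd 64-bit counter never yields 0.
static int barrier_tag(uint64_t safepoint_counter) {
  assert((safepoint_counter & 1) == 1 && "barrier armed only with odd counters");
  return static_cast<int>(safepoint_counter);  // low bit survives truncation
}
```

For example, at the 33-bit roll-over a counter of 0x100000001 truncates to tag 1, not 0.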

> 
> Specific comments below.
> 
> Oh and can you add all the high-level description in your initial RFR email to 
> the bug report please. Thanks.

Fixed, updated with the correct text for the stable load.

> 
> On 23/01/2019 1:39 am, Robbin Ehn wrote:
>> Hi all, here is v01 and v02.
>>
>> v01 contains update after comments from list:
>> http://cr.openjdk.java.net/~rehn/8203469/v01/
>> http://cr.openjdk.java.net/~rehn/8203469/v01/inc/
>>
>> v02 contains a bug fix, explained below:
>> http://cr.openjdk.java.net/~rehn/8203469/v02/
>> http://cr.openjdk.java.net/~rehn/8203469/v02/inc/
> 
> Minor comments:
> 
> src/hotspot/share/runtime/safepoint.cpp
> 
> check_thread_safepoint_state needs a better name - what is it checking the state 
> for?

Changed to thread_not_running.

> 
> ---
> 
>   368   Thread* myThread = Thread::current();
>   369   assert(myThread->is_VM_thread(), "Only VM thread may execute a safepoint");
> 
>   558   DEBUG_ONLY(Thread* myThread = Thread::current();)
>   559   assert(myThread->is_VM_thread(), "Only VM thread can execute a safepoint");
> 
> You don't need the myThread local variables.

Removed

Sending out a v04 soon.

Thanks, Robbin

> 
> 
> That's it! :)  (Thanks to Dan for tackling the updates to the commentary ;-) ).
> 
> 
> Thanks,
> David
> -----
> 
> 
>> Patricio had some good questions about try_stable_load_state.
>> In previous internal versions I did the stable load by loading the thread 
>> state before and after the safepoint id. During a refactoring I changed it to 
>> the reverse for some reason, which is incorrect. Consider the following:
>>
>> JavaThread: state / safepoint id / poll | VMThread: global state / safepoint counter / WaitBarrier
>> ########################################|########################################################
>> _thread_in_native       / 0 / disarmed  | _not_synchronized / 0 / disarmed
>>                                         | _not_synchronized / 0 / armed(1)
>>                                         | _not_synchronized / 1 / armed(1)
>>                                         | _synchronizing    / 1 / armed(1)
>> _thread_in_native       / 0 / armed     |
>>                                         | <LOAD JavaThread safepoint id:0>
>>                                         | <LOAD JavaThread thread state id:_thread_in_native>
>>                                         | <LOAD JavaThread safepoint id:0>
>>                                         | _synchronized     / 1 / armed(1)
>> <JavaThread transition to VM>           |
>> _thread_in_native_trans / 0 / armed     |
>> <LOAD safepoint counter(1)>             |
>> <JavaThread goes off-proc>              |
>>                                         | _not_synchronized / 1 / armed(1)
>>                                         | _not_synchronized / 2 / armed(1)
>> _thread_in_native_trans / 0 / disarmed  |
>>                                         | _not_synchronized / 2 / disarmed
>> Next safepoint starts:
>>                                         | _not_synchronized / 2 / armed(3)
>>                                         | _not_synchronized / 3 / armed(3)
>>                                         | _synchronizing    / 3 / armed(3)
>> _thread_in_native_trans / 0 / armed     |
>>                                         | <LOAD JavaThread safepoint id:0>
>> <JavaThread goes on-proc>               |
>> <STORE loaded safepoint counter(1)>     |
>> _thread_in_native_trans / 1 / armed     |
>> _thread_blocked         / 1 / armed     |
>> <WaitBarrier not armed for 1>           |
>>                                         | <LOAD JavaThread thread state id:_thread_blocked>
>> _thread_in_native_trans / 1 / armed     |
>> _thread_in_native_trans / 0 / armed     |
>>                                         | <LOAD JavaThread safepoint id:0>
>>
>> A false positive is read.
>>
>> When done in the correct order, the safe matrix looks like:
>> State load 1      | Safepoint id | State load 2     | Result
>> ##################|##############|##################|#######
>> any               | !0/current   | any              | treat all as unsafe
>> any               | any          | !state1          | treat all as unsafe
>> any               | 0/current    | state1           | suspend flag is safe
>> thread_in_native  | 0/current    | thread_in_native | safe
>> thread_blocked    | 0/current    | thread_blocked   | safe
>> !thread_blocked
>> &&
>> !thread_in_native | 0/current    | state1           | unsafe
>>
>> For the blocked/0/blocked case I added this comment:
>>
>>   755   // To handle the thread_blocked state on the backedge of the WaitBarrier from
>>   756   // the previous safepoint and reading the reset (0/InactiveSafepointCounter) id, we
>>   757   // re-read the state after we read the thread safepoint id. The JavaThread changes its
>>   758   // state before resetting, so the second read will either see a different thread
>>   759   // state, making this an unsafe state, or it can see blocked again.
>>   760   // When we see blocked twice with a 0 safepoint id, either:
>>   761   // - It is normally blocked, e.g. on a Mutex, TBIVM.
>>   762   // - It was in SS::block(), looped around to SS::block() and is blocked on the WaitBarrier.
>>   763   // - It was in SS::block() but is now on a Mutex.
>>   764   // Either case is safe.
>>
>> I hope above explains why loading state before and after safepoint id is
>> sufficient.
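As a sketch of the matrix above (hypothetical simplified types; the real check lives in safepoint.cpp's try_stable_load_state, and the suspend-flag row is omitted for brevity):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified thread states for illustration only.
enum ThreadState { _thread_in_Java, _thread_in_vm, _thread_in_native,
                   _thread_blocked, _thread_in_native_trans };

const uint64_t InactiveSafepointCounter = 0;

// Encodes the matrix: the VMThread loads the thread state, then the
// thread's safepoint id, then the thread state again. A stale id or a
// state mismatch is treated as unsafe, so no false positive slips through.
bool stable_load_is_safe(ThreadState state1, uint64_t safepoint_id,
                         ThreadState state2, uint64_t current_id) {
  if (safepoint_id != InactiveSafepointCounter &&
      safepoint_id != current_id) {
    return false;   // any | !0/current | any -> treat as unsafe
  }
  if (state1 != state2) {
    return false;   // any | any | !state1 -> treat as unsafe
  }
  // native | 0/current | native -> safe, and likewise for blocked.
  return state1 == _thread_in_native || state1 == _thread_blocked;
}
```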
>>
>> Passes, with flying colors, t1-5, stress test, KS 24h stress.
>>
>> Thanks, Robbin
>>
>> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>>> Hi all, please review.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>>
>>> Thanks to Dan for pre-reviewing a lot!
>>>
>>> Background:
>>> ZGC often does very short safepoint operations. For perspective, in a
>>> specJBB2015 run, G1 can have young collection stops lasting about 170 ms,
>>> while in the same setup ZGC does 0.2 ms to 1.5 ms operations, depending on
>>> which operation it is. The time it takes to stop and start the JavaThreads is
>>> very large relative to a ZGC safepoint. With an operation that takes just
>>> 0.2 ms, the overhead of stopping and starting the JavaThreads is several
>>> times the cost of the operation itself.
>>>
>>> High-level functionality change:
>>> Serializing the starting over the Threads_lock takes time.
>>> - Don't wait on the Threads_lock; use the WaitBarrier instead.
>>> Serializing the stopping over the Safepoint_lock takes time.
>>> - Let threads stop in parallel; remove the Safepoint_lock.
>>>
>>> Details:
>>> JavaThreads have two abstract logical states: unsafe or safe.
>>> - Safe means the JavaThread will not touch the Java heap or VM internal
>>>    structures without doing a transition and blocking before doing so.
>>>          - The safe states are:
>>>                  - When polls are armed: _thread_in_native and _thread_blocked.
>>>                  - When the Threads_lock is held: the externally suspended
>>>                    flag is set.
>>>          - The VMThread has polls armed and holds the Threads_lock during a
>>>            safepoint.
>>> - Unsafe means that either the Java heap or VM internal structures can be
>>>    accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>>          - All combinations that are not safe are unsafe.
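A tiny sketch of the classification above (a hypothetical helper, not actual HotSpot code):

```cpp
// Hypothetical simplified states; the real definitions live in the
// HotSpot runtime.
enum ThreadState { _thread_in_Java, _thread_in_vm, _thread_in_native,
                   _thread_blocked };

// A JavaThread is safepoint-safe only if it cannot touch the Java heap or
// VM structures without first transitioning (and blocking on the armed poll).
bool is_safepoint_safe(ThreadState state, bool polls_armed,
                       bool externally_suspended) {
  if (polls_armed &&
      (state == _thread_in_native || state == _thread_blocked)) {
    return true;   // safe: any transition out will hit the armed poll
  }
  if (externally_suspended) {
    return true;   // safe: suspend flag is stable while Threads_lock is held
  }
  return false;    // all other combinations are unsafe
}
```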
>>>
>>> We cannot start a safepoint until all unsafe threads have transitioned to a safe
>>> state. To make them safe, we arm polls in compiled code and make sure any
>>> transition to another unsafe state will be blocked. JavaThreads which are unsafe
>>> with state _thread_in_Java may transition to _thread_in_native without being
>>> blocked, since it just became a safe thread and we can proceed. Any safe thread
>>> may try to transition at any time to an unsafe state, thus coming into the
>>> safepoint blocking code at any moment, e.g., after the safepoint is over, or
>>> even at the beginning of next safepoint.
>>>
>>> The VMThread cannot tolerate false positives from the JavaThread thread state
>>> because that would mean starting the safepoint without all JavaThreads being
>>> safe. The two locks (Threads_lock and Safepoint_lock) make sure we never observe
>>> false positives from the safepoint blocking code, if we remove them, how do we
>>> handle false positives?
>>>
>>> By first publishing which barrier tag (safepoint counter) we will call
>>> WaitBarrier.wait() with as the thread's safepoint id, and then changing the
>>> state to _thread_blocked, the VMThread can ignore JavaThreads by doing a
>>> stable load of the state. A stable load of the thread state is successful if
>>> the thread's safepoint id is the same both before and after the load of the
>>> state, and the safepoint id is current or InactiveSafepointCounter. If the
>>> stable load fails, the thread is considered safepoint unsafe. It is no longer
>>> enough that the thread has state _thread_blocked; it must also have the
>>> correct safepoint id before and after we read the state.
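The JavaThread-side publication order described above can be sketched like this (field and function names are hypothetical, not the patch's actual code):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: the JavaThread publishes the barrier tag it will
// wait on *before* changing its state, so the VMThread's stable load can
// tie a _thread_blocked observation to a specific safepoint id.
struct JavaThreadSketch {
  std::atomic<uint64_t> safepoint_id{0};
  std::atomic<bool>     blocked{false};
};

void publish_then_block(JavaThreadSketch& t, uint64_t safepoint_counter) {
  // 1. Publish which barrier tag (safepoint counter) we will wait on.
  t.safepoint_id.store(safepoint_counter, std::memory_order_release);
  // 2. Only then become blocked (_thread_blocked in the real code).
  t.blocked.store(true, std::memory_order_release);
  // 3. WaitBarrier::wait(static_cast<int>(safepoint_counter)) would follow.
}
```

With this order, a VMThread that sees the blocked state with a matching id knows the thread is waiting on the current barrier, not a stale one.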
>>>
>>> Performance:
>>> The result of faster safepoints is that the average CPU time for JavaThreads
>>> between safepoints is higher, thus increasing the allocation rate. The thread
>>> that stops first waits a shorter time until it gets started, and even the
>>> thread that stops last has a shorter stop since we start the threads faster.
>>> If your application uses a concurrent GC it may need re-tuning, since each
>>> Java worker thread has an increased CPU time/allocation rate. Often this
>>> means max performance is achieved using slightly fewer Java worker threads
>>> than before. Also, the increased allocation rate means a shorter time between
>>> GC safepoints.
>>> - If you are using a non-concurrent GC, you should see improved latency and
>>>    throughput.
>>> - After re-tuning with a concurrent GC, throughput should be equal or better,
>>>    but with better latency. Bear in mind this is a latency patch, not a
>>>    throughput one.
>>> With the current code a Java thread is not guaranteed to run between
>>> safepoints (in theory a Java thread can be starved indefinitely), since the
>>> VMThread may re-grab the Threads_lock before the Java thread has woken up
>>> from the previous safepoint. This can happen if the GC/VM do not respect MMU
>>> (minimum mutator utilization) or if your machine is very over-provisioned.
>>> The current scheme thus re-safepoints quickly if the Java threads have not
>>> started yet, at the cost of latency. Since the new code uses the WaitBarrier
>>> with the safepoint counter, all threads must roll forward to the next
>>> safepoint by getting at least some CPU time between two safepoints, meaning
>>> MMU violations are more obvious.
>>>
>>> Some examples on numbers:
>>> - On a 16 strand machine, synchronization and un-synchronization/starting is
>>>    at least 3x faster (in a non-trivial test): synchronization ~600 us ->
>>>    ~100 us and starting ~400 us -> ~100 us.
>>>    (The Semaphore path is a bit slower than futex in the WaitBarrier on Linux.)
>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>>    synchronization time on 16 strands and a ~5% score increase. In this case
>>>    the GC op is 1 ms, so we reduce the overhead of synchronization from 100%
>>>    to 10%.
>>> - specJBB2015 ParGC: ~9% increase in critical-jops.
>>>
>>> Thanks, Robbin

