RFR(XL): 8203469: Faster safepoints

David Holmes david.holmes at oracle.com
Wed Jan 23 11:43:42 UTC 2019


On 23/01/2019 9:29 pm, Robbin Ehn wrote:
> Hi David,
> 
> On 2019-01-23 10:42, David Holmes wrote:
>> Hi Robbin,
>>
>> Thanks for all the work on this! This is looking really good.
> 
> Thanks!
> 
>>
>> I have one concern and that is the fact that the WaitBarrier only uses 
>> an int as a tag, but the _safepoint_counter is a uint64_t. It seems to 
>> me that once the _safepoint_counter rolls over to needing 33 bits, 
>> casting it to an int for the tag is going to give the disallowed zero 
>> value. ??
> 
> I had the same concern, but since a safepoint only happens during odd 
> counters, we only arm the WaitBarrier with odd numbers.
> (even == no safepoint, odd == active safepoint)
> There is an assert at ~L520 checking that the safepoint counter was odd 
> during the safepoint.
> Roll-over manually tested.

Ah I see. Thanks for clarifying.

David
-----

>>
>> Specific comments below.
>>
>> Oh and can you add all the high-level description in your initial RFR 
>> email to the bug report please. Thanks.
> 
> Fixed, updated with the correct text for the stable load.
> 
>>
>> On 23/01/2019 1:39 am, Robbin Ehn wrote:
>>> Hi all, here is v01 and v02.
>>>
>>> v01 contains update after comments from list:
>>> http://cr.openjdk.java.net/~rehn/8203469/v01/
>>> http://cr.openjdk.java.net/~rehn/8203469/v01/inc/
>>>
>>> v02 contains a bug fix, explained below:
>>> http://cr.openjdk.java.net/~rehn/8203469/v02/
>>> http://cr.openjdk.java.net/~rehn/8203469/v02/inc/
>>
>> Minor comments:
>>
>> src/hotspot/share/runtime/safepoint.cpp
>>
>> check_thread_safepoint_state needs a better name - what is it checking 
>> the state for?
> 
> Changed to thread_not_running.
> 
>>
>> ---
>>
>>   368   Thread* myThread = Thread::current();
>>   369   assert(myThread->is_VM_thread(), "Only VM thread may execute a 
>> safepoint");
>>
>>   558   DEBUG_ONLY(Thread* myThread = Thread::current();)
>>   559   assert(myThread->is_VM_thread(), "Only VM thread can execute a 
>> safepoint");
>>
>> You don't need the myThread local variables.
> 
> Removed
> 
> Sending out a v04 soon.
> 
> Thanks, Robbin
> 
>>
>>
>> That's it! :)  (Thanks to Dan for tackling the updates to the 
>> commentary ;-) ).
>>
>>
>> Thanks,
>> David
>> -----
>>
>>
>>> Patricio had some good questions about try_stable_load_state.
>>> In previous internal versions I did the stable load by loading the 
>>> thread state before and after the safepoint id. For some reason, during 
>>> a refactoring, I changed it to the reverse, which is incorrect. 
>>> Consider the following:
>>>
>>> JavaThread: state / safepoint id / poll | VMThread: global state / safepoint counter / WaitBarrier
>>> ########################################|#####################################################
>>> _thread_in_native       / 0 / disarmed  | _not_synchronized / 0 / disarmed
>>>                                         | _not_synchronized / 0 / armed(1)
>>>                                         | _not_synchronized / 1 / armed(1)
>>>                                         | _synchronizing    / 1 / armed(1)
>>> _thread_in_native       / 0 / armed     |
>>>                                         | <LOAD JavaThread safepoint id:0>
>>>                                         | <LOAD JavaThread thread state id:_thread_in_native>
>>>                                         | <LOAD JavaThread safepoint id:0>
>>>                                         | _synchronized     / 1 / armed(1)
>>> <JavaThread transition to VM>           |
>>> _thread_in_native_trans / 0 / armed     |
>>> <LOAD safepoint counter(1)>             |
>>> <JavaThread goes off-proc>              |
>>>                                         | _not_synchronized / 1 / armed(1)
>>>                                         | _not_synchronized / 2 / armed(1)
>>> _thread_in_native_trans / 0 / disarmed  |
>>>                                         | _not_synchronized / 2 / disarmed
>>> Next safepoint starts:
>>>                                         | _not_synchronized / 2 / armed(3)
>>>                                         | _not_synchronized / 3 / armed(3)
>>>                                         | _synchronizing    / 3 / armed(3)
>>> _thread_in_native_trans / 0 / armed     |
>>>                                         | <LOAD JavaThread safepoint id:0>
>>> <JavaThread goes on-proc>               |
>>> <STORE loaded safepoint counter(1)>     |
>>> _thread_in_native_trans / 1 / armed     |
>>> _thread_blocked         / 1 / armed     |
>>> <WaitBarrier not armed for 1>           |
>>>                                         | <LOAD JavaThread thread state id:_thread_blocked>
>>> _thread_in_native_trans / 1 / armed     |
>>> _thread_in_native_trans / 0 / armed     |
>>>                                         | <LOAD JavaThread safepoint id:0>
>>>
>>> A false positive is read.
>>>
>>> When done in the correct order, the safe matrix looks like:
>>> State load 1      | Safepoint id | State load 2     | Result
>>> ##################|##############|##################|#######
>>> any               | !0/current   | any              | treat all as unsafe
>>> any               | any          | !state1          | treat all as unsafe
>>> any               | 0/current    | state1           | suspend flag is safe
>>> thread_in_native  | 0/current    | thread_in_native | safe
>>> thread_blocked    | 0/current    | thread_blocked   | safe
>>> !thread_blocked
>>> &&
>>> !thread_in_native | 0/current    | state1           | unsafe
>>>
>>> The case with blocked/0/blocked I added this comment for:
>>>
>>>   755   // To handle the thread_blocked state on the backedge of the WaitBarrier from
>>>   756   // the previous safepoint and reading the reset (0/InactiveSafepointCounter) id, we
>>>   757   // re-read the state after we read the thread safepoint id. The JavaThread changes its
>>>   758   // state before resetting, so the second read will either see a different thread
>>>   759   // state, making this an unsafe state, or it can see blocked again.
>>>   760   // When we see blocked twice with a 0 safepoint id, either:
>>>   761   // - It is normally blocked, e.g. on a Mutex, TBIVM.
>>>   762   // - It was in SS::block(), looped around to SS::block() and is blocked on the WaitBarrier.
>>>   763   // - It was in SS::block() but is now on a Mutex.
>>>   764   // Either case is safe.
>>>
>>> I hope the above explains why loading the state before and after the 
>>> safepoint id is sufficient.
>>>
>>> Passes, with flying colors, t1-5, stress test, KS 24h stress.
>>>
>>> Thanks, Robbin
>>>
>>> On 1/15/19 11:39 AM, Robbin Ehn wrote:
>>>> Hi all, please review.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>>>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>>>
>>>> Thanks to Dan for pre-reviewing a lot!
>>>>
>>>> Background:
>>>> ZGC often does very short safepoint operations. For perspective, in a
>>>> specJBB2015 run, G1 can have young collection stops lasting about 
>>>> 170 ms, while in the same setup ZGC does 0.2 ms to 1.5 ms operations, 
>>>> depending on which operation it is. The time it takes to stop and 
>>>> start the JavaThreads is very large relative to a ZGC safepoint. With 
>>>> an operation that takes just 0.2 ms, the overhead of stopping and 
>>>> starting JavaThreads is several times the operation itself.
>>>>
>>>> High-level functionality change:
>>>> Serializing the starting over the Threads_lock takes time.
>>>> - Don't wait on the Threads_lock; use the WaitBarrier.
>>>> Serializing the stopping over the Safepoint_lock takes time.
>>>> - Let threads stop in parallel; remove the Safepoint_lock.
>>>>
>>>> Details:
>>>> JavaThreads have 2 abstract logical states: unsafe or safe.
>>>> - Safe means the JavaThread will not touch the Java heap or VM 
>>>> internal structures without doing a transition and blocking before 
>>>> doing so.
>>>>          - The safe states are:
>>>>                  - When polls are armed: _thread_in_native and 
>>>> _thread_blocked.
>>>>                  - When the Threads_lock is held: the externally 
>>>> suspended flag is set.
>>>>          - The VMThread has polls armed and holds the Threads_lock 
>>>> during a safepoint.
>>>> - Unsafe means that either the Java heap or VM internal structures can 
>>>> be accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>>>          - All combinations that are not safe are unsafe.
>>>>
>>>> We cannot start a safepoint until all unsafe threads have transitioned 
>>>> to a safe state. To make them safe, we arm polls in compiled code and 
>>>> make sure any transition to another unsafe state will be blocked. 
>>>> JavaThreads which are unsafe with state _thread_in_Java may transition 
>>>> to _thread_in_native without being blocked, since such a thread just 
>>>> became safe and we can proceed. Any safe thread may try to transition 
>>>> at any time to an unsafe state, thus coming into the safepoint 
>>>> blocking code at any moment, e.g., after the safepoint is over, or 
>>>> even at the beginning of the next safepoint.
>>>>
>>>> The VMThread cannot tolerate false positives from the JavaThread 
>>>> thread state, because that would mean starting the safepoint without 
>>>> all JavaThreads being safe. The two locks (Threads_lock and 
>>>> Safepoint_lock) make sure we never observe false positives in the 
>>>> safepoint blocking code; if we remove them, how do we handle false 
>>>> positives?
>>>>
>>>> By first publishing the barrier tag (safepoint counter) we will call 
>>>> WaitBarrier.wait() with as the thread's safepoint id, and only then 
>>>> changing the state to _thread_blocked, the VMThread can ignore 
>>>> JavaThreads by doing a stable load of the state. A stable load of the 
>>>> thread state is successful if the thread's safepoint id is the same 
>>>> both before and after the load of the state, and the safepoint id is 
>>>> the current one or InactiveSafepointCounter. If the stable load fails, 
>>>> the thread is considered safepoint unsafe. It is no longer enough that 
>>>> a thread has state _thread_blocked; it must also have the correct 
>>>> safepoint id before and after we read the state.
>>>>
>>>> Performance:
>>>> The result of faster safepoints is that the average CPU time for 
>>>> JavaThreads between safepoints is higher, thus increasing the 
>>>> allocation rate. The thread that stops first waits a shorter time 
>>>> until it gets started. Even the thread that stops last has a shorter 
>>>> stop, since we start threads faster. If your application is using a 
>>>> concurrent GC it may need re-tuning, since each Java worker thread has 
>>>> an increased CPU time/allocation rate. Often this means max 
>>>> performance is achieved using slightly fewer Java worker threads than 
>>>> before. Also, the increased allocation rate means a shorter time 
>>>> between GC safepoints.
>>>> - If you are using a non-concurrent GC, you should see improved 
>>>> latency and throughput.
>>>> - After re-tuning with a concurrent GC, throughput should be equal or 
>>>> better, but with better latency. But bear in mind this is a latency 
>>>> patch, not a throughput one.
>>>> With the current code a Java thread is not guaranteed to run between 
>>>> safepoints (in theory a Java thread can be starved indefinitely), 
>>>> since the VM thread may re-grab the Threads_lock before the Java 
>>>> thread has woken up from the previous safepoint. If the GC/VM don't 
>>>> respect the MMU (minimum mutator utilization) or if your machine is 
>>>> very over-provisioned, this can happen.
>>>> The current scheme thus re-safepoints quickly if the Java threads have 
>>>> not started yet, at the cost of latency. Since the new code uses the 
>>>> WaitBarrier with the safepoint counter, all threads must roll forward 
>>>> to the next safepoint by getting at least some CPU time between two 
>>>> safepoints, meaning MMU violations are more obvious.
>>>>
>>>> Some examples of numbers:
>>>> - On a 16-strand machine, synchronization and 
>>>> un-synchronization/starting is at least 3x faster (in a non-trivial 
>>>> test): synchronization ~600 us -> ~100 us and starting 
>>>> ~400 us -> ~100 us.
>>>>   (The Semaphore path is a bit slower than futex in the WaitBarrier on 
>>>> Linux.)
>>>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster 
>>>> synchronization time on 16 strands and a ~5% score increase. In this 
>>>> case the GC op is 1 ms, so we reduce the overhead of synchronization 
>>>> from 100% to 10%.
>>>> - specJBB2015 ParGC: ~9% increase in critical-jops.
>>>>
>>>> Thanks, Robbin

