RFR(XL): 8203469: Faster safepoints

Robbin Ehn robbin.ehn at oracle.com
Fri Jan 18 13:45:45 UTC 2019


Hi Patricio,

On 1/18/19 2:00 AM, Patricio Chilano wrote:
> Hi Robbin,
> 
> Nice work! Some minor comments:
> 
> 1) In method SafepointSynchronize::decrement_waiting_to_block(), what's the 
> argument post_if_needed for?

Left-over, removed.

> 
> 2) In method SafepointSynchronize::try_stable_load_state(), why do you have to 
> load the safepoint_id again after loading the thread state? The safepoint_id 
> could change right after you read it the second time so it doesn't seem to be a 
> question of correctness. Since the state is only returned back to the caller 
> when you read a sid equal to InactiveSafepointCounter or to the safepoint_count, 
> couldn't you do something simpler like:
> 
>    uint64_t sid = thread->safepoint_state()->get_safepoint_id();  // Load acquire
> 
>    if (sid == InactiveSafepointCounter || sid == safepoint_count) {
>      *state = thread->thread_state();
>      return true;
>    }
>    return false;
> 
> Or in other words, which problematic scenario are you covering by reading it 
> twice as it is now?

The WaitBarrier is armed with the current safepoint id. If the thread is
blocked on the correct safepoint, the safepoint id loaded in SS::block() is the
current one, and it cannot change while the WaitBarrier is armed for that id.

To separate threads blocked in SS::block() from other blocked threads (since we
do not have a safepoint check when leaving SS::block()), a JavaThread may never
publish _thread_blocked together with a zero thread safepoint id in the
SS::block() code. So we must set the safepoint id before going to blocked, and
leave blocked before resetting (zeroing) the thread safepoint id.
We could do it the other way around, but that would just create another type of
false positive and we would still need to do a 'stable' load.
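
As a sketch, the JavaThread side of that ordering in SS::block() (accessor
names assumed, not the exact patch):

  // Publish the id first, then the state; undo in the opposite order.
  thread->safepoint_state()->set_safepoint_id(safepoint_count); // release store
  thread->set_thread_state(_thread_blocked);                    // release store
  _wait_barrier->wait(safepoint_count);   // park until the VMThread disarms
  thread->set_thread_state(_thread_in_vm);        // leave blocked first...
  thread->safepoint_state()->set_safepoint_id(0); // ...then zero (inactivate) the id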

Normally the stores are seen as:
- JavaThread has non-blocked state + 0 safepoint id.
- Store thread safepoint id (next).
- Store thread state (blocked).
-> waitbarrier

Meaning we read them in reverse order:
- Load state
- Load safepoint id

Since a new safepoint can be started directly after _wait_barrier->wait(),
we can see a thread leaving the previous safepoint, in which case the stores
are seen as:
- Leaving previous waitbarrier.
- JavaThread has blocked state + previous safepoint id.
- Store thread state (non-blocked).
- Store thread safepoint id (0).
- Store thread safepoint id (next).
- Store thread state (blocked). (here it is safe)
-> waitbarrier

Thus the loading side can observe this as:
- Leaving previous waitbarrier.
- JavaThread has blocked state + previous safepoint id. <--- Load state blocked
- Store thread state (non-blocked).
- Store thread safepoint id (0). <---- Load thread safepoint id 0
- Store thread safepoint id (next).
- Store thread state (blocked). (here it is safe)
-> waitbarrier

This is a false positive: we would see blocked with safepoint id 0 (inactive)
and wrongly treat the thread as safe, which is not good.

By loading the thread safepoint id both before and after the state load we can
detect this. We would then load:
- Load safepoint id => previous safepoint id
- Load state        => blocked
- Load safepoint id => previous safepoint id / 0 / next safepoint id

The stable load says that not only must the two ids be the same, they must also
be 0 (inactive) or _current_. Here the first load already returns the previous
safepoint id, so we can say this thread is still unsafe!
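
Putting the two loads together, a sketch of the check (the real
try_stable_load_state may differ in details):

  static bool try_stable_load_state(JavaThreadState* state,
                                    JavaThread* thread,
                                    uint64_t safepoint_count) {
    // First load: id must already be inactive (0) or the current safepoint.
    uint64_t sid = thread->safepoint_state()->get_safepoint_id(); // load acquire
    if (sid != InactiveSafepointCounter && sid != safepoint_count) {
      return false; // e.g. previous safepoint id => treat thread as unsafe
    }
    *state = thread->thread_state();
    OrderAccess::loadload();
    // Second load: the id must not have changed across the state load.
    return sid == thread->safepoint_state()->get_safepoint_id();
  }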

> 
> 3) In method SafepointSynchronize::end, it seems the if-else conditional based 
> on "SafepointMechanism::uses_thread_local_poll()" is executing almost the same 
> code in both cases, except for two asserts in the "else" branch which seem to 
> apply to the "if" one too, a storestore barrier against a full fence (is it 
> needed?), and the actual disarm_local_poll(current) for the "if" case, which maybe 
> could be replaced by an if (_disarm_local_poll_needed) disarm_local_poll(current) 
> statement. (I see that it is like that too in the current safepoint code though.)
> Also, that whole if-else conditional is inside a {} block which was needed 
> because of "MutexLocker mu(Safepoint_lock);" but is not needed anymore.

Refactored this into a disarm method; see the sketch below.
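
A sketch of what that disarm method could look like (hypothetical shape and
names, not the actual v01 code):

  void SafepointSynchronize::disarm_safepoint() {
    if (SafepointMechanism::uses_thread_local_poll()) {
      // Disarm every JavaThread's local poll word/page.
      for (JavaThreadIteratorWithHandle jtiwh; JavaThread* t = jtiwh.next();) {
        SafepointMechanism::disarm_local_poll(t);
      }
    }
    OrderAccess::fence();
    _state = _not_synchronized;
  }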

I'll post v01 to the initial RFR mail; just need some more testing.

Thanks, Robbin

> 
> 
> Thanks!
> Patricio
> 
> On 1/15/19 5:39 AM, Robbin Ehn wrote:
>> Hi all, please review.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
>> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>>
>> Thanks to Dan for pre-reviewing a lot!
>>
>> Background:
>> ZGC often does very short safepoint operations. For perspective, in a
>> specJBB2015 run, G1 can have young collection stops lasting about 170 ms, while
>> in the same setup ZGC does 0.2 ms to 1.5 ms operations, depending on which
>> operation it is. The time it takes to stop and start the JavaThreads is very
>> large relative to a ZGC safepoint. With an operation that takes just 0.2 ms,
>> the overhead of stopping and starting JavaThreads is several times the cost of
>> the operation itself.
>>
>> High-level functionality change:
>> Serializing the starting over the Threads_lock takes time.
>> - Don't wait on the Threads_lock; use the WaitBarrier.
>> Serializing the stopping over the Safepoint_lock takes time.
>> - Let threads stop in parallel; remove the Safepoint_lock.
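>>
>> A minimal sketch of the WaitBarrier pattern this relies on (arm/wait/disarm
>> semantics as assumed here; the exact class API may differ):
>>
>>   // VMThread: arm with the current safepoint counter as the barrier tag.
>>   _wait_barrier->arm(safepoint_count);
>>   // ... stop threads, run the safepoint operation ...
>>   _wait_barrier->disarm();               // wakes all waiters in parallel
>>
>>   // JavaThread in SS::block(): parks only while armed with this tag.
>>   _wait_barrier->wait(safepoint_count);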
>>
>> Details:
>> JavaThreads have two abstract logical states: unsafe or safe.
>> - Safe means the JavaThread will not touch the Java heap or VM internal
>>   structures without doing a transition and blocking before doing so.
>>         - The safe states are:
>>                 - When polls are armed: _thread_in_native and _thread_blocked.
>>                 - When the Threads_lock is held: the externally suspended flag is set.
>>         - The VMThread has polls armed and holds the Threads_lock during a
>>           safepoint.
>> - Unsafe means that either the Java heap or VM internal structures can be
>>   accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>>         - All combinations that are not safe are unsafe.
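>>
>> As an illustration of that classification, the thread-state part could be
>> expressed as (simplified sketch, not code from the patch):
>>
>>   // Safe w.r.t. armed polls; external suspension via the Threads_lock is
>>   // handled separately.
>>   static bool is_safe_state(JavaThreadState state) {
>>     return state == _thread_in_native || state == _thread_blocked;
>>   }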
>>
>> We cannot start a safepoint until all unsafe threads have transitioned to a safe
>> state. To make them safe, we arm polls in compiled code and make sure any
>> transition to another unsafe state will be blocked. A JavaThread which is unsafe
>> with state _thread_in_Java may transition to _thread_in_native without being
>> blocked, since it thereby becomes a safe thread and we can proceed. Any safe
>> thread may try to transition at any time to an unsafe state, thus coming into
>> the safepoint blocking code at any moment, e.g., after the safepoint is over,
>> or even at the beginning of the next safepoint.
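>>
>> For illustration, the shape of such a transition check (a sketch with assumed
>> helper names; the real transition code has more states and asserts):
>>
>>   // Leaving a safe state for an unsafe one must pass a poll check.
>>   void transition_to_unsafe(JavaThread* thread, JavaThreadState to) {
>>     thread->set_thread_state(_thread_in_native_trans); // intermediate state
>>     if (SafepointMechanism::should_block(thread)) {
>>       SafepointMechanism::block_if_requested(thread);  // may park in SS::block()
>>     }
>>     thread->set_thread_state(to);
>>   }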
>>
>> The VMThread cannot tolerate false positives from the JavaThread thread state,
>> because that would mean starting the safepoint without all JavaThreads being
>> safe. The two locks (Threads_lock and Safepoint_lock) make sure we never observe
>> false positives from the safepoint blocking code; if we remove them, how do we
>> handle false positives?
>>
>> By first publishing which barrier tag (safepoint counter) we will call
>> WaitBarrier.wait() with as the thread's safepoint id, and then changing the
>> state to _thread_blocked, the VMThread can ignore JavaThreads by doing a stable
>> load of the state. A stable load of the thread state is successful if the
>> thread safepoint id is the same both before and after the load of the state,
>> and the safepoint id is current or InactiveSafepointCounter. If the stable load
>> fails, the thread is considered safepoint unsafe. It's no longer enough that a
>> thread has state _thread_blocked; it must also have the correct safepoint id
>> before and after we read the state.
>>
>> Performance:
>> The result of faster safepoints is that the average CPU time for JavaThreads
>> between safepoints is higher, thus increasing the allocation rate. The thread
>> that stops first waits a shorter time until it gets started. Even the thread
>> that stops last has a shorter stop, since we start threads faster. If your
>> application is using a concurrent GC it may need re-tuning, since each Java
>> worker thread has an increased CPU time/allocation rate. Often this means max
>> performance is achieved using slightly fewer Java worker threads than before.
>> Also, the increased allocation rate means shorter time between GC safepoints.
>> - If you are using a non-concurrent GC, you should see improved latency and
>>   throughput.
>> - After re-tuning with a concurrent GC, throughput should be equal or better,
>>   but with better latency. But bear in mind this is a latency patch, not a
>>   throughput one.
>> With the current code a Java thread is not guaranteed to run between safepoints
>> (in theory a Java thread can be starved indefinitely), since the VMThread may
>> re-grab the Threads_lock before the thread has woken up from the previous
>> safepoint. This can happen if the GC/VM does not respect MMU (minimum mutator
>> utilization) or if your machine is very over-provisioned.
>> The current scheme can thus re-safepoint quickly if the Java threads have not
>> started yet, at the cost of latency. Since the new code uses the WaitBarrier
>> with the safepoint counter, all threads must roll forward to the next safepoint
>> by getting at least some CPU time between two safepoints, meaning MMU
>> violations are more obvious.
>>
>> Some examples on numbers:
>> - On a 16-strand machine, synchronization and un-synchronization/starting is at
>>   least 3x faster (in a non-trivial test): synchronization ~600 us -> ~100 us
>>   and starting ~400 us -> ~100 us.
>>   (The semaphore path is a bit slower than futex in the WaitBarrier on Linux.)
>> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
>>   synchronization time on 16 strands and a ~5% score increase. In this case the
>>   GC op is 1 ms, so we reduce the overhead of synchronization from 100% to 10%.
>> - specJBB2015 with ParGC: ~9% increase in critical-jops.
>>
>> Thanks, Robbin
> 

