RFR(XL): 8203469: Faster safepoints
Patricio Chilano
patricio.chilano.mateo at oracle.com
Fri Jan 18 01:00:55 UTC 2019
Hi Robbin,
Nice work! Some minor comments:
1) In method SafepointSynchronize::decrement_waiting_to_block(), what's
the argument post_if_needed for?
2) In method SafepointSynchronize::try_stable_load_state(), why do you
have to load the safepoint_id again after loading the thread state? The
safepoint_id could change right after you read it the second time, so
it doesn't seem to be a question of correctness. Since the state is
only returned to the caller when you read a sid equal to
InactiveSafepointCounter or to safepoint_count, couldn't you do
something simpler like:
  uint64_t sid = thread->safepoint_state()->get_safepoint_id(); // load acquire
  if (sid == InactiveSafepointCounter || sid == safepoint_count) {
    *state = thread->thread_state();
    return true;
  }
  return false;
Or in other words, which problematic scenario are you covering by
reading it twice as it is now?
3) In method SafepointSynchronize::end, it seems the if-else conditional
based on "SafepointMechanism::uses_thread_local_poll()" executes almost
the same code in both cases, except for: two asserts in the "else"
branch which seem to apply to the "if" one too, a storestore barrier
versus a full fence (is it needed?), and the actual
disarm_local_poll(current) call for the "if" case, which maybe could be
replaced by an "if (_disarm_local_poll_needed) disarm_local_poll(current)"
statement. (I see that it is like that in the current safepoint code
too, though.)
Also, that whole if-else conditional is inside a {} block which was
needed because of "MutexLocker mu(Safepoint_lock);" but is not needed
anymore.
Thanks!
Patricio
On 1/15/19 5:39 AM, Robbin Ehn wrote:
> Hi all, please review.
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8203469
> Code: http://cr.openjdk.java.net/~rehn/8203469/v00/webrev/
>
> Thanks to Dan for pre-reviewing a lot!
>
> Background:
> ZGC often does very short safepoint operations. For a perspective, in a
> specJBB2015 run, G1 can have young collection stops lasting about 170
> ms. While
> in the same setup ZGC does 0.2ms to 1.5 ms operations depending on which
> operation it is. The time it takes to stop and start the JavaThreads
> is relatively large compared to a ZGC safepoint. With an operation
> that takes just 0.2ms, the overhead of stopping and starting the
> JavaThreads is several times the operation itself.
>
> High-level functionality change:
> Serializing the starting over Threads_lock takes time.
> - Don't wait on the Threads_lock; use the WaitBarrier instead.
> Serializing the stopping over Safepoint_lock takes time.
> - Let threads stop in parallel, remove Safepoint_lock.
>
> Details:
> JavaThreads have two abstract logical states: unsafe or safe.
> - Safe means the JavaThread will not touch the Java heap or VM
>   internal structures without doing a transition and blocking before
>   doing so.
>   - The safe states are:
>     - When polls are armed: _thread_in_native and _thread_blocked.
>     - When the Threads_lock is held: the externally suspended flag is
>       set.
>   - The VMThread has polls armed and holds the Threads_lock during a
>     safepoint.
> - Unsafe means that either the Java heap or VM internal structures can
>   be accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
>   - All combinations that are not safe are unsafe.
>
> We cannot start a safepoint until all unsafe threads have transitioned
> to a safe
> state. To make them safe, we arm polls in compiled code and make sure any
> transition to another unsafe state will be blocked. JavaThreads which
> are unsafe
> with state _thread_in_Java may transition to _thread_in_native without
> being
> blocked, since the thread just became safe and we can proceed. Any
> safe thread
> may try to transition at any time to an unsafe state, thus coming into
> the
> safepoint blocking code at any moment, e.g., after the safepoint is
> over, or
> even at the beginning of next safepoint.
>
> The VMThread cannot tolerate false positives from the JavaThread
> thread state
> because that would mean starting the safepoint without all JavaThreads
> being
> safe. The two locks (Threads_lock and Safepoint_lock) make sure we
> never observe
> false positives from the safepoint blocking code, if we remove them,
> how do we
> handle false positives?
>
> By first publishing which barrier tag (safepoint counter) we will call
> WaitBarrier.wait() with as the thread's safepoint id, and then
> changing the state to _thread_blocked, the VMThread can ignore
> JavaThreads by doing a stable load of the state. A stable load of the
> thread state is successful if the thread's safepoint id is the same
> both before and after the load of the state, and the safepoint id is
> the current one or InactiveSafepointCounter. If the stable load fails,
> the thread is considered safepoint unsafe. It's no longer enough that
> the thread has state _thread_blocked; it must also have the correct
> safepoint id before and after we read the state.
>
> Performance:
> The result of faster safepoints is that the average CPU time for
> JavaThreads between safepoints is higher, thus increasing the
> allocation rate. The thread that stops first waits a shorter time
> until it gets started. Even the thread that stops last has a shorter
> stop since we start them faster. If your application is using a
> concurrent GC it may need re-tuning, since each Java worker thread has
> an increased CPU time/allocation rate. Often this means max
> performance is achieved using slightly fewer Java worker threads than
> before. Also, the increased allocation rate means shorter time between
> GC safepoints.
> - If you are using a non-concurrent GC, you should see improved
> latency and
> throughput.
> - After re-tuning with a concurrent GC, throughput should be equal or
>   better, with better latency. But bear in mind this is a latency
>   patch, not a throughput one.
> With the current code a Java thread is not guaranteed to run between
> safepoints (in theory a Java thread can be starved indefinitely),
> since the VMThread may re-grab the Threads_lock before the thread
> wakes up from the previous safepoint. This can happen if the GC/VM
> doesn't respect MMU (minimum mutator utilization) or if your machine
> is very over-provisioned.
> The current scheme thus re-safepoints quickly if the Java threads have
> not started yet, at the cost of latency. Since the new code uses the
> WaitBarrier with the safepoint counter, all threads must roll forward
> to the next safepoint by getting at least some CPU time between two
> safepoints, meaning MMU violations are more obvious.
>
> Some examples on numbers:
> - On a 16 strand machine, synchronization and
>   un-synchronization/starting is at least 3x faster (in a non-trivial
>   test): synchronization ~600us -> ~100us and starting ~400us ->
>   ~100us.
>   (The semaphore path is a bit slower than the futex in the
>   WaitBarrier on Linux.)
> - SPECjvm2008 serial (untuned G1) gives 10x (1 ms vs 100 us) faster
> synchronization time on 16 strands and ~5% score increase. In this
> case the GC
> op is 1ms, so we reduce the overhead of synchronization from 100% to
> 10%.
> - specJBB2015 ParGC ~9% increase in critical-jops.
>
> Thanks, Robbin