Re: Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Erik Österlund erik.osterlund at oracle.com
Fri Apr 16 09:48:49 UTC 2021


Hi Man,

On 2021-03-31 04:43, Man Cao wrote:
> Hi all,
> 
> I finally managed to allocate more time to make progress on this, and 
> resolved most issues since the last discussion.
> I've updated the description in 
> https://bugs.openjdk.java.net/browse/JDK-8236485, and the current 
> prototype is the HEAD commit at 
> https://github.com/caoman/jdk/tree/g1EpochSync.
> Notable changes include:
> - The protocol uses async handshakes from JDK-8238761 to resolve the 
> blocking issue with normal handshakes.
> - To support asynchronous refinement under async handshakes, I added a 
> _deferred global queue to G1DirtyCardQueueSet. Buffers rarely get 
> enqueued to _deferred at run-time.
> - The async handshake only executes for a subset of threads.
> 
> I have a couple of questions:
> 
> 1. Code review and patch size.
> Should I start a pull request for this change, so it is easier to give 
> feedback?
> 
> What is the recommended approach to deal with large changes? Currently 
> the patch is about 1200 lines, without changing the write barrier itself 
> (JDK-8226731).
> Since pushing only this patch will add some overhead to refinement, but 
> not bring any performance improvement from removing the write barrier's 
> fence, do you recommend I also implement the write barrier change in the 
> same patch?
> Keeping the epoch sync patch separate from the write barrier patch has 
> some benefit for testing, in case the epoch patch introduces any bugs.
> Currently the new code is mostly guarded by a flag 
> -XX:+G1TestEpochSyncInConcRefinement, and it will be removed after the 
> write barrier change. It could be used as an emergency flag to 
> work around bugs, instead of backing out the entire change. We 
> probably cannot have such a flag if we bundle the changes in one patch 
> (it's too ugly to have such a flag in the interpreter and compilers).

Since the second part is so small, I'd definitely contribute this as a 
single patch, where the complexity vs. gain trade-off is easier to evaluate.

> 2. Checking if a remote thread is in _thread_in_Java state.
> eosterlund@ pointed out it was incorrect to do 
> JavaThread::thread_state() == _thread_in_Java.
> I looked into thread state transitions, and revised the check to also 
> compare with _thread_in_native_trans and _thread_in_vm_trans.
> I think it is correct now for the purpose of epoch synchronization; 
> that is, it never "misses" a remote thread that is actually in the 
> in_Java state.
> A detailed comment is here: 
> https://github.com/caoman/jdk/blob/2047ecefceb074e80d73e0d521d64a220fdc5779/src/hotspot/share/gc/g1/g1EpochSynchronizer.cpp#L67-L90.
> *Erik, could you take a look and decide if it is correct? If it is still 
> incorrect, could you advise a proper way to do this?*

This is still incorrect. Let's start with the why and then what to do 
about it.

A thread that has not armed the poll value of a remote JavaThread may 
not read that thread's state and draw any conclusions from it 
whatsoever. Even if it had, remote thread states can typically only be 
used to prove that a thread is blocked or in native, not to prove it 
is in Java (excluding the vm state). The way the different transitions 
are fenced reflects this, so only that much can be determined. The 
reasoning is that the states exist mostly for safepoint/handshake 
synchronization, which can wait for responsive threads but not for 
threads that won't respond (blocked and in native).

I don't get why you want this filtering, though. It seems like a 
premature optimization that, in practice, avoids poking threads that 
are blocked or in native. But those threads are handled efficiently by 
the handshaking mechanism anyway: it already filters out such threads 
(but in a correct way), and when they come back they will quickly 
disarm themselves and continue. So I think the solution is to just 
handshake all threads and be done with it, letting the handshaking 
mechanism perform the proper filtering in a context where it is valid; 
see the sketch below. I don't think it will make any perf difference.
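
To make that concrete, here is a minimal sketch of what I mean. This is 
illustrative only, not code from the prototype; the closure name and 
synchronize_epoch() are made up:

    class G1EpochHandshakeClosure : public HandshakeClosure {
    public:
      G1EpochHandshakeClosure() : HandshakeClosure("G1EpochSync") {}
      virtual void do_thread(Thread* thread) {
        // Nothing to do per thread: by the time this runs, 'thread' has
        // synchronized with the handshake, so any card writes it issued
        // before the handshake are visible to the refinement threads.
      }
    };

    void synchronize_epoch() {
      G1EpochHandshakeClosure cl;
      // Handshake all JavaThreads; threads that are blocked or in native
      // are filtered out by the handshake machinery itself, in a context
      // where reading their state is valid.
      Handshake::execute(&cl);
    }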

> 3. Native write barrier (G1BarrierSet::write_ref_field_post).
> The epoch synchronization protocol does not synchronize with threads in 
> _thread_in_native or _thread_in_vm state. It is much slower if we 
> synchronize with such threads.

Is it really, though? Threads are in the vm state only briefly by 
design (so as not to hurt time-to-safepoint), and for threads in native 
the handshake operation (a no-op here) is performed by the handshaking 
thread on their behalf; the handshakee will just see that it needs to 
flip a bit when coming back from native. I can't see why it would make 
a difference. Do you have numbers showing this?

> Moreover, there are non-Java threads (e.g. concurrent mark worker) that 
> could execute the native write barrier.
> As a result, it's probably best to keep the StoreLoad fence in the 
> native write barrier.

Yes, runtime barriers are stone cold, so it shouldn't matter if one 
performs a fence.
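
For illustration, keeping the fence in the native barrier would look 
roughly like this. This is only a sketch mirroring the pseudocode you 
give below; same_region(), card_for(), dirty_card_val() and 
enqueue_card() are stand-in names, not HotSpot's actual identifiers:

    inline void native_post_barrier(oop* field, oop new_val) {
      // Cross-region filter: same-region and NULL stores need no card.
      if (new_val == NULL || same_region(field, new_val)) {
        return;
      }
      // StoreLoad: order the reference store before the card read. The
      // runtime path is cold, so the cost of the fence is acceptable.
      OrderAccess::fence();
      volatile CardValue* card = card_for(field);
      if (*card == dirty_card_val()) {
        return;  // already dirty; refinement will see it
      }
      *card = dirty_card_val();
      enqueue_card(card);  // hand the card to concurrent refinement
    }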

> The final write post-barrier for JDK-8226731 would be:
> Given:
> x.a = q
> and
> p = @x.a
> 
> For Interpreter/C1/C2:
> if (p and q in the same region or q == NULL) -> exit
> if (card(p) == Dirty) -> exit
> card(p) = Dirty;
> enqueue(card(p))
> 
> For the native barrier:
> if (p and q in the same region or q == NULL) -> exit
> StoreLoad;
> if (card(p) == Dirty) -> exit
> card(p) = Dirty;
> enqueue(card(p))
> 
> Does the above look good? Do we need to micro-benchmark the potential 
> overhead added to the native barrier? Or are typical macro-benchmarks 
> sufficient?
> 
> 4. Performance benchmarking.
> I did some preliminary benchmarking with DaCapo and BigRamTester, 
> without changing the write barrier. This is to measure the overhead 
> added by the epoch synchronization protocol.
> Under the default JVM setting, I didn't see any performance difference. 
> When I tuned the JVM to refine cards aggressively, there was 
> still no difference in BigRamTester (probably because it only has 2 
> threads). Some DaCapo benchmarks saw 10-15% more CPU usage due to doing 
> more work in the refinement threads, and 2-5% total throughput 
> regression for these benchmarks.
> The aggressive refinement flags are "-XX:-G1UseAdaptiveConcRefinement 
> -XX:G1UpdateBufferSize=4 -XX:G1ConcRefinementGreenZone=0 
> -XX:G1ConcRefinementYellowZone=1".
> 
> I wonder how seriously we should treat the aggressive refinement case. 
> Regression in this case is likely unavoidable, so how much regression is 
> tolerable?
> Also, does anyone know a better benchmark to test refinement with 
> default JVM settings? Ideally it (1) has many mutator threads; (2) 
> triggers concurrent refinement frequently; (3) runs with a sizable Xmx 
> (e.g., 2GiB or above).

I'll probably leave the rest of the discussion to other G1 folks.

/Erik

> -Man
> 
> 
> On Thu, Jan 16, 2020 at 4:06 AM Florian Weimer <fweimer at redhat.com> 
> wrote:
> 
>     * Man Cao:
> 
>      > We had an offline discussion on this. To keep the community in
>      > the loop, here is what we discussed.
>      >
>      > a. Using the Linux membarrier syscall or an equivalent on other
>      > OSes seems a cleaner solution than thread-local handshake (TLH).
>      > But we need to have a backup mechanism for OSes and older Linuxes
>      > that do not have such a syscall.
> 
>     Can you make do with a membarrier call that doesn't require registration?
> 
>     The usual fallback for membarrier is sending a special signal to all
>     threads and making sure that they have run code in a signal handler
>     (possibly using a CPU barrier there).  But of course this is rather
>     slow.
> 
>     membarrier has seen some backporting activity, but as far as I can see,
>     that hasn't been consistent across architectures.
> 
>     Thanks,
>     Florian
> 
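
For reference, the registration-free variant Florian asks about is 
MEMBARRIER_CMD_GLOBAL, which can be probed at startup. Below is a 
minimal sketch of that probe-and-fallback; the helper names are my own 
invention, not anything from the prototype:

    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long membarrier(int cmd, unsigned int flags) {
      return syscall(__NR_membarrier, cmd, flags);
    }

    // Issues a system-wide barrier via membarrier if supported. Callers
    // fall back to e.g. signal-based serialization (or handshakes) when
    // this returns false.
    static bool global_membarrier() {
      long cmds = membarrier(MEMBARRIER_CMD_QUERY, 0);
      if (cmds < 0 || !(cmds & MEMBARRIER_CMD_GLOBAL)) {
        return false;  // syscall or command unavailable (older kernels)
      }
      // MEMBARRIER_CMD_GLOBAL needs no prior registration, unlike the
      // PRIVATE_EXPEDITED commands.
      return membarrier(MEMBARRIER_CMD_GLOBAL, 0) == 0;
    }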


