Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Man Cao manc at google.com
Wed Mar 31 02:43:13 UTC 2021


Hi all,

I finally managed to allocate more time to make progress on this, and have
resolved most of the issues from the last discussion.
I've updated the description in
https://bugs.openjdk.java.net/browse/JDK-8236485, and the current prototype
is the HEAD commit at https://github.com/caoman/jdk/tree/g1EpochSync.
Notable changes include:
- The protocol uses the async handshake from JDK-8238761
<https://bugs.openjdk.java.net/browse/JDK-8238761> to avoid the blocking
issue with normal handshakes.
- To support asynchronous refinement resulting from the async handshake, I
added a _deferred global queue to G1DirtyCardQueueSet (a standalone sketch
of the idea follows this list). Buffers rarely get enqueued to _deferred at
run time.
- The async handshake only executes for a subset of threads.
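
For the curious, here is a standalone sketch of the idea behind _deferred
(plain C++ with std::atomic, not the actual G1DirtyCardQueueSet code; the
type and member names are made up for illustration): a buffer that cannot be
refined immediately from the async-handshake path is parked on a lock-free
list and drained later by the refinement threads.

#include <atomic>

struct Buffer {
  Buffer* next = nullptr;
  // ... card entries would live here ...
};

class DeferredBufferList {
  std::atomic<Buffer*> _head{nullptr};
public:
  // Called from the async-handshake path when a buffer must wait.
  void push(Buffer* b) {
    Buffer* old = _head.load(std::memory_order_relaxed);
    do {
      b->next = old;
    } while (!_head.compare_exchange_weak(old, b,
                                          std::memory_order_release,
                                          std::memory_order_relaxed));
  }
  // Called by a refinement thread to take everything that was deferred.
  Buffer* take_all() {
    return _head.exchange(nullptr, std::memory_order_acquire);
  }
};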

I have a couple of questions:

1. Code review and patch size.
Should I start a pull request for this change, so it is easier to give
feedback?

What is the recommended approach to deal with large changes? Currently the
patch is about 1200 lines, without changing the write barrier itself
(JDK-8226731).
Since pushing only this patch will add some overhead to refinement without
bringing any of the performance improvement from removing the write
barrier's fence, do you recommend that I also implement the write barrier
change in the same patch?
Keeping the epoch sync patch separate from the write barrier patch has some
benefit for testing, in case the epoch patch introduces any bugs.
Currently the new code is mostly guarded by a flag,
-XX:+G1TestEpochSyncInConcRefinement, which will be removed after the write
barrier change lands. Until then, it could serve as an emergency flag to
work around bugs, instead of backing out the entire change. We probably
cannot have such a flag if we bundle the changes in one patch (it is too
ugly to have a flag in the interpreter and compilers).

2. Checking if a remote thread is in _thread_in_Java state.
eosterlund@ pointed out that it was incorrect to simply check
JavaThread::thread_state() == _thread_in_Java.
I looked into the thread state transitions and revised the check to also
compare against _thread_in_native_trans and _thread_in_vm_trans.
I believe it is now correct for the purpose of epoch synchronization, i.e.,
it never "misses" a remote thread that is actually in the in_Java state.
A detailed comment is here:
https://github.com/caoman/jdk/blob/2047ecefceb074e80d73e0d521d64a220fdc5779/src/hotspot/share/gc/g1/g1EpochSynchronizer.cpp#L67-L90
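
In short, the check is roughly the following, paraphrased as standalone C++
(the real code is in g1EpochSynchronizer.cpp and uses
JavaThread::thread_state(); the enum below omits states that are not
relevant here, and the helper name is made up for this sketch):

enum JavaThreadState {
  _thread_in_native,
  _thread_in_native_trans,  // leaving native, may proceed to in_Java
  _thread_in_vm,
  _thread_in_vm_trans,      // leaving the VM, may proceed to in_Java
  _thread_in_Java,
  _thread_blocked
};

// Conservative: include the two transitional states, since a thread
// observed in one of them may already be, or is about to be, executing
// Java code, so the epoch synchronizer must not skip it.
static bool maybe_in_java(JavaThreadState s) {
  return s == _thread_in_Java ||
         s == _thread_in_native_trans ||
         s == _thread_in_vm_trans;
}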
*Erik, could you take a look and decide if it is correct? If it is still
incorrect, could you advise a proper way to do this?*

3. Native write barrier (G1BarrierSet::write_ref_field_post).
The epoch synchronization protocol does not synchronize with threads in the
_thread_in_native or _thread_in_vm state; synchronizing with such threads
would be much slower.
Moreover, there are non-Java threads (e.g., concurrent mark workers) that
could execute the native write barrier.
As a result, it is probably best to keep the StoreLoad fence in the native
write barrier.
The final write post-barrier for JDK-8226731 would be:
Given:
x.a = q
and
p = @x.a

For Interpreter/C1/C2:
if (p and q are in the same region, or q == NULL) -> exit
if (card(p) == Dirty) -> exit
card(p) = Dirty
enqueue(card(p))

For the native barrier:
if (p and q are in the same region, or q == NULL) -> exit
StoreLoad
if (card(p) == Dirty) -> exit
card(p) = Dirty
enqueue(card(p))
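
For concreteness, here is a standalone sketch of the native barrier path in
plain C++ (not the actual G1BarrierSet code; the helper functions and the
card value are placeholders for the real region check, card-table lookup and
enqueue):

#include <atomic>

typedef unsigned char CardValue;
static const CardValue dirty_card = 0;  // placeholder value

// Placeholders for the real region check, card-table lookup and enqueue.
bool same_region_or_null(void* field, void* new_val);
CardValue* card_for(void* field);
void enqueue(CardValue* card);

void native_write_ref_field_post(void* field, void* new_val) {
  if (same_region_or_null(field, new_val)) return;  // filter as before
  // Keep the StoreLoad fence here: the epoch protocol does not cover
  // threads in _thread_in_native/_thread_in_vm or non-Java threads.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  CardValue* card = card_for(field);
  if (*card == dirty_card) return;                  // already dirty
  *card = dirty_card;                               // dirty the card
  enqueue(card);                                    // queue for refinement
}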

Does the above look good? Do we need to micro-benchmark the potential
overhead added to the native barrier? Or are typical macro-benchmarks
sufficient?

4. Performance benchmarking.
I did some preliminary benchmarking with DaCapo and BigRamTester, without
changing the write barrier, to measure the overhead added by the epoch
synchronization protocol.
Under the default JVM settings, I didn't see any performance difference.
When I tuned the JVM to refine cards aggressively, there was still no
difference in BigRamTester (probably because it only has 2 threads). Some
DaCapo benchmarks saw 10-15% more CPU usage, due to doing more work in the
refinement threads, and a 2-5% total throughput regression.
The aggressive refinement flags are "-XX:-G1UseAdaptiveConcRefinement
-XX:G1UpdateBufferSize=4 -XX:G1ConcRefinementGreenZone=0
-XX:G1ConcRefinementYellowZone=1".

I wonder how seriously we should treat the aggressive refinement case.
Regression in this case is likely unavoidable, so how much regression is
tolerable?
Also, does anyone know a better benchmark to test refinement with default
JVM settings? Ideally it (1) has many mutator threads; (2) triggers
concurrent refinement frequently; (3) runs with a sizable Xmx (e.g., 2GiB
or above).

-Man


On Thu, Jan 16, 2020 at 4:06 AM Florian Weimer <fweimer at redhat.com> wrote:

> * Man Cao:
>
> > We had an offline discussion on this. To keep the community in the loop,
> > here is what we discussed.
> >
> > a. Using Linux membarrier syscall or equivalent on other OSes seems a
> > cleaner solution than thread-local handshake (TLH). But we need to have a
> > backup mechanism for OSes and older Linuxes that do not have such a
> > syscall.
>
> Can you do with a membarrier call that doesn't require registration?
>
> The usual fallback for membarrier is sending a special signal to all
> threads, and make sure that they have run code in a signal handler
> (possibly using a CPU barrier there).  But of course this is rather
> slow.
>
> membarrier has seen some backporting activity, but as far as I can see,
> that hasn't been consistent across architectures.
>
> Thanks,
> Florian
>
>


