Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Man Cao manc at google.com
Wed Apr 7 18:19:27 UTC 2021


Has anyone had a chance to take a look?

I also implemented the fence removal in the interpreter and C1/C2. The
patch is quite small, so perhaps it is better to merge the two changes
into one pull request:
https://github.com/caoman/jdk/tree/8226731fenceRemoval

-Man


On Tue, Mar 30, 2021 at 7:43 PM Man Cao <manc at google.com> wrote:

> Hi all,
>
> I finally managed to allocate more time to make progress on this, and
> have resolved most of the issues from the last discussion.
> I've updated the description in
> https://bugs.openjdk.java.net/browse/JDK-8236485, and the current
> prototype is the HEAD commit at
> https://github.com/caoman/jdk/tree/g1EpochSync.
> Notable changes include:
> - The protocol uses the async handshake from JDK-8238761
> <https://bugs.openjdk.java.net/browse/JDK-8238761> to resolve the
> blocking issue with normal handshakes.
> - To support asynchronous refinement resulting from the async handshake,
> I added a _deferred global queue to G1DirtyCardQueueSet. Buffers rarely
> get enqueued to _deferred at run time.
> - The async handshake is only executed for a subset of threads.
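>
> To make the shape of the protocol concrete, here is a toy model of the
> epoch synchronization idea using plain C++ atomics. It is only an
> illustration of the concept, not the prototype code; the prototype uses
> the async handshake and thread-state machinery instead of per-thread
> atomics like this:
>
>   #include <atomic>
>   #include <cstdint>
>   #include <vector>
>
>   // Toy model only. A requester (e.g. a refinement thread) bumps a
>   // global epoch, then waits until every mutator has published an
>   // epoch at least that large. A mutator publishes the global epoch
>   // at a point where it has no unordered card-table stores pending.
>   struct ToyEpochSync {
>     std::atomic<uint64_t> global_epoch{0};
>     std::vector<std::atomic<uint64_t>*> mutator_epochs;
>
>     uint64_t request() {
>       return global_epoch.fetch_add(1) + 1;
>     }
>
>     bool all_synchronized(uint64_t target) const {
>       for (auto* e : mutator_epochs) {
>         if (e->load(std::memory_order_acquire) < target) return false;
>       }
>       return true;
>     }
>
>     // Called by a mutator at a synchronization point (the analogue of
>     // executing the async handshake closure).
>     void mutator_sync(std::atomic<uint64_t>& my_epoch) {
>       my_epoch.store(global_epoch.load(), std::memory_order_release);
>     }
>   };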
>
> I have a couple of questions:
>
> 1. Code review and patch size.
> Should I start a pull request for this change, so it is easier to give
> feedback?
>
> What is the recommended approach for dealing with large changes?
> Currently the patch is about 1200 lines, without changing the write
> barrier itself (JDK-8226731).
> Since pushing only this patch would add some overhead to refinement
> without bringing the performance improvement from removing the write
> barrier's fence, do you recommend implementing the write barrier change
> in the same patch?
> Keeping the epoch sync patch separate from the write barrier patch has
> some benefit for testing, in case the epoch patch introduces any bugs.
> Currently the new code is mostly guarded by a flag,
> -XX:+G1TestEpochSyncInConcRefinement, which will be removed after the
> write barrier change. It could serve as an emergency flag to work around
> bugs, instead of backing out the entire change. We probably cannot have
> such a flag if we bundle the changes in one patch (it would be too ugly
> to have a flag in the interpreter and compilers).
>
> 2. Checking if a remote thread is in _thread_in_Java state.
> eosterlund@ pointed out that it was incorrect to just check
> JavaThread::thread_state() == _thread_in_Java.
> I looked into thread state transitions, and revised the check to also
> compare with _thread_in_native_trans and _thread_in_vm_trans.
> I think it is now correct for the purpose of epoch synchronization,
> i.e., it never "misses" a remote thread that is actually in the in_Java
> state.
> A detailed comment is here:
> https://github.com/caoman/jdk/blob/2047ecefceb074e80d73e0d521d64a220fdc5779/src/hotspot/share/gc/g1/g1EpochSynchronizer.cpp#L67-L90
> .
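>
> For reference, the state check boils down to something like the sketch
> below (the helper name is made up; the linked comment has the full
> reasoning):
>
>   // Sketch only. A remote thread must be treated as possibly executing
>   // Java code not only in _thread_in_Java, but also in the transitional
>   // states, because a thread in transition may not yet have fully
>   // published its new state.
>   static bool maybe_in_java(JavaThreadState state) {
>     return state == _thread_in_Java
>         || state == _thread_in_native_trans
>         || state == _thread_in_vm_trans;
>   }
>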
> *Erik, could you take a look and decide if it is correct? If it is still
> incorrect, could you advise a proper way to do this?*
>
> 3. Native write barrier (G1BarrierSet::write_ref_field_post).
> The epoch synchronization protocol does not synchronize with threads in
> _thread_in_native or _thread_in_vm state; synchronizing with such
> threads would be much slower.
> Moreover, there are non-Java threads (e.g., concurrent mark workers)
> that could execute the native write barrier.
> As a result, it's probably best to keep the StoreLoad fence in the native
> write barrier.
> The final write post-barrier for JDK-8226731 would be:
> Given:
> x.a = q
> and
> p = @x.a
>
> For Interpreter/C1/C2:
> if (p and q are in the same region, or q == NULL) -> exit
> if (card(p) == Dirty) -> exit
> card(p) = Dirty;
> enqueue(card(p))
>
> For the native barrier:
> if (p and q are in the same region, or q == NULL) -> exit
> StoreLoad;
> if (card(p) == Dirty) -> exit
> card(p) = Dirty;
> enqueue(card(p))
>
> Does the above look good? Do we need to micro-benchmark the potential
> overhead added to the native barrier? Or are typical macro-benchmarks
> sufficient?
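>
> In C++ terms, the native barrier would look roughly like the sketch
> below (same_region, card_for and enqueue_card are placeholder helpers,
> not actual HotSpot functions; the compiled barrier would be the same
> minus the fence):
>
>   // Rough sketch of the proposed native post-barrier, not the actual
>   // G1BarrierSet::write_ref_field_post implementation.
>   void native_post_barrier(void* field, oop new_val) {
>     if (new_val == NULL || same_region(field, new_val)) {
>       return;                        // cross-region / NULL filter
>     }
>     OrderAccess::fence();            // StoreLoad kept for native/VM threads
>     volatile CardValue* card = card_for(field);
>     if (*card == G1CardTable::dirty_card_val()) {
>       return;                        // card already dirty
>     }
>     *card = G1CardTable::dirty_card_val();
>     enqueue_card(card);
>   }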
>
> 4. Performance benchmarking.
> I did some preliminary benchmarking with DaCapo and BigRamTester, without
> changing the write barrier. This is to measure the overhead added by the
> epoch synchronization protocol.
> Under the default JVM settings, I didn't see any performance difference.
> When I tuned the JVM to refine cards aggressively, there was
> still no difference in BigRamTester (probably because it only has 2
> threads). Some DaCapo benchmarks saw 10-15% more CPU usage from doing
> more work in the refinement threads, and a 2-5% total throughput
> regression for those benchmarks.
> The aggressive refinement flags are "-XX:-G1UseAdaptiveConcRefinement
> -XX:G1UpdateBufferSize=4 -XX:G1ConcRefinementGreenZone=0
> -XX:G1ConcRefinementYellowZone=1".
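>
> For example, an aggressive-refinement run might look like the following
> (heap size and benchmark are placeholders):
>
>   java -XX:+UseG1GC -Xmx2g \
>        -XX:-G1UseAdaptiveConcRefinement -XX:G1UpdateBufferSize=4 \
>        -XX:G1ConcRefinementGreenZone=0 -XX:G1ConcRefinementYellowZone=1 \
>        <benchmark>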
>
> I wonder how seriously we should treat the aggressive refinement case.
> Regression in this case is likely unavoidable, so how much regression is
> tolerable?
> Also, does anyone know a better benchmark to test refinement with default
> JVM settings? Ideally it (1) has many mutator threads; (2) triggers
> concurrent refinement frequently; (3) runs with a sizable Xmx (e.g., 2GiB
> or above).
>
> -Man
>
>
> On Thu, Jan 16, 2020 at 4:06 AM Florian Weimer <fweimer at redhat.com> wrote:
>
>> * Man Cao:
>>
>> > We had an offline discussion on this. To keep the community in the loop,
>> > here is what we discussed.
>> >
>> > a. Using the Linux membarrier syscall or an equivalent on other OSes
>> > seems a cleaner solution than thread-local handshakes (TLH). But we
>> > need a backup mechanism for OSes and older Linux kernels that do not
>> > have such a syscall.
>>
>> Can you make do with a membarrier call that doesn't require registration?
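>>
>> (Something like this sketch, I assume -- MEMBARRIER_CMD_GLOBAL, formerly
>> MEMBARRIER_CMD_SHARED, needs no registration, at the cost of being much
>> slower than the private expedited commands:)
>>
>>   #include <linux/membarrier.h>
>>   #include <sys/syscall.h>
>>   #include <unistd.h>
>>
>>   // Sketch: global memory barrier without prior registration.
>>   // Returns -1 with errno == ENOSYS on kernels without membarrier,
>>   // in which case a fallback would be needed.
>>   static long global_membarrier() {
>>     return syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
>>   }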
>>
>> The usual fallback for membarrier is sending a special signal to all
>> threads and making sure that they have run code in a signal handler
>> (possibly using a CPU barrier there).  But of course this is rather
>> slow.
>>
>> membarrier has seen some backporting activity, but as far as I can see,
>> that hasn't been consistent across architectures.
>>
>> Thanks,
>> Florian
>>
>>


