Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Apr 19 13:14:15 UTC 2021
Hi Man,
On 31.03.21 04:43, Man Cao wrote:
> Hi all,
>
> I finally managed to allocate more time to make progress on this, and
> resolved most issues since the last discussion.
> I've updated the description in
> https://bugs.openjdk.java.net/browse/JDK-8236485
> <https://bugs.openjdk.java.net/browse/JDK-8236485>, and the current
> prototype is the HEAD commit at
> https://github.com/caoman/jdk/tree/g1EpochSync
> Notable changes include:
> - The protocol uses async handshake from JDK-8238761
> <https://bugs.openjdk.java.net/browse/JDK-8238761> to resolve the
> blocking issue from normal handshake.
> - In order to support async refinement resulting from the async handshake,
> added a _deferred global queue to G1DirtyCardQueueSet. Buffers
> rarely get enqueued to _deferred at run-time.
> - The async handshake only executes for a subset of threads.
>
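The epoch synchronization idea described above can be modeled roughly as
follows. This is a toy sketch only, not the actual HotSpot code: all names
(ToyThread, ToyEpochSync, request_sync, etc.) are illustrative, and the real
protocol runs the publishing step inside an async handshake on the Java
threads.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Toy model of the epoch synchronization protocol (illustrative names,
// not the actual G1 code). The refiner bumps a global epoch, then an
// async handshake makes each tracked Java thread publish the epoch it
// has observed. A buffer of cards can be refined once every tracked
// thread has observed an epoch at least as large as the buffer's
// target; until then the buffer sits on a deferred queue instead of
// blocking the refinement thread.
struct ToyThread {
    std::atomic<uint64_t> observed{0};
};

struct ToyEpochSync {
    std::atomic<uint64_t> global_epoch{0};
    std::vector<ToyThread*> threads;

    // Refiner side: start a new synchronization round, returning the
    // epoch that all threads must reach.
    uint64_t request_sync() { return global_epoch.fetch_add(1) + 1; }

    // Handshake closure: in the VM this would run asynchronously on (a
    // subset of) the Java threads at their next safepoint poll.
    static void handshake(ToyThread& t, uint64_t epoch) {
        t.observed.store(epoch, std::memory_order_release);
    }

    // Refiner side: may this buffer's cards be processed yet?
    bool reached(uint64_t target) const {
        for (ToyThread* t : threads)
            if (t->observed.load(std::memory_order_acquire) < target)
                return false;  // keep the buffer on the deferred queue
        return true;
    }
};
```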
> I have a couple of questions:
>
> 1. Code review and patch size.
> Should I start a pull request for this change, so it is easier to give
> feedback?
I think email is easy enough.
>
> What is the recommended approach to deal with large changes? Currently
> the patch is about 1200 lines, without changing the write barrier itself
> (JDK-8226731).
> Since pushing only this patch will add some overhead to refinement, but
> not bring any performance improvement from removing the write barrier's
> fence, do you recommend I also implement the write barrier change in the
> same patch?
> Keeping the epoch sync patch separate from the write barrier patch has
> some benefit for testing, in case the epoch patch introduces any bugs.
> Currently the new code is mostly guarded by a flag,
> -XX:+G1TestEpochSyncInConcRefinement, which will be removed after the
> write barrier change. It could be used as an emergency flag to
> work around bugs, instead of backing out the entire change. We
> probably cannot have such a flag if we bundle the changes in one patch
> (it's too ugly to have a flag in the interpreter and compilers).
Erik's suggestion is fine with me.
>
> 3. Native write barrier (G1BarrierSet::write_ref_field_post).
> The epoch synchronization protocol does not synchronize with threads in
> _thread_in_native or _thread_in_vm state. It is much slower if we
> synchronize with such threads.
> Moreover, there are non-Java threads (e.g. concurrent mark worker) that
> could execute the native write barrier.
> As a result, it's probably best to keep the StoreLoad fence in the
> native write barrier.
I agree.
> The final write post-barrier for JDK-8226731 would be:
> Given:
> x.a = q
> and
> p = @x.a
>
> For Interpreter/C1/C2:
> if (p and q in same regions or q == NULL) -> exit
> if (card(p) == Dirty) -> exit
> card(p) = Dirty;
> enqueue(card(p))
>
> For the native barrier:
> if (p and q in same regions or q == NULL) -> exit
> StoreLoad;
> if (card(p) == Dirty) -> exit
> card(p) = Dirty;
> enqueue(card(p))
>
> Does the above look good? Do we need to micro-benchmark the potential
> overhead added to the native barrier? Or are typical macro-benchmarks
> sufficient?
This is a bit worse than before for the native barrier (it could be
optimized to filter out from-young regions, using the heap region table
or querying HeapRegion directly, to get the same effect with respect to
the number of cards attempted to enqueue), but since these paths are
"stone cold" we do not care too much.
> 4. Performance benchmarking.
> I did some preliminary benchmarking with DaCapo and BigRamTester,
> without changing the write barrier. This is to measure the overhead
> added by the epoch synchronization protocol.
> Under the default JVM setting, I didn't see any performance difference.
> When I tuned the JVM to refine cards aggressively, there was
> still no difference in BigRamTester (probably because it only has 2
> threads). Some DaCapo benchmarks saw 10-15% more CPU usage due to doing
> more work in the refinement threads, and 2-5% total throughput
> regression for these benchmarks.
> The aggressive refinement flags are "-XX:-G1UseAdaptiveConcRefinement
> -XX:G1UpdateBufferSize=4 -XX:G1ConcRefinementGreenZone=0
> -XX:G1ConcRefinementYellowZone=1".
>
> I wonder how important we should treat the aggressive refinement case.
> Regression in this case is likely unavoidable, so how much regression is
> tolerable?
This case is interesting to know about but not important.
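For reference, the aggressive refinement configuration quoted above
corresponds to an invocation along these lines (the benchmark class and
heap size are illustrative, not from the original measurements):

```shell
# Disable adaptive refinement and force tiny buffers with zero green zone,
# so refinement threads are activated almost constantly.
java -XX:+UseG1GC \
     -XX:-G1UseAdaptiveConcRefinement \
     -XX:G1UpdateBufferSize=4 \
     -XX:G1ConcRefinementGreenZone=0 \
     -XX:G1ConcRefinementYellowZone=1 \
     -Xmx2g BigRamTester
```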
> Also, does anyone know a better benchmark to test refinement with
> default JVM settings? Ideally it (1) has many mutator threads; (2)
> triggers concurrent refinement frequently; (3) runs with a sizable Xmx
> (e.g., 2GiB or above).
Let me see if I can find something.
Thomas