RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5]
Ivan Walulya
iwalulya at openjdk.org
Mon Mar 3 20:18:58 UTC 2025
On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:
>> Hi all,
>>
>> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>>
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
>>
>> ### Current situation
>>
>> With this change, G1 reduces its post-write barrier to much more closely resemble Parallel GC's, as described in the JEP. The motivation is that G1 trails Parallel/Serial GC in throughput, largely due to its larger write barrier.
>>
>> The main reason for the current barrier's size is how G1 implements concurrent refinement:
>> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the locations of dirtied cards. Refinement threads pick up their contents to refine them, so the barrier needs to enqueue card locations.
>> * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
>> * Finally, there is generic code to avoid dirtying cards altogether (filters), so that the synchronization and enqueuing are executed as rarely as possible.
>>
>> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>>
>>
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done; // null value check
>> if (card(@x.a) == young_card) goto done; // write to young gen check
>> StoreLoad; // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>>
>> *card(@x.a) = dirty
>>
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>>
>> call runtime to move thread-local-dcq into dcqs
>>
>> done:
>>
>>
>> Overall, this post-write barrier alone amounts to 40-50 instructions in total, compared to three or four(!) for Parallel and Serial GC.
>>
>> The inlined barrier not only adds a large code footprint, but also prevents compiler optimizations such as loop unrolling or inlining.
>>
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>>
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:
>
> * fix comment (trailing whitespace)
> * another assert when snapshotting at a safepoint.
src/hotspot/share/gc/g1/g1CardTable.cpp line 44:
> 42: if (!failures) {
> 43: G1CollectedHeap* g1h = G1CollectedHeap::heap();
> 44: G1HeapRegion* r = g1h->heap_region_containing(mr.start());
We can probably move this outside the loop and assert that `mr` does not cross region boundaries.
src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 916:
> 914: void safepoint_synchronize_end() override;
> 915:
> 916: jlong synchronized_duration() const { return _safepoint_duration; }
`safepoint_duration()` seems easier to comprehend.
src/hotspot/share/gc/g1/g1CollectionSet.cpp line 310:
> 308: verify_young_cset_indices();
> 309:
> 310: size_t card_rs_length = _policy->analytics()->predict_card_rs_length(in_young_only_phase);
Why are we using a prediction here? Additionally, won't this prediction also include cards from old-gen regions in the case of mixed GCs? How do we reconcile that when we are adding old-gen regions to the c-set?
src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 42:
> 40: class G1HeapRegion;
> 41: class G1Policy;
> 42: class G1CardTableClaimTable;
Nit: ordering of the declarations
src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84:
> 82: // Tracks the current refinement state from idle to completion (and reset back
> 83: // to idle).
> 84: class G1ConcurrentRefineWorkState {
`G1ConcurrentRefinementState`? I am not convinced the "Work" adds any clarity.
src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 113:
> 111: // Current epoch the work has been started; used to determine if there has been
> 112: // a forced card table swap due to a garbage collection while doing work.
> 113: size_t _refine_work_epoch;
Same as the previous comment: why `refine_work` instead of `refinement`?
src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43:
> 41: size_t _cards_clean; // Number of cards found clean.
> 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined.
> 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young.
From the naming alone it is not clear how `_cards_still_refer_to_cset` differs from `_cards_refer_to_cset`, and the comment does not help with that.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977688778
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977969470
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977982999
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977991124
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978017843
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978019093
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978119476
More information about the core-libs-dev
mailing list