RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2]

Thu Feb 27 18:34:14 UTC 2025

On Tue, 25 Feb 2025 15:13:43 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>> 
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:
> 
>   * remove unnecessarily added logging

src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54:

> 52: // them, keeping the write barrier simple.
> 53: //
> 54: // The refinement threads mark cards in the the current collection set specially on the

"the the" typo.

src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 47:

> 45: 
> 46: // Returns bits from a where mask is 0, and bits from b where mask is 1.
> 47: inline size_t blend(size_t a, size_t b, size_t mask) {

Can you provide some input/output examples in the doc?

src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 45:

> 43: }
> 44: 
> 45: void G1CardTableClaimTable::initialize(size_t max_reserved_regions) {

Should the arg be `uint`?

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 280:

> 278:   assert_state(State::SweepRT);
> 279: 
> 280:   set_state_start_time();

This method is called in a loop; would that skew the state-starting time?

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 344:

> 342:     size_t _num_clean;
> 343:     size_t _num_dirty;
> 344:     size_t _num_to_cset;

Seem never read.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349:

> 347: 
> 348:     bool do_heap_region(G1HeapRegion* r) override {
> 349:       if (!r->is_free()) {

I am a bit lost on this closure; the intention seems to set unclaimed to all non-free regions, why can't this be done in one go, instead of first setting all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones unclaimed?

src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116:

> 114: 
> 115:   // Current heap snapshot.
> 116:   G1CardTableClaimTable* _sweep_state;

Since this is a table, I wonder if we can name it "x_table" instead of "x_state".

src/hotspot/share/gc/g1/g1RemSet.cpp line 147:

> 145:     if (_contains[region]) {
> 146:       return;
> 147:     }

Indentation seems broken.

src/hotspot/share/gc/g1/g1RemSet.cpp line 830:

> 828:         size_t const start_idx = region_card_base_idx + claim.value();
> 829: 
> 830:         size_t* card_cur_card = (size_t*)card_table->byte_for_index(start_idx);

This var name should end with "_word", instead of "_card".

src/hotspot/share/gc/g1/g1RemSet.cpp line 1252:

> 1250:     G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed);
> 1251:     claim = &constructed;
> 1252:   }

It's not super obvious to me why the "has_sweep_claims" checking needs to be on this level. Can `G1ConcurrentRefineWorkState` return a valid `G1CardTableClaimTable*` directly?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974124792
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1971426039
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973435950
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974083760
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973447654
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973452168
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974056492
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973423400
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974108760
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974134441