RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30]

Wed Apr 9 10:36:44 UTC 2025

On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>> 
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c...
>
> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
> 
>  - * missing file from merge
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq
>  - * make young gen length revising independent of refinement thread
>      * use a service task
>      * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update
>  - * fix IR code generation tests that change due to barrier cost changes
>  - * factor out card table and refinement table merging into a single
>      method
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq3
>  - * obsolete G1UpdateBufferSize
>    
>    G1UpdateBufferSize has previously been used to size the refinement
>    buffers and impose a minimum limit on the number of cards per thread
>    that need to be pending before refinement starts.
>    
>    The former function is now obsolete with the removal of the dirty
>    card queues, the latter functionality has been taken over by the new
>    diagnostic option `G1PerThreadPendingCardThreshold`.
>    
>    I prefer to make this a diagnostic option is better than a product option
>    because it is something that is only necessary for some test cases to
>    produce some otherwise unwanted behavior (continuous refinement).
>    
>    CSR is pending.
>  - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 170:

> 168:   }
> 169:   return result;
> 170: }

I see in `G1ConcurrentRefineThread::do_refinement`:

      // The yielding may have completed the task, check.
      if (!state.is_in_progress()) {

I wonder if it's simpler to use `is_in_progress` consistently to detect whether we should restart sweep, instead of `_sweep_start_epoch`.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349:

> 347:   }
> 348: 
> 349:   bool has_sweep_rt_work = is_in_progress() && _state == State::SweepRT;

Why `is_in_progress()`?

src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 79:

> 77: 
> 78:   void inc_cards_scanned(size_t increment = 1) { _cards_scanned += increment; }
> 79:   void inc_cards_clean(size_t increment = 1) { _cards_clean += increment; }

The sole caller always passes in arg, so no need for default-arg-value.

src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 87:

> 85:   void add_atomic(G1ConcurrentRefineStats* other);
> 86: 
> 87:   G1ConcurrentRefineStats& operator+=(const G1ConcurrentRefineStats& other);

Seems that these operators are not used after this PR.

src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 83:

> 81:         break;
> 82:       }
> 83:       case G1RemSet::HasRefToOld : break; // Nothing special to do.

Why doesn't call `inc_cards_clean_again` in this case? The card is cleared also. (In fact, I don't get why this needs to a separate case from `NoInteresting`.)

src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 156:

> 154: 
> 155:       _refine_stats.inc_cards_scanned(claim.size());
> 156:       _refine_stats.inc_cards_clean(claim.size() - scanned);

I feel these two "scanned" mean sth diff; the local var should probably be sth like `num_dirty_cards`.

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 207:

> 205: 
> 206:   if (!interrupted_by_gc) {
> 207:     state.add_yield_duration(G1CollectedHeap::heap()->safepoint_duration() - synchronize_duration_at_sweep_start);

I think this is recorded to later calculate actual refine-time, i.e. sweep-time - yield-time. However, why can't yield-duration be recorded in this refine-control-thread directly --  accumulation of `jlong yield_duration = os::elapsed_counter() - yield_start`. I feel that is easier to reason than going through g1heap.

src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 75:

> 73:   {
> 74:     MutexLocker x(G1ReviseYoungLength_lock, Mutex::_no_safepoint_check_flag);
> 75:     G1Policy* p = g1h->policy();

Can probably use the existing `policy`.

src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 88:

> 86: }
> 87: 
> 88: G1ReviseYoungLengthTargetLengthTask::G1ReviseYoungLengthTargetLengthTask(const char* name) :

I wonder if the class name can be shortened a bit, sth like `G1ReviseYoungLengthTask`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033251162
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033222407
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033929489
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033975054
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033934399
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033910496
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2032008908
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855278
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855435