RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17]

Wed Mar 12 11:58:45 UTC 2025

> Hi all,
> 
>   please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
> 
> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
> 
> ### Current situation
> 
> With this change, G1 reduces its post-write barrier so that it much more closely resembles Parallel GC's, as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput because of its larger barrier.
> 
> The main reason for the current barrier is how G1 implements concurrent refinement:
> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the locations of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
> * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
> * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
> 
> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
> 
> 
> // Filtering
> if (region(@x.a) == region(y)) goto done; // same region check
> if (y == null) goto done;     // null value check
> if (card(@x.a) == young_card) goto done;  // write to young gen check
> StoreLoad;                // synchronize
> if (card(@x.a) == dirty_card) goto done;
> 
> *card(@x.a) = dirty;
> 
> // Card tracking
> enqueue(card-address(@x.a)) into thread-local-dcq;
> if (thread-local-dcq is not full) goto done;
> 
> call runtime to move thread-local-dcq into dcqs
> 
> done:
> 
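The pseudo code above can be read as the following C++ sketch. It is only an illustration, not HotSpot code: the `Heap` and `MutatorThread` types, the card values, the 512-byte card and 1 MiB region sizes, and the tiny queue capacity are all assumptions made up for the example, and addresses are modeled as plain offsets with 0 standing in for null.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// All names and constants here are hypothetical, chosen for illustration.
constexpr std::size_t  kCardShift     = 9;   // assumed: 512-byte cards
constexpr std::size_t  kRegionShift   = 20;  // assumed: 1 MiB regions
constexpr std::size_t  kQueueCapacity = 4;   // tiny, so the slow path is easy to reach
constexpr std::uint8_t kClean = 0, kDirty = 1, kYoung = 2;  // simplified card values

struct Heap {
  std::vector<std::uint8_t> cards;             // one byte per card
  std::vector<std::vector<std::size_t>> dcqs;  // global set of filled buffers
  explicit Heap(std::size_t bytes) : cards(bytes >> kCardShift, kClean) {}
};

struct MutatorThread {
  std::vector<std::size_t> dcq;  // thread-local dirty card queue
};

// Post-write barrier for storing `value` into the field at offset `field`.
// Returns true if a card was newly dirtied and enqueued.
// Note: this sketch is single-threaded; a concurrent version would need
// atomic card accesses in addition to the fence.
bool post_write_barrier(Heap& heap, MutatorThread& t,
                        std::size_t field, std::size_t value) {
  // Filtering
  if ((field >> kRegionShift) == (value >> kRegionShift)) return false;  // same region
  if (value == 0) return false;                                          // null value
  const std::size_t card = field >> kCardShift;
  assert(card < heap.cards.size());
  if (heap.cards[card] == kYoung) return false;          // write to young gen
  std::atomic_thread_fence(std::memory_order_seq_cst);   // stands in for StoreLoad
  if (heap.cards[card] == kDirty) return false;          // already dirty

  heap.cards[card] = kDirty;

  // Card tracking
  t.dcq.push_back(card);
  if (t.dcq.size() < kQueueCapacity) return true;

  // Slow path: hand the full thread-local buffer over to the global set.
  heap.dcqs.push_back(std::move(t.dcq));
  t.dcq.clear();
  return true;
}
```

Even in this reduced form, every store that passes the filters pays for a fence, a card store, and an enqueue, which gives a feel for why the inlined sequence is long.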
> 
> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC.
> 
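For comparison, an unconditional card-marking barrier in the style attributed to Parallel and Serial GC can be sketched as below. This is a simplified model rather than the actual HotSpot barrier: the card size, the card values, and the function name `card_mark_barrier` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for illustration only.
constexpr std::size_t  kCardShift = 9;  // assumed: 512-byte cards
constexpr std::uint8_t kClean = 0, kDirty = 1;

// Unconditional card mark: compute the card index and store "dirty".
// No filtering, no fence, no queueing.
inline void card_mark_barrier(std::vector<std::uint8_t>& cards, std::size_t field) {
  const std::size_t card = field >> kCardShift;
  assert(card < cards.size());
  cards[card] = kDirty;
}
```

The whole barrier is a shift, an index into the table, and a byte store, which is consistent with the three-or-four-instruction figure quoted above.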
> The inlined barrier not only has a large code footprint, but its size also prevents some compiler optimizations like loop unrolling or inlining.
> 
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
> 
> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se...
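
The quoted description is cut off here, so the following is only a rough sketch of the general idea of switching between two card tables, based on the visible text. The `CardTables` type, its member names, and the constants are hypothetical, and the sketch deliberately leaves out how a real implementation makes sure no mutator is still writing to the old table before refinement reads it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants and types for illustration only.
constexpr std::size_t  kCardShift = 9;  // assumed: 512-byte cards
constexpr std::uint8_t kClean = 0, kDirty = 1;
using CardTable = std::vector<std::uint8_t>;

struct CardTables {
  CardTable a, b;
  std::atomic<CardTable*> primary;  // the table mutators mark
  CardTable* refinement;            // the table refinement works on

  explicit CardTables(std::size_t num_cards)
      : a(num_cards, kClean), b(num_cards, kClean), primary(&a), refinement(&b) {}

  // Mutator side: plain card mark on whichever table is currently primary.
  void mutator_mark(std::size_t field) {
    CardTable* table = primary.load(std::memory_order_acquire);
    const std::size_t card = field >> kCardShift;
    assert(card < table->size());
    (*table)[card] = kDirty;
  }

  // Coarse-grained switch: one atomic exchange swaps the roles of the two
  // tables. Returns the former primary table, which refinement now owns.
  CardTable* swap_tables() {
    refinement = primary.exchange(refinement, std::memory_order_acq_rel);
    return refinement;
  }
};
```

After `swap_tables()`, new card marks land in the other table, so the returned table can be scanned without per-store synchronization in the barrier.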

Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision:

 - Merge branch 'master' into 8342382-card-table-instead-of-dcq
 - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang
 - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table.
   The cause is last-minute changes made before making the PR ready for review.

     Testing: without the patch, the crash occurs fairly frequently (1 in 20) when
   continuously starting refinement. With the patch, it no longer occurs.
 - * ayang review 3
     * comments
     * minor refactorings
 - * iwalulya review
     * renaming
     * fix some includes, forward declaration
 - * fix whitespace
   * additional whitespace between log tags
   * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename
 - ayang review
     * renamings
     * refactorings
 - iwalulya review
     * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement
     * predicate for determining whether the refinement has been disabled
     * some other typos/comment improvements
     * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming
 - * ayang review - fix comment
 - * iwalulya review 2
     * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState
     * some additional documentation
 - ... and 14 more: https://git.openjdk.org/jdk/compare/f77fa17b...aec95051

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23739/files
  - new: https://git.openjdk.org/jdk/pull/23739/files/758fac01..aec95051

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15-16

  Stats: 78123 lines in 1539 files changed: 36243 ins; 29177 del; 12703 mod
  Patch: https://git.openjdk.org/jdk/pull/23739.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739

PR: https://git.openjdk.org/jdk/pull/23739