RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier

Thomas Schatzl tschatzl at openjdk.org
Tue Feb 25 15:04:28 UTC 2025


Hi all,

  please review this change that implements the (currently draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.

The reason for posting this early is that this is a large change; the JEP process is already taking very long with no end in sight, but we would like to have this ready for JDK 25.

### Current situation

With this change, G1 reduces its post-write barrier so that it much more closely resembles Parallel GC's, as described in the JEP. The motivation is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier.

The main reason for the current barrier's size is how G1 implements concurrent refinement:
* G1 tracks dirtied cards using sets (dirty card queue sets - DCQS) of buffers (dirty card queues - DCQs) containing the locations of dirtied cards. Refinement threads pick up their contents to refine the cards. The barrier needs to enqueue card locations.
* For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
* Finally, there is generic code to avoid dirtying cards altogether (filters), so that the synchronization and the enqueuing are executed as rarely as possible.

These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:


```
// Filtering
if (region(@x.a) == region(y)) goto done;  // same-region check
if (y == null) goto done;                  // null value check
if (card(@x.a) == young_card) goto done;   // write-to-young-gen check
StoreLoad;                                 // synchronize
if (card(@x.a) == dirty_card) goto done;   // already-dirty check

*card(@x.a) = dirty

// Card tracking
enqueue(card-address(@x.a)) into thread-local-dcq;
if (thread-local-dcq is not full) goto done;

call runtime to move thread-local-dcq into dcqs

done:
```

Overall, this post-write barrier alone amounts to roughly 40-50 instructions, compared to three or four(!) for Parallel and Serial GC.

The inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.

There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).

The main idea of this change is to replace the fine-grained synchronization between refinement and mutator threads with coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, while refinement threads work on a second card table (the "refinement table"). The second card table also replaces the dirty card queues.

In that scheme the fine-grained synchronization is unnecessary because mutator and refinement threads always write to different memory areas, so no concurrent write where an update could be lost can occur. This removes the need for synchronization on every reference write.
No card enqueuing is required any more either.

Only the filters and the card mark remain.
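
As a rough illustration of the scheme (not the actual HotSpot code; all names below are invented for this sketch), the state boils down to two card table byte arrays whose roles are swapped by the refinement control thread, with the new pointer then distributed to all threads:

```c++
#include <cstdint>
#include <utility>

// Hypothetical helper: distributes the new base pointer to each thread's TLS
// slot, e.g. via a handshake for JavaThreads (name invented for this sketch).
void update_thread_local_card_table_bases(uint8_t* new_base);

// Minimal sketch of the two-table idea; the real state lives in G1's card table
// and refinement code and is swapped by the refinement control thread.
struct G1CardTablesSketch {
  uint8_t* card_table;        // dirtied by mutators through the post-write barrier
  uint8_t* refinement_table;  // scanned and cleaned concurrently by refinement threads

  void swap() {
    std::swap(card_table, refinement_table);
    // Make the inlined barriers pick up the new table to dirty.
    update_thread_local_card_table_bases(card_table);
  }
};
```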

### How this works

In the beginning, both the card table and the refinement table are completely unmarked (contain only "clean" cards). The mutators dirty the card table until G1's heuristics decide that a significant enough amount of cards has been dirtied, based on the budget allocated for scanning them during the garbage collection.

At that point, the card table and the refinement table are exchanged "atomically" using handshakes. The mutators keep dirtying the card table (the previously clean refinement table, which has now become the card table), while the refinement threads look for and refine dirty cards on the refinement table as before.

Refinement of cards is very similar to before: if an interesting reference in a dirty card has been found, G1 records it in appropriate remembered sets. In this implementation there is an exception for references to the current collection set (typically young gen) - the refinement threads redirty that card on the card table with a special `to-collection-set` value.

This is valid because races with the mutator for that write do not matter - the entire card will eventually be rescanned anyway, regardless of whether it ends up as dirty or to-collection-set. The advantage of marking to-collection-set cards specially is that the next time the card tables are swapped, the refinement threads will not re-refine them, under the assumption that the reference to the collection set will not change. This decreases refinement work substantially.

If refinement gets interrupted by GC, the refinement table will be merged with the card table before card scanning, which works as before.
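
For illustration, a simplified per-card version of this refinement step (hypothetical names and card values; the real code works on claimed chunks of regions rather than individual cards):

```c++
#include <cstddef>
#include <cstdint>

// Placeholder card values; the real HotSpot names and values differ.
enum CardValue : uint8_t { clean_card = 0, dirty_card = 1, to_cset_card = 2 };

// Hypothetical helper: scans the card's heap range, adds interesting references
// to remembered sets, and reports whether a reference into the collection set
// was found.
bool scan_card_and_update_remsets(size_t card_index);

void refine_card(size_t card_index, uint8_t* refinement_table, uint8_t* card_table) {
  uint8_t value = refinement_table[card_index];
  refinement_table[card_index] = clean_card;   // the refinement table ends up all clean

  if (value == clean_card) {
    return;                                    // nothing to refine
  }
  if (value == to_cset_card) {
    card_table[card_index] = to_cset_card;     // re-mark on the card table, do not re-refine
    return;
  }
  // value == dirty_card: refine it. References into the collection set cause the
  // card to be re-marked as to-collection-set on the card table instead of being
  // added to a remembered set.
  if (scan_card_and_update_remsets(card_index)) {
    card_table[card_index] = to_cset_card;
  }
}
```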

New barrier pseudo-code for an assignment `x.a = y`:

```
// Filtering
if (region(@x.a) == region(y)) goto done;  // same-region check
if (y == null) goto done;                  // null value check
if (card(@x.a) != clean_card) goto done;   // skip already non-clean cards

*card(@x.a) = dirty

done:
```

This is basically the Serial/Parallel GC barrier, with additional filters to keep the number of dirty cards as low as possible.

A few more comments about the barrier:
* the barrier now loads the card table base from a thread-local instead of inlining it as a constant. This is necessary for this mechanism to work, since the card table to dirty changes over time; it may even be faster on some architectures (smaller code), and some architectures already did this before. A sketch of the resulting barrier follows this list.
* all existing pre-filters were kept: benchmarks with them removed showed some significant regressions wrt pause times and even throughput compared to G1 in master. Using the Parallel GC barrier (just the dirty card write) would be possible, and further investigation into stripping parts of the barrier will be done as a follow-up.
* the final check tests for non-clean cards to avoid overwriting existing cards, in particular the "to-collection-set" cards described above.
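
For illustration, a C-like sketch of what the inlined barrier now does, assuming a suitably biased per-thread card table base and the default 512-byte cards (all names are placeholders; the real barrier is emitted by the interpreter and JIT compilers):

```c++
#include <cstdint>

// Placeholder constants and thread structure for this sketch only.
constexpr int     card_shift = 9;   // assuming the default 512-byte cards
constexpr uint8_t clean_card = 0;
constexpr uint8_t dirty_card = 1;

struct ThreadSketch {
  uint8_t* card_table_base;         // updated by the handshake on each table swap
};

bool same_region(const void* from, const void* to);  // hypothetical helper

void post_write_barrier(ThreadSketch* t, void* field_addr, void* new_value) {
  if (same_region(field_addr, new_value)) return;    // same-region check
  if (new_value == nullptr) return;                  // null value check
  uint8_t* card = t->card_table_base +
                  (reinterpret_cast<uintptr_t>(field_addr) >> card_shift);
  if (*card != clean_card) return;                   // skip already non-clean cards
  *card = dirty_card;                                // the only store the barrier does
}
```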

Current G1 marks the cards corresponding to young gen regions all as "young" so that the original barrier can potentially avoid the `StoreLoad`. This implementation removes this facility (it might be re-introduced later); measurements showed that pre-dirtying the young generation regions' cards as "dirty" (G1 does not need an extra "young" value any more) did not yield any measurable performance difference.

### Refinement process

The goal of the refinement (threads) is to make sure that the number of cards to scan in the garbage collection is below a particular threshold.

The prototype changes the refinement threads into a single control thread and a set of (refinement) worker threads. Unlike in the previous implementation, the control thread does not do any refinement itself; it only executes the heuristics to start a calculated number of worker threads and tracks refinement progress.

The refinement trigger is based on the currently known number of pending (i.e. dirty) cards on the card table and a pending card generation rate, fairly similar to the previous algorithm. Once the refinement control thread determines that it is time to do refinement, it starts the following sequence (sketched in code after the list):

1) **Swap the card table**. This consists of several steps:
    1) **Swap the global card table** - the global card table pointer is swapped; newly created threads and runtime calls will eventually use the new values, at the latest after the next two steps.
    2) **Update the pointers in all JavaThreads'** TLS storage to the new card table pointer using a handshake operation
    3) **Update the pointers in the GC threads'** TLS storage to the new card table pointer using the SuspendibleThreadSet mechanism
2) **Snapshot the heap** - determine the extent of work needed for all regions where the refinement threads need to do some work on the refinement table (the previous card table). The snapshot stores the work progress for each region so that work can be interrupted and continued at any time.
    This work either consists of refinement of the particular card (old generation regions) or clearing the cards (next collection set/young generation regions).
3) **Sweep the refinement table** by activating the refinement worker threads. The threads refine dirty cards based on the heap snapshot, with worker threads claiming parts of regions to process.
      * Cards with references to the young generation are not added to the young generation's card-based remembered set. Instead these cards are marked as to-collection-set in the card table and any remaining refinement of that card is skipped.
      * If refinement encounters a card that is already marked as to-collection-set, it is not refined but simply re-marked as to-collection-set on the card table.
      * During refinement, the refinement table is also cleared (in bulk for collection set regions, as they do not need any refinement; card by card in other regions as their non-clean cards are refined).
      * Dirty cards within unparsable heap areas are forwarded to/redirtied on the card table as-is.
4) **Completion work**, mostly statistics.
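
A rough outline of this sequence as the refinement control thread might drive it (illustrative names only; the real driver is `G1ConcurrentRefineThread::run_service`):

```c++
// Illustrative outline of one refinement round; all functions and types below
// are placeholders for this sketch, not the actual HotSpot API.
struct HeapSnapshot {};                                    // per-region work state

void swap_global_card_table_pointer();                     // step 1.1
void update_java_thread_card_table_bases_via_handshake();  // step 1.2
void update_gc_thread_card_table_bases_via_sts();          // step 1.3
HeapSnapshot take_heap_snapshot();                         // step 2
void run_refinement_workers(const HeapSnapshot& snapshot, unsigned num_workers);  // step 3
void record_refinement_statistics(const HeapSnapshot& snapshot);                  // step 4

void refinement_round(unsigned num_workers) {
  // 1) Swap the card table: the global pointer first, then every thread's TLS copy.
  swap_global_card_table_pointer();
  update_java_thread_card_table_bases_via_handshake();
  update_gc_thread_card_table_bases_via_sts();

  // 2) Snapshot the heap so sweeping can be interrupted at a safepoint and resumed.
  HeapSnapshot snapshot = take_heap_snapshot();

  // 3) Sweep the refinement table: workers claim parts of regions and refine or
  //    clear their cards, leaving the refinement table all clean.
  run_refinement_workers(snapshot, num_workers);

  // 4) Completion work, mostly statistics.
  record_refinement_statistics(snapshot);
}
```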

If the work is interrupted by a non-garbage collection synchronization point, work is suspended temporarily and resumed later using the heap snapshot.

After the refinement process the refinement table is all-clean again and ready to be swapped again.

### Garbage collection pause changes

Since a garbage collection (young or full GC) pause may occur at any point during the refinement process, the garbage collection needs to do some compensating work for the not-yet-swept parts of the refinement table.

Note that this situation is very rare, and the heuristics try to avoid that, so in most cases nothing needs to be done as the refinement table is all clean.

If this happens, young collections add a new phase called `Merge Refinement Table` in the garbage collection pause right before the `Merge Heap Roots` phase. This compensating phase does the following:

  0) (Optional) Snapshot the heap if not done yet (i.e. if the refinement process was interrupted between steps 1 and 3)
  1) Merge the refinement table into the card table - in this step the non-clean (dirty and to-collection-set) cards of interesting regions are copied over to the card table (see the sketch after this list)
  2) Completion work (statistics)
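
A minimal sketch of the merge in step 1, assuming byte-sized card entries and invented names (the real code is `merge_refinement_table()` in `g1RemSet.cpp`, which only visits the regions recorded in the snapshot and does so in parallel):

```c++
#include <cstddef>
#include <cstdint>

constexpr uint8_t clean_card = 0;

// Merge any non-clean cards of a partially swept refinement table back onto the
// card table, then leave the refinement table all clean again. Names are
// placeholders for this sketch.
void merge_refinement_table(uint8_t* card_table,
                            uint8_t* refinement_table,
                            size_t num_cards) {
  for (size_t i = 0; i < num_cards; i++) {
    uint8_t v = refinement_table[i];
    if (v != clean_card) {
      card_table[i] = v;                 // dirty and to-collection-set values carry over
      refinement_table[i] = clean_card;  // restore the all-clean post-condition
    }
  }
}
```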

If a full collection interrupts concurrent refinement, the refinement table is simply cleared and all dirty cards thrown away.

A garbage collection generates new cards (e.g. for references from promoted objects into the young generation) on the refinement table. This replaces the extra DCQS that the previous implementation used to record these interesting references/cards and later redirty the card table with them. G1 swaps the card tables at the end of the collection so that the post-condition holds that the refinement table is all clean (and any to-be-refined cards are on the card table) at the end of garbage collection.

### Performance metrics

Following is an overview of the changes in behavior. Some numbers are provided in the CR in the first comment.

#### Native memory usage

The refinement table takes an additional ~0.2% of the Java heap size in native memory compared to JDK 21 and above (one card table byte per 512-byte card with the default card size, i.e. roughly 1/512 of the heap; in JDK 21 we removed one card-table-sized data structure, so this is a non-issue when updating from releases before that).

Some of that additional memory usage is automatically reclaimed by the removal of the dirty card queues. More memory is reclaimed because cards containing to-collection-set references are now managed on the card table: the explicit remembered sets for the young generation are dropped completely, as are any remembered set entries that would otherwise be duplicated into other regions' remembered sets.

In some applications/benchmarks these gains completely offset the additional card table; most of the time, however, this is not the case, particularly for throughput applications.
It would also be possible to allocate the refinement table lazily: since these applications often do not need any concurrent refinement, there would then be no overhead at all but actually a net reduction of native memory usage. This is not implemented in this prototype.

#### Latency ("Pause times")

Not affected or slightly better. Pause times decrease due to a shorter "Merge remembered sets" phase, because no work is required for the young generation's remembered sets any more - they are always already on the card table!

Merging of the refinement table into the card table is extremely fast and, in my measurements, always faster than merging the remembered sets for the young gen used to be. Since this work linearly scans some memory, it is embarrassingly parallel too.

The cards created during garbage collection do not need to be redirtied, so that phase has also been removed.

The card table swap is based on predictions of the mutator card dirtying rate and the refinement rate, as before, and the policy is actually fairly similar to the previous one. It is still rather aggressive, but in most cases uses fewer CPU resources than before, mostly because refining takes less CPU time. As before, many applications do not do any refinement at all. More investigation could be done to improve this in the future.

#### Throughput

This change always increases throughput in my measurements, although depending on the benchmark/application it may not actually show up in scores.

Due to the pre-barrier and the additional filters in the barrier, G1 is still slower than Parallel GC on raw throughput benchmarks, but typically ends up somewhere half-way to Parallel GC or closer.

### Platform support

Since the post-write barrier changed, additional work on some platforms is required to allow this change to proceed. At this time the work for all platforms is done, but needs testing:

- GraalVM (contributed by the GraalVM team)
- S390 (contributed by A. Kumar from IBM)
- PPC (contributed by M. Doerr, from SAP)
- ARM (should work, HelloWorld compiles and runs)
- RISCV (should work, HelloWorld compiles and runs)
- x86 (should work, build/HelloWorld compiles and runs)

None of the above-mentioned platforms implements the barrier stub that writes cards for a reference array (aarch64 and x64 are fully implemented); they call into the runtime as before. I believe this is now fairly easy to do with the simplified barrier for some extra performance, but it is not necessary.
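
For reference, a sketch of what such a reference-array barrier amounts to under the new scheme (hypothetical names; not the actual stub code): it simply mirrors the scalar barrier's conditional card mark over the written element range:

```c++
#include <cstdint>

constexpr int     card_shift = 9;   // assuming the default 512-byte cards
constexpr uint8_t clean_card = 0;
constexpr uint8_t dirty_card = 1;

// Dirty every card covered by the written element range [start, end), skipping
// cards that are already non-clean (e.g. to-collection-set). "end" is exclusive
// and assumed to be greater than "start". All names are placeholders.
void array_post_write_barrier(uint8_t* card_table_base, void* start, void* end) {
  uintptr_t first = reinterpret_cast<uintptr_t>(start) >> card_shift;
  uintptr_t last  = (reinterpret_cast<uintptr_t>(end) - 1) >> card_shift;
  for (uintptr_t c = first; c <= last; c++) {
    if (card_table_base[c] == clean_card) {  // keep to-collection-set cards intact
      card_table_base[c] = dirty_card;
    }
  }
}
```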

### Alternatives

The JEP text extensively discusses alternatives.

### Reviewing

The change can be roughly divided into these fairly isolated parts:
* platform specific changes to the barrier
* refinement and refinement control thread changes; this is best reviewed starting from the `G1ConcurrentRefineThread::run_service` method
* changes to garbage collection: `merge_refinement_table()` in `g1RemSet.cpp`
* policy modifications are typically related to code around the calls to `G1Policy::record_dirtying_stats`.

Further information is available in the [JEP draft](https://bugs.openjdk.org/browse/JDK-8340827); there is also a somewhat more extensive discussion of the change on my [blog](https://tschatzl.github.io/2025/02/21/new-write-barriers.html).

Some additional comments:
* the pre-marking of young generation cards has been removed. Benchmarks did not show any significant difference either way. To me this makes some sense because the entire young gen will quickly get marked anyway, i.e. one only saves a single additional card table write per card. With the old barrier the cost of a card table mark was much higher.
* G1 sets `UseCondCardMark` to true by default. The conditional card mark corresponds to the third filter in the write barrier now, and since I decided to keep all filters for this change, it makes sense to directly use this mechanism.

If there are any questions, feel free to ask.

Testing: tier1-7 (multiple tier1-7, tier1-8 with slightly older versions)

Thanks,
  Thomas

-------------

Commit messages:
 - * only provide byte map base for JavaThreads
 - * mdoerr review: fix comments in ppc code
 - * fix crash when writing dirty cards for memory regions during card table switching
 - * remove mention of "enqueue" or "enqueuing" for actions related to post barrier
 - * remove some commented out debug code
 - Card table as DCQ

Changes: https://git.openjdk.org/jdk/pull/23739/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8342382
  Stats: 6543 lines in 103 files changed: 2162 ins; 3461 del; 920 mod
  Patch: https://git.openjdk.org/jdk/pull/23739.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739

PR: https://git.openjdk.org/jdk/pull/23739

