RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5]

Mon Mar 3 15:22:10 UTC 2025

On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>> 
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:
> 
>   * fix comment (trailing whitespace)
>   * another assert when snapshotting at a safepoint.

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 106:

> 104: 
> 105:   __ testptr(count, count);
> 106:   __ jcc(Assembler::equal, done);

I wonder if we can use "zero" instead of "equal" here; they have the same underlying value, but the semantic is to checking for "zero".

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 133:

> 131:   Label is_clean_card;
> 132:   __ cmpb(Address(addr, 0), G1CardTable::clean_card_val());
> 133:   __ jcc(Assembler::equal, is_clean_card);

Should this checking be guarded by `if (UseCondCardMark)`? I see that aarch64 does that.

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143:

> 141: 
> 142:   __ bind(is_clean_card);
> 143:   // Card was not clean. Dirty card and go to next..

Why "not clean"? I thought this path is for dirtying clean card?

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 323:

> 321:   assert(thread == r15_thread, "must be");
> 322: #endif // _LP64
> 323:   assert_different_registers(store_addr, new_val, thread, tmp1 /*, tmp2 unused */, noreg);

Seems that `tmp2` is unused in this method. It is used in aarch64, but it's not obvious to me whether that is indeed necessary. If so, can you add a comment saying sth like "this unused var is needed for other archs..."?

src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54:

> 52: // result = 0xBBAABBAA
> 53: inline size_t blend(size_t a, size_t b, size_t mask) {
> 54:   return a ^ ((a ^ b) & mask);

The example makes it much clearer; I wonder if `return (a & ~mask) | (b & mask);` is more readable.

src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 59:

> 57: 
> 58: void G1CardTableClaimTable::reset_all_claims_to_claimed() {
> 59:   for (size_t i = 0; i < _max_reserved_regions; i++) {

`uint` for `i`?

src/hotspot/share/gc/g1/g1CardTableClaimTable.hpp line 64:

> 62:   void reset_all_claims_to_unclaimed();
> 63:   void reset_all_claims_to_claimed();
> 64: 

I wonder if these two APIs can be renamed to "reset_all_to_x", which is more aligned with its single-region counterpart, `reset_to_unclaimed`, IMO.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 348:

> 346: void G1ConcurrentRefineWorkState::snapshot_heap_into(G1CardTableClaimTable* sweep_table) {
> 347:   // G1CollectedHeap::heap_region_iterate() below will only visit committed regions. Initialize
> 348:   // all entries in the state table here to not require special handling when iterating over it.

Can you elaborate on what the "special handling" would be, if we don's set "claimed" for non-committed regions?

src/hotspot/share/gc/g1/g1RemSet.cpp line 837:

> 835:         for (; refinement_cur_card < refinement_end_card; ++refinement_cur_card, ++card_cur_word) {
> 836:           size_t value = *refinement_cur_card;
> 837:           *refinement_cur_card = G1CardTable::WordAllClean;

Similarly, this is a "word", not "card", also.

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857:

> 855:     // We do not expect too many non-Java threads compared to Java threads, so just
> 856:     // let one worker claim that work.
> 857:     if (!_non_java_threads_claim && !Atomic::cmpxchg(&_non_java_threads_claim, false, true, memory_order_relaxed)) {

Do non-java threads have card-table-base?

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 862:

> 860: 
> 861:     class ResizeAndSwapCardTableClosure : public ThreadClosure {
> 862:     SwapCardTableClosure _cl;

Field indentation.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977586579
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977594184
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977583002
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977601907
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977645576
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977571306
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977573354
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977704351
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977575441
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977701293
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977679688