RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30]

Wed Apr 9 12:03:49 UTC 2025

On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>> 
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c...
>
> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
> 
>  - * missing file from merge
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq
>  - * make young gen length revising independent of refinement thread
>      * use a service task
>      * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update
>  - * fix IR code generation tests that change due to barrier cost changes
>  - * factor out card table and refinement table merging into a single
>      method
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq3
>  - * obsolete G1UpdateBufferSize
>    
>    G1UpdateBufferSize has previously been used to size the refinement
>    buffers and impose a minimum limit on the number of cards per thread
>    that need to be pending before refinement starts.
>    
>    The former function is now obsolete with the removal of the dirty
>    card queues, the latter functionality has been taken over by the new
>    diagnostic option `G1PerThreadPendingCardThreshold`.
>    
>    I prefer to make this a diagnostic option is better than a product option
>    because it is something that is only necessary for some test cases to
>    produce some otherwise unwanted behavior (continuous refinement).
>    
>    CSR is pending.
>  - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f

Hi Thomas, great simplification and encouraging results! I reviewed the compiler-related parts of the changeset, including x64 and aarch64 changes.

src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 246:

> 244:     __ cbz(new_val, done);
> 245:   }
> 246:   // Storing region crossing non-null, is card young?

Suggestion:

  // Storing region crossing non-null.

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101:

> 99: }
> 100: 
> 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators,

Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.?

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 145:

> 143: 
> 144:   __ bind(is_clean_card);
> 145:   // Card was clean. Dirty card and go to next..

This code seems unreachable if `!UseCondCardMark`, meaning we only dirty cards here if `UseCondCardMark` is enabled. Is that intentional?

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 319:

> 317:                                             const Register thread,
> 318:                                             const Register tmp1,
> 319:                                             const Register tmp2,

Since `tmp2` is not needed in the x64 post-barrier, I suggest not passing it around for this platform, for simplicity and also to make optimization opportunities more visible in the future. Here is my suggestion: https://github.com/robcasloz/jdk/commit/855ec8df4a641f8c491c5c09acea3ee434b7e230, feel free to merge if you agree.

src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 38:

> 36: #include "c1/c1_LIRAssembler.hpp"
> 37: #include "c1/c1_MacroAssembler.hpp"
> 38: #endif // COMPILER1

I suggest removing the conditional compilation directives and grouping these includes together with the above `c1` ones.

src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 147:

> 145:     state->do_input(_thread);
> 146: 
> 147:     // Use temp registers to ensure these they use different registers.

Suggestion:

    // Use temps to enforce different registers.

src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 307:

> 305:              + 6  // same region check: Uncompress (new_val) oop, xor, shr, (cmp), jmp
> 306:              + 4  // new_val is null check
> 307:              + 4; // card not clean check.

It probably does not affect the unrolling heuristics too much, but you may want to make the last cost component conditional on `UseCondCardMark`.

src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 396:

> 394:   bool needs_liveness_data(const MachNode* mach) const {
> 395:     return G1BarrierStubC2::needs_pre_barrier(mach) ||
> 396:            G1BarrierStubC2::needs_post_barrier(mach);

Suggestion:

    // Liveness data is only required to compute registers that must be
    // preserved across the runtime call in the pre-barrier stub.
    return G1BarrierStubC2::needs_pre_barrier(mach);

src/hotspot/share/gc/g1/g1BarrierSet.hpp line 56:

> 54: //
> 55: // The refinement threads mark cards in the current collection set specially on the
> 56: // card table - this is fine wrt to synchronization with the mutator, because at

Suggestion:

// card table - this is fine wrt synchronization with the mutator, because at

test/hotspot/jtreg/compiler/gcbarriers/TestG1BarrierGeneration.java line 521:

> 519:         phase = CompilePhase.FINAL_CODE)
> 520:     @IR(counts = {IRNode.COUNTED_LOOP, "2"},
> 521:         phase = CompilePhase.FINAL_CODE)

I suggest to remove this extra IR check to avoid over-specifying the expected loop shape. For example, running this test with loop unrolling disabled (`-XX:LoopUnrollLimit=0`) would now fail because only one counted loop would be found.

-------------

Changes requested by rcastanedalo (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2753154117
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035174209
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035175921
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035177738
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035183250
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035186980
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035192666
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035210464
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035196251
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035198219
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035201056