RFR: 8342382: Implement JEP 522: G1 GC: Improve Throughput by Reducing Synchronization [v58]
    Ivan Walulya 
    iwalulya at openjdk.org
       
    Wed Sep 10 14:54:29 UTC 2025
    
    
  
On Wed, 10 Sep 2025 12:40:11 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:
>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier.
>> 
>> The main reason for the current barrier is how G1 implements concurrent refinement:
>> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
>> * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
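
To make the dcq/dcqs relationship in the list above concrete, here is a minimal sketch of a thread-local buffer that hands itself over to a shared set once it fills up. `ToyDirtyCardQueue`, `ToyDirtyCardQueueSet`, `CardPtr` and the capacity parameter are hypothetical stand-ins chosen for illustration; they are not HotSpot's actual classes, and a mutex is used only to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not HotSpot's classes.
using CardPtr = std::uint8_t*;

// Shared set of completed buffers (the "dcqs" in the text above).
class ToyDirtyCardQueueSet {
  std::mutex _lock;
  std::vector<std::vector<CardPtr>> _completed;

public:
  // Called by a mutator thread when its local buffer is full.
  void add_completed(std::vector<CardPtr>&& buf) {
    std::lock_guard<std::mutex> guard(_lock);
    _completed.push_back(std::move(buf));
  }

  // Called by a refinement thread; returns false if nothing is pending.
  bool take_completed(std::vector<CardPtr>& out) {
    std::lock_guard<std::mutex> guard(_lock);
    if (_completed.empty()) return false;
    out = std::move(_completed.back());
    _completed.pop_back();
    return true;
  }

  std::size_t num_completed() {
    std::lock_guard<std::mutex> guard(_lock);
    return _completed.size();
  }
};

// Thread-local buffer of dirtied card locations (the "dcq" in the text above).
class ToyDirtyCardQueue {
  ToyDirtyCardQueueSet& _set;
  std::size_t _capacity;
  std::vector<CardPtr> _buf;

public:
  ToyDirtyCardQueue(ToyDirtyCardQueueSet& set, std::size_t capacity)
      : _set(set), _capacity(capacity) {
    assert(capacity > 0);
    _buf.reserve(capacity);
  }

  // What the barrier's "card tracking" part does: record the card location
  // and hand the whole buffer to the shared set once it is full.
  void enqueue(CardPtr card) {
    _buf.push_back(card);
    if (_buf.size() < _capacity) return;
    _set.add_completed(std::move(_buf));
    _buf.clear();  // a moved-from vector is valid but unspecified; make it empty
    _buf.reserve(_capacity);
  }

  std::size_t pending() const { return _buf.size(); }
};
```

In this sketch only the hand-over of a full buffer takes a lock; the common-case `enqueue` touches thread-local state only, which is the reason for buffering card locations per thread in the first place.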
>> 
>> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>> 
>> ```
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> ```
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC.
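
For readers who want to step through the quoted barrier, the following is a minimal single-threaded sketch of the same filtering and card-tracking sequence, next to an unconditional card mark in the style the description attributes to Parallel GC. Everything here is an assumption made for illustration: `ToyHeap`, the card and region sizes, the card values, and the convention that addresses are byte offsets into the toy heap with offset 0 standing in for null. It is not the code generated by HotSpot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model; constants and names are illustrative, not HotSpot's.
constexpr std::size_t kCardShift   = 9;   // assumed 512-byte cards
constexpr std::size_t kRegionShift = 20;  // assumed 1 MiB regions
enum CardValue : std::uint8_t { kDirty = 0, kYoung = 1, kClean = 2 };

struct ToyHeap {
  std::vector<std::uint8_t> cards;       // one byte per card
  std::vector<std::size_t> local_queue;  // stand-in for the thread-local dcq
  std::size_t queue_capacity;
  std::size_t runtime_calls = 0;         // counts "call runtime" slow paths

  ToyHeap(std::size_t heap_bytes, std::size_t capacity)
      : cards(heap_bytes >> kCardShift, kClean), queue_capacity(capacity) {
    assert(capacity > 0);
  }

  // Post-write barrier for `x.a = y`. field_addr is @x.a, new_val is y
  // (offset 0 means null). Returns true if this call dirtied the card.
  bool g1_style_post_barrier(std::size_t field_addr, std::size_t new_val) {
    // Filtering
    if ((field_addr >> kRegionShift) == (new_val >> kRegionShift)) return false;
    if (new_val == 0) return false;                       // null value check
    std::uint8_t& card = cards[field_addr >> kCardShift];
    if (card == kYoung) return false;                     // write to young gen
    // StoreLoad; has no observable effect in this single-threaded toy.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (card == kDirty) return false;                     // already dirty
    card = kDirty;

    // Card tracking
    local_queue.push_back(field_addr >> kCardShift);
    if (local_queue.size() < queue_capacity) return true;
    ++runtime_calls;      // stand-in for moving the local dcq into the dcqs
    local_queue.clear();
    return true;
  }

  // Unconditional card mark: no filters, no fence, no queue.
  void parallel_style_post_barrier(std::size_t field_addr) {
    cards[field_addr >> kCardShift] = kDirty;
  }
};
```

The contrast in this sketch is the point of the comparison: the second barrier is a shift and a store, while the first carries four conditional exits, a fence, and a queue with a slow path.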
>> 
>> The large inlined barrier not only increases the code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>> 
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c...
>
> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:
> 
>   * walulyai review
>   * tried to remove "logged card" terminology for the current "pending card" one

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.hpp line 36:

> 34: class G1ConcurrentRefine;
> 35: 
> 36: // Concurrent refinement control thread watching card mark accrual on the card

Suggestion:

// Concurrent refinement control thread watching card mark accrual on the card table

src/hotspot/share/gc/g1/g1GCPhaseTimes.hpp line 182:

> 180:   double _cur_optional_merge_heap_roots_time_ms;
> 181:   // Included in above merge and optional-merge time.
> 182:   double _cur_distribute_log_buffers_time_ms;

No longer used.

src/hotspot/share/gc/g1/g1HeapRegion.hpp line 41:

> 39: class G1CardSet;
> 40: class G1CardSetConfiguration;
> 41: class G1CardTable;

Do we need the forward declaration here?

-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2336962845
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2336992289
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2337004685
    
    