RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30]

Wed Apr 9 12:50:42 UTC 2025

On Wed, 9 Apr 2025 11:34:09 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:

>> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>> 
>>  - * missing file from merge
>>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>>  - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq
>>  - * make young gen length revising independent of refinement thread
>>      * use a service task
>>      * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update
>>  - * fix IR code generation tests that change due to barrier cost changes
>>  - * factor out card table and refinement table merging into a single
>>      method
>>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq3
>>  - * obsolete G1UpdateBufferSize
>>    
>>    G1UpdateBufferSize has previously been used to size the refinement
>>    buffers and impose a minimum limit on the number of cards per thread
>>    that need to be pending before refinement starts.
>>    
>>    The former function is now obsolete with the removal of the dirty
>>    card queues, the latter functionality has been taken over by the new
>>    diagnostic option `G1PerThreadPendingCardThreshold`.
>>    
>>    I think making this a diagnostic option is better than a product option
>>    because it is something that is only necessary for some test cases to
>>    produce some otherwise unwanted behavior (continuous refinement).
>>    
>>    CSR is pending.
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f
>
> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101:
> 
>> 99: }
>> 100: 
>> 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators,
> 
> Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.?

I remember a significant impact in some microbenchmark. The barrier is also inlined in Parallel GC. I do not consider it a big issue with respect to maintenance: these things never really change, and the method is small and contained.
I will try to redo the numbers.
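
For context on what is being inlined: a reference-array post barrier conceptually marks every card that covers the written range. The C++ sketch below only illustrates that idea; it is not the assembly in this PR. The function name, the 512-byte card size, and the clean/dirty card values are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for illustration only; not taken from the PR.
constexpr unsigned kCardShift = 9;     // assumes 512-byte cards
constexpr std::uint8_t kCleanCard = 0xff;
constexpr std::uint8_t kDirtyCard = 0x00;

// Marks every card overlapping [start, start + byte_count) as dirty and
// returns how many cards changed state. `heap_base` is the address that
// maps to card index 0; the caller must ensure the range lies within the
// area covered by `card_table`.
inline std::size_t dirty_cards_for_range(std::vector<std::uint8_t>& card_table,
                                         std::uintptr_t heap_base,
                                         std::uintptr_t start,
                                         std::size_t byte_count) {
  if (byte_count == 0) {
    return 0;  // nothing was written, so no card needs marking
  }
  const std::size_t first = (start - heap_base) >> kCardShift;
  const std::size_t last = (start + byte_count - 1 - heap_base) >> kCardShift;
  std::size_t newly_dirtied = 0;
  for (std::size_t i = first; i <= last; ++i) {
    if (card_table[i] != kDirtyCard) {  // skip cards that are already dirty
      card_table[i] = kDirtyCard;
      ++newly_dirtied;
    }
  }
  return newly_dirtied;
}
```

The loop is short, which is the shape that makes it a candidate for inlining instead of a runtime call; whether that pays off is the measurement question raised above.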

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035298557