RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30]

Thu Apr 10 11:01:42 UTC 2025

On Wed, 9 Apr 2025 12:48:10 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101:
>> 
>>> 99: }
>>> 100: 
>>> 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators,
>> 
>> Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.?
>
> I remember a significant impact in some microbenchmarks. It is also inlined in Parallel GC. I do not consider maintenance a big issue: these things never really change, and the method is small and contained.
> I will try to redo numbers.
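For readers unfamiliar with the method under discussion: in a card-marking collector, an array post barrier dirties every card table entry that covers the destination range of a reference array copy, so that the collector later rescans those slots. The C++ sketch below is a simplified, hypothetical model of such a card-marking loop, not the code in this PR. The 512-byte card size (shift of 9), the dirty value 0 and clean value 0xff are assumptions modeled on HotSpot's card table defaults, and the names `CardTableModel` and `write_ref_array_post_barrier` are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model, not HotSpot code. Assumed conventions:
// 512-byte cards (shift 9), dirty == 0x00, clean == 0xff.
constexpr unsigned kCardShift = 9;
constexpr std::uint8_t kCleanCard = 0xff;
constexpr std::uint8_t kDirtyCard = 0x00;

struct CardTableModel {
  std::uintptr_t heap_base;         // first heap address covered by card 0
  std::vector<std::uint8_t> cards;  // one byte per card, initially clean

  CardTableModel(std::uintptr_t base, std::size_t heap_bytes)
      : heap_base(base), cards((heap_bytes >> kCardShift) + 1, kCleanCard) {}

  // Map a heap address to the index of the card that covers it.
  std::size_t card_index(std::uintptr_t addr) const {
    return (addr - heap_base) >> kCardShift;
  }
};

// Dirty every card covering the reference slots [start, start + count * ref_size).
// ref_size is the size of one reference slot (e.g. 4 with compressed oops, 8 without).
void write_ref_array_post_barrier(CardTableModel& ct, std::uintptr_t start,
                                  std::size_t count, std::size_t ref_size) {
  if (count == 0) return;  // nothing was copied, so nothing to mark
  const std::uintptr_t last = start + count * ref_size - 1;  // last byte written
  for (std::size_t i = ct.card_index(start), end = ct.card_index(last); i <= end; ++i) {
    ct.cards[i] = kDirtyCard;  // one byte store per card
  }
}
```

In this model the work is one shift per bound plus a byte store per card, which illustrates why such a stub can stay small and self-contained, while a runtime call wraps essentially the same loop in call setup and teardown.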

From our microbenchmarks (for thrpt results in ops/ms, higher is better; for avgt results in ns/op, lower is better):

Current code (inlined post barrier):

Benchmark                                    (size)   Mode  Cnt       Score      Error   Units
ArrayCopyObject.conjoint_micro                   31  thrpt   15  166136.959 ± 5517.157  ops/ms
ArrayCopyObject.conjoint_micro                   63  thrpt   15  108880.108 ± 4331.112  ops/ms
ArrayCopyObject.conjoint_micro                  127  thrpt   15   93159.977 ± 5025.458  ops/ms
ArrayCopyObject.conjoint_micro                 2047  thrpt   15   17234.842 ±  831.344  ops/ms
ArrayCopyObject.conjoint_micro                 4095  thrpt   15    9202.216 ±  292.612  ops/ms
ArrayCopyObject.conjoint_micro                 8191  thrpt   15    3565.705 ±  121.116  ops/ms
ArrayCopyObject.disjoint_micro                   31  thrpt   15  159106.245 ± 5965.576  ops/ms
ArrayCopyObject.disjoint_micro                   63  thrpt   15   95475.658 ± 5415.267  ops/ms
ArrayCopyObject.disjoint_micro                  127  thrpt   15   84249.979 ± 6313.007  ops/ms
ArrayCopyObject.disjoint_micro                 2047  thrpt   15   10682.650 ±  381.832  ops/ms
ArrayCopyObject.disjoint_micro                 4095  thrpt   15    4471.940 ±  216.439  ops/ms
ArrayCopyObject.disjoint_micro                 8191  thrpt   15    1378.296 ±   33.421  ops/ms
ArrayCopy.arrayCopyObject                       N/A   avgt   15      13.880 ±    0.517   ns/op
ArrayCopy.arrayCopyObjectNonConst               N/A   avgt   15      14.844 ±    0.751   ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward     N/A   avgt   15      11.080 ±    0.703   ns/op
ArrayCopy.arrayCopyObjectSameArraysForward      N/A   avgt   15      11.003 ±    0.135   ns/op

Runtime call (previous approach):

Benchmark                                    (size)   Mode  Cnt      Score       Error   Units
ArrayCopyObject.conjoint_micro                   31  thrpt   15  73100.230 ± 11079.381  ops/ms
ArrayCopyObject.conjoint_micro                   63  thrpt   15  65039.431 ±  1996.832  ops/ms
ArrayCopyObject.conjoint_micro                  127  thrpt   15  58336.711 ±  2260.660  ops/ms
ArrayCopyObject.conjoint_micro                 2047  thrpt   15  17035.419 ±   524.445  ops/ms
ArrayCopyObject.conjoint_micro                 4095  thrpt   15   9207.661 ±   286.526  ops/ms
ArrayCopyObject.conjoint_micro                 8191  thrpt   15   3264.491 ±    73.848  ops/ms
ArrayCopyObject.disjoint_micro                   31  thrpt   15  84587.219 ±  3007.310  ops/ms
ArrayCopyObject.disjoint_micro                   63  thrpt   15  62815.254 ±  1214.310  ops/ms
ArrayCopyObject.disjoint_micro                  127  thrpt   15  58423.470 ±   285.670  ops/ms
ArrayCopyObject.disjoint_micro                 2047  thrpt   15  10720.462 ±   617.173  ops/ms
ArrayCopyObject.disjoint_micro                 4095  thrpt   15   4178.195 ±   178.942  ops/ms
ArrayCopyObject.disjoint_micro                 8191  thrpt   15   1374.268 ±    44.290  ops/ms
ArrayCopy.arrayCopyObject                       N/A   avgt   15     19.667 ±     0.740   ns/op
ArrayCopy.arrayCopyObjectNonConst               N/A   avgt   15     21.243 ±     1.891   ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward     N/A   avgt   15     16.645 ±     0.504   ns/op
ArrayCopy.arrayCopyObjectSameArraysForward      N/A   avgt   15     17.409 ±     0.705   ns/op

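For small arrays (31 to 127 elements) the inlined code achieves roughly 1.4x to 2.3x the throughput of the runtime call, and the ArrayCopy benchmarks take about 5.6 to 6.4 ns less per operation. With larger arrays the impact diminishes as the copy itself dominates: from 2047 elements on, most differences are within the error margins, although conjoint_micro at 8191 elements still shows about 9% higher throughput. I think the inlined code is worth the effort in this case.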

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2037086410