RFR: 8311163: Parallel: Improve large object handling during evacuation
Kim Barrett
kbarrett at openjdk.org
Thu Aug 29 06:00:18 UTC 2024
On Tue, 27 Aug 2024 10:40:59 GMT, Albert Mingkun Yang <ayang at openjdk.org> wrote:
> Using `-Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=6 DelayInducer.java`, I can see some regression (~10%) in avg gc-pause time on my box (nproc=12). Have you seen sth similar?
No, I'm not seeing anything like that. Indeed, I'm seeing strongly the
opposite. Note that my machine has nproc=32, but the results were similar with
-XX:ActiveProcessorCount=12.
old
[1.966s][info][gc] GC(0) Pause Young (Allocation Failure) 504M->387M(1930M) 744.238ms
[2.446s][info][gc] GC(1) Pause Young (Allocation Failure) 891M->882M(2283M) 236.505ms
[7.733s][info][gc] GC(2) Pause Young (Allocation Failure) 1686M->1669M(2668M) 1678.807ms
[9.423s][info][gc] GC(3) Pause Full (Allocation Failure) 2523M->89M(2421M) 45.843ms
new
[1.694s][info][gc] GC(0) Pause Young (Allocation Failure) 504M->397M(1930M) 469.406ms
[2.140s][info][gc] GC(1) Pause Young (Allocation Failure) 901M->892M(2283M) 224.942ms
[6.533s][info][gc] GC(2) Pause Young (Allocation Failure) 1682M->1669M(2668M) 761.521ms
[8.253s][info][gc] GC(3) Pause Full (Allocation Failure) 2509M->89M(2408M) 47.130ms
These results are pretty stable across 10-12 runs. Regardless of which variant
is used, the number and types of GCs are stable, as are the collection amounts.
And for a given variant, the times are fairly stable.
For GC(0), the new variant is consistently faster, though the fastest old times
and the slowest new times are close enough that the ranges nearly overlap.
For GC(1), the two variants are pretty similar.
For GC(2), the new variant is consistently more than a factor of 2 faster, and
the average difference is closer to 2.5x.
Digging a bit deeper, here are sample times for the Scavenge phase.
old
[7.849s][debug][gc,phases ] GC(2) Scavenge 1928.903ms
new
[6.511s][debug][gc,phases ] GC(2) Scavenge 774.372ms
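For a rough sense of scale, the speedups implied by the sample log lines above can be computed directly. Note these are the two single samples quoted here, not the averages across the 10-12 runs:

```java
// Quick check of the speedups implied by the sample log lines above.
// These are single samples, not averages across runs.
public class Ratios {
    public static void main(String[] args) {
        double gc2Old = 1678.807, gc2New = 761.521;   // GC(2) pause times, ms
        double scavOld = 1928.903, scavNew = 774.372; // GC(2) Scavenge phase, ms

        System.out.printf("GC(2) pause speedup:    %.2fx%n", gc2Old / gc2New);   // ~2.20x
        System.out.printf("GC(2) Scavenge speedup: %.2fx%n", scavOld / scavNew); // ~2.49x
    }
}
```

The sample pause speedup (~2.2x) is a bit below the ~2.5x average mentioned above, and nearly all of it is accounted for by the Scavenge phase.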
I haven't found a way to get more detailed timing information there. In
particular, I thought there was some logging in the TaskTerminator, but I don't
see any now.
I rebuilt the variants with -DTASKQUEUE_STATS and ran with gc+task+stats=trace
logging, and here's a sample output:
old
[6.610s][trace][gc,task,stats] GC(2) thr       push      steal    chunked     chunks
[6.610s][trace][gc,task,stats] GC(2) --- ---------- ---------- ---------- ----------
[6.610s][trace][gc,task,stats] GC(2)   0     306340          0         19     306335
[6.610s][trace][gc,task,stats] GC(2)   1     300040          2          7     300041
[6.610s][trace][gc,task,stats] GC(2)   2     303588          1          0     303589
[6.610s][trace][gc,task,stats] GC(2)   3     334770          3          0     334771
[6.610s][trace][gc,task,stats] GC(2)   4     332980          3          0     332981
[6.610s][trace][gc,task,stats] GC(2)   5     436049          1          0     436050
new
[6.657s][trace][gc,task,stats] GC(2)     ----partial array----     arrays      array
[6.657s][trace][gc,task,stats] GC(2) thr       push      steal    chunked     chunks
[6.658s][trace][gc,task,stats] GC(2) --- ---------- ---------- ---------- ----------
[6.658s][trace][gc,task,stats] GC(2)   0     315502          2         18     315503
[6.658s][trace][gc,task,stats] GC(2)   1     314152          1          0     314153
[6.658s][trace][gc,task,stats] GC(2)   2     321092          2          7     321093
[6.658s][trace][gc,task,stats] GC(2)   3     406893          4          0     406893
[6.658s][trace][gc,task,stats] GC(2)   4     326710          1          0     326707
[6.658s][trace][gc,task,stats] GC(2)   5     329418          1          1     329418
The number of partial array steals is about the same.
So I don't yet know why the change is so beneficial for GC(2).
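For readers less familiar with these counters, here is a minimal, hypothetical Java sketch of the idea they track: a large object array is not scanned as one task, but is split into fixed-size chunks pushed as separate tasks so that idle workers can steal them. The class, names, and chunk size below are illustrative only, not HotSpot's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of partial-array chunking: the "chunked" column
// above counts arrays that were split, and "chunks" counts the chunk
// tasks pushed. Names and chunk size here are made up for illustration.
public class PartialArraySketch {
    static final int CHUNK_SIZE = 512; // illustrative chunk size, in elements

    // A chunk task is an index range into one array.
    record Task(int from, int to) {}

    public static void main(String[] args) {
        int arrayLength = 10_000;                   // one "large" object array
        Deque<Task> taskQueue = new ArrayDeque<>(); // stand-in for a worker's task queue

        int chunksPushed = 0;
        for (int from = 0; from < arrayLength; from += CHUNK_SIZE) {
            int to = Math.min(from + CHUNK_SIZE, arrayLength);
            taskQueue.push(new Task(from, to)); // each chunk is separately stealable
            chunksPushed++;
        }
        // ceil(10000 / 512) = 20 chunk tasks for this one "chunked" array
        System.out.println("chunked arrays: 1, chunks pushed: " + chunksPushed);
    }
}
```

In the stats above, "push" and "chunks" are nearly equal in both variants, and steals are rare, which is why the source of the speedup isn't obvious from these counters alone.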
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20720#issuecomment-2316773823
More information about the hotspot-gc-dev mailing list