RFR: 8311163: Parallel: Improve large object handling during evacuation

Kim Barrett kbarrett at openjdk.org
Thu Aug 29 06:00:18 UTC 2024


On Tue, 27 Aug 2024 10:40:59 GMT, Albert Mingkun Yang <ayang at openjdk.org> wrote:

> Using `-Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=6 DelayInducer.java`, I can see some regression (~10%) in avg gc-pause time on my box (nproc=12). Have you seen sth similar?

No, I'm not seeing anything like that. Indeed, I'm seeing quite the
opposite. Note that my machine has nproc=32, but the results were similar with
-XX:ActiveProcessorCount=12.
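
For reference, a run producing this kind of output would look something like
the following; the -Xlog selectors are my assumption, chosen to match the
pause and phase lines quoted in this message, and DelayInducer.java is the
test program named in the quoted comment:

```shell
# Sketch of a reproduction command line (logging flags partly assumed).
# -Xlog:gc prints the per-pause summary lines; gc+phases=debug adds the
# per-phase (e.g. Scavenge) timings.
java -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=6 \
     -XX:ActiveProcessorCount=12 \
     -Xlog:gc,gc+phases=debug \
     DelayInducer.java
```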


old
[1.966s][info][gc] GC(0) Pause Young (Allocation Failure) 504M->387M(1930M) 744.238ms
[2.446s][info][gc] GC(1) Pause Young (Allocation Failure) 891M->882M(2283M) 236.505ms
[7.733s][info][gc] GC(2) Pause Young (Allocation Failure) 1686M->1669M(2668M) 1678.807ms
[9.423s][info][gc] GC(3) Pause Full (Allocation Failure) 2523M->89M(2421M) 45.843ms

new
[1.694s][info][gc] GC(0) Pause Young (Allocation Failure) 504M->397M(1930M) 469.406ms
[2.140s][info][gc] GC(1) Pause Young (Allocation Failure) 901M->892M(2283M) 224.942ms
[6.533s][info][gc] GC(2) Pause Young (Allocation Failure) 1682M->1669M(2668M) 761.521ms
[8.253s][info][gc] GC(3) Pause Full (Allocation Failure) 2509M->89M(2408M) 47.130ms


These results are pretty stable across 10-12 runs. Regardless of which variant
is used, the number and types of GCs are stable, as are the amounts collected.
And for a given variant the times are fairly stable.

For GC(0), the new variant is consistently faster, though the fastest old
times and the slowest new times are close, or even overlap.

For GC(1), the two variants are pretty similar.

For GC(2), the new variant is consistently more than a factor of 2 faster, and
the average difference is closer to 2.5x.

Digging a bit deeper, here are sample times for the Scavenge phase.

old
[7.849s][debug][gc,phases      ] GC(2) Scavenge 1928.903ms
new
[6.511s][debug][gc,phases      ] GC(2) Scavenge 774.372ms
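
The ratio can be checked mechanically; here's a minimal Python sketch with the
two Scavenge lines transcribed from above (the `pause_ms` helper is just for
illustration, not part of any JDK tooling):

```python
import re

# Scavenge phase lines transcribed from the samples above.
old = "[7.849s][debug][gc,phases      ] GC(2) Scavenge 1928.903ms"
new = "[6.511s][debug][gc,phases      ] GC(2) Scavenge 774.372ms"

def pause_ms(line):
    # Extract the trailing duration in milliseconds from a -Xlog line.
    return float(re.search(r"(\d+\.\d+)ms$", line).group(1))

ratio = pause_ms(old) / pause_ms(new)
print(f"Scavenge speedup: {ratio:.2f}x")  # ~2.49x for this pair of samples
```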


I haven't found a way to get more detailed timing information there.  In
particular, I thought there was some logging in the TaskTerminator, but don't
see any now.

I rebuilt the variants with -DTASKQUEUE_STATS and ran with gc+task+stats=trace
logging, and here's a sample output:


old
[6.610s][trace][gc,task,stats] GC(2) thr       push      steal    chunked     chunks
[6.610s][trace][gc,task,stats] GC(2) --- ---------- ---------- ---------- ----------
[6.610s][trace][gc,task,stats] GC(2)   0     306340          0         19     306335
[6.610s][trace][gc,task,stats] GC(2)   1     300040          2          7     300041
[6.610s][trace][gc,task,stats] GC(2)   2     303588          1          0     303589
[6.610s][trace][gc,task,stats] GC(2)   3     334770          3          0     334771
[6.610s][trace][gc,task,stats] GC(2)   4     332980          3          0     332981
[6.610s][trace][gc,task,stats] GC(2)   5     436049          1          0     436050

new
[6.657s][trace][gc,task,stats] GC(2)     ----partial array----     arrays      array
[6.657s][trace][gc,task,stats] GC(2) thr       push      steal    chunked     chunks
[6.658s][trace][gc,task,stats] GC(2) --- ---------- ---------- ---------- ----------
[6.658s][trace][gc,task,stats] GC(2)   0     315502          2         18     315503
[6.658s][trace][gc,task,stats] GC(2)   1     314152          1          0     314153
[6.658s][trace][gc,task,stats] GC(2)   2     321092          2          7     321093
[6.658s][trace][gc,task,stats] GC(2)   3     406893          4          0     406893
[6.658s][trace][gc,task,stats] GC(2)   4     326710          1          0     326707
[6.658s][trace][gc,task,stats] GC(2)   5     329418          1          1     329418


The number of partial array steals is about the same.
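
Summing the steal column from the two tables bears that out; a quick Python
sketch, with the per-thread counts transcribed from the output above:

```python
# (thr, push, steal, chunked, chunks) rows transcribed from the two
# TASKQUEUE_STATS tables above.
old_rows = [
    (0, 306340, 0, 19, 306335),
    (1, 300040, 2,  7, 300041),
    (2, 303588, 1,  0, 303589),
    (3, 334770, 3,  0, 334771),
    (4, 332980, 3,  0, 332981),
    (5, 436049, 1,  0, 436050),
]
new_rows = [
    (0, 315502, 2, 18, 315503),
    (1, 314152, 1,  0, 314153),
    (2, 321092, 2,  7, 321093),
    (3, 406893, 4,  0, 406893),
    (4, 326710, 1,  0, 326707),
    (5, 329418, 1,  1, 329418),
]

def column_total(rows, col):
    # Sum one column across all worker threads.
    return sum(r[col] for r in rows)

STEAL = 2  # index of the steal column in the tuples above
print("partial array steals:",
      column_total(old_rows, STEAL), "vs", column_total(new_rows, STEAL))
# 10 vs 11 -- essentially the same amount of stealing in both variants
```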

So I don't yet know why the change is so beneficial for GC(2).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20720#issuecomment-2316773823


More information about the hotspot-gc-dev mailing list