G1's parallel full GC significantly increases wasted space in Old regions

Thomas Schatzl thomas.schatzl at oracle.com
Sat Feb 17 12:18:12 UTC 2018


Hi,

On Fri, 2018-02-16 at 17:54 -0800, Man Cao wrote:
> Hi,
> 
> We (Java platform team at Google) are comparing G1's performance in
> JDK9u and JDK10. We expect JDK10's G1 to perform better because of
> JEP 307 (Parallel Full GC).
> However, we found a performance regression in JDK10 with DaCapo
> benchmarks. We set the heap size small (about 2-4 times the minimum
> heap) so that they trigger interesting GC activity.

This is really small - most of these benchmarks run fine with heaps in
the low tens of MB iirc...

> We found JDK10's full GC results in significantly more wasted space
> in Old regions, which leads to a more fragmented heap and fewer Eden
> regions. We also found the amount of wasted space after a full GC is
> proportional to the number of ParallelGCThreads. As a result, several
> benchmarks trigger more Young, Mixed and concurrent collections,
> leading to increased CPU usage and pause time. One reason these
> benchmarks are sensitive to full GC is that the DaCapo harness
> performs a System.gc() between iterations, so a more fragmented heap
> hurts the benchmark from the start of every iteration.
> 
> We are aware this is probably a known issue as described in JEP 307:
> "Risks and Assumptions: The fact that G1 uses regions will most
> likely lead to more wasted space after a parallel full GC than for a
> single threaded one."
> However, it is not impossible to optimize the full GC to reduce
> wasted space. After all, a stop-the-world parallel mark-sweep-compact
> algorithm should be able to efficiently compact the heap.

The problem is that compacting these "tail regions" (the partially
filled last region each worker leaves behind) requires a significant
amount of synchronization if you want to do it in parallel.
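
To see why, here is a minimal Java sketch (made-up names, not the
HotSpot code): if several workers compacted into the same destination
area, every moved object would have to atomically claim its new
address, e.g. with a CAS on a shared top pointer:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical model of a destination area shared by all GC
    // workers. For illustration only, not HotSpot code.
    class SharedDestination {
        private final AtomicLong top = new AtomicLong(0);

        // Every object any worker moves must atomically claim its
        // forwarding address, so each move pays for a CAS and, under
        // contention, for retries.
        long claimForwardingAddress(long objectSize) {
            long old;
            do {
                old = top.get();
            } while (!top.compareAndSet(old, old + objectSize));
            return old;
        }
    }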

The current (compaction part of the) algorithm is basically serial GC
run on completely distinct sets of regions ("compaction queues"), i.e.
it does no synchronization at all, which makes it fairly fast.

I think there is some mechanism to have a single thread compact the
"tail regions" at the end, but it is only used in specific
circumstances. Maybe others want to chime in here. :)
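
Putting those two paragraphs into sketch form (illustrative Java with
made-up names, not the actual G1 code):

    import java.util.List;

    // Illustrative model of one worker's "compaction queue": a
    // disjoint set of regions that no other worker touches, so
    // compacting within it needs no synchronization at all.
    class CompactionQueue {
        private final List<Region> regions;

        CompactionQueue(List<Region> regions) {
            this.regions = regions;
        }

        // Models the space accounting of sliding all live bytes to
        // the front of the queue, serial-GC style: returns the unused
        // space left in the last, partially filled destination region
        // (the "tail"). With N workers there are up to N such tails,
        // which is why the waste grows with ParallelGCThreads.
        long wasteAfterCompaction(long regionCapacity) {
            long live = 0;
            for (Region r : regions) {
                live += r.liveBytes();
            }
            long tailUsed = live % regionCapacity;
            return tailUsed == 0 ? 0 : regionCapacity - tailUsed;
        }
    }

    interface Region {
        long liveBytes();
    }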

We found that it is usually more efficient (and actually improves
overall throughput) to simply use fewer threads on small heaps than
to have a slow serial phase or to add costly synchronization to the
parallel compaction. Using fewer threads also causes less of this
fragmentation, which reduces the issue and its effects significantly.
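
Such a heuristic could be as simple as the following sketch (the
64 MB per-worker budget is an invented example value, not what the
actual ergonomics use):

    // Sketch of a "fewer workers on small heaps" heuristic; the
    // per-worker budget below is an invented example value.
    class FullGcErgonomics {
        static int fullGcWorkers(long heapBytes, int parallelGcThreads) {
            final long perWorkerBudget = 64L * 1024 * 1024;
            long byHeapSize = Math.max(1L, heapBytes / perWorkerBudget);
            return (int) Math.min(parallelGcThreads, byHeapSize);
        }
    }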

There is no way for Java programs to get a guaranteed "100% maximally
compacting GC" at this time (unless you run the full GC with a single
thread, i.e. -XX:ParallelGCThreads=1). Note that other parallel full
GCs (e.g. Parallel GC) have the same issue afaik, although Parallel GC
uses a smaller "region size".

> We did not find any RFE or discussion on JBS regarding this. Is there
> any ongoing effort to reduce wasted space in parallel full GC?

See JDK-8194316 [0], which reports the same issue, and JDK-8196071 [1]
for an RFE with a potential fix.

There are, btw, a few more open issues related to parallel full GC; I
just added a "gc-g1-fullgc" label to them [4]. The list may not be
exhaustive, and as usual, it's done when it's done - we welcome
contributions :)

If you want to work on any of these, it might be useful to start a
discussion here first to get further help/thoughts.

JDK 11 may also contain more changes to the ergonomics to better
support small heaps; see JDK-8172792 [2]. Note that the JEP does not
cover full GC.

(Shameless plug: recent FOSDEM presentation about G1 changes [3])

Thanks,
  Thomas

[0] https://bugs.openjdk.java.net/browse/JDK-8194316
[1] https://bugs.openjdk.java.net/browse/JDK-8196071 
[2] https://bugs.openjdk.java.net/browse/JDK-8172792
[3] https://fosdem.org/2018/schedule/event/g1/
[4] https://bugs.openjdk.java.net/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+gc-g1-fullgc



