RFC: Parallel full collection for G1

Wed Nov 9 22:19:10 UTC 2016

Hi Aleksey,

  thanks for your input.

On Wed, 2016-11-09 at 18:08 +0100, Aleksey Shipilev wrote:
> Hi,
> 
> On 11/09/2016 04:15 PM, Erik Helin wrote:
> > 
> > currently G1 Full GC is very slow as it is done serially. In some
> > cases
> > this might not be optimal.
> YES, single-threaded failure mode is very bad and leaves bad
> impression for those users who mistuned or otherwise fallen into the
> failure mode tarpit. Failure mode should not be unnecessarily
> painful. See e.g.
>  http://cr.openjdk.java.net/~shade/shenandoah/treefragger.txt

Honestly, I would have expected worse for such a load. P90/P95 for G1
seems comparable to Shenandoah, the remainder seems to be partially an
issue with the low average pause time goal of 200ms (it achieves 141ms
in the end), fixing that to something comparable to the others would
probably decrease the number of GCs significantly (serial overhead).
And then there is of course the slow full gc.
It may perform worse though, idk.

Did you analyze the results for the G1 collector further about the
reasons, ie. on when and why they fail? This would yield some data
point(s) that could be used here.

> > - piggybacking on an already running concurrent mark is preferable
> > when G1 is about to encounter a concurrent mode failure. Typically
> > a concurrent mark is about to be done, unless the dynamic IHOP
> > predictions were completely off. If so, then most of the marking
> > work is already done. This could save significant amount of time in
> > the following compaction phase of that full GC.
> Is that concurrent marking information as accurate though? E.g. does
> SATB brings the marked objects that are garbage? If so, is there a
> failure-failure mode when we don't clean up all the garbage? That
> would be bad for a total cleanup phase the full GC is.

As you correctly point out the information may contain a significant
amount of garbage. Do you have data about this being an issue on a
"real world" load?

However, the suggestion to reuse the current marking information might
be save on overall max latency. If this "inexact" full GC frees up
enough memory to keep on going (eg. depending on whether we can free
enough to be reasonably sure that a following marking phase can
complete - we have that information) we are good. Particularly if the
collector does not try to compact everything (like the deadwood
optimization for current full gcs; you removed the paragraph mentioning
that from the proposal).

In my experience, in regular operation, the first few mixed gcs after
concurrent mark typically reclaim a large part of the heap. Which
means, that typically even an inexact marking will yield "enough" free
space to continue working (e.g. for mixed gcs or to complete marking).
I do not expect this to be different in that situation.

Depending on how long the from-scratch marking/LA takes, this might
save a large part of pause time. A from-scratch Full GC is maybe better
on overall throughput (not necessarily, as it might do lots of
unnecessary work so that in the end it is worse).

Assuming that this concurrent mode failure due to temporary high
allocation/survival rate is rather uncommon, I would expect that this
could significantly save on total "Full GC" pause time.

Now one can argue that one still might want to have a "do everything
from scratch" gc for certain occasions (e.g. when you did not start
marking yet at all, or some user really really requesting it).

>From a user perspective, in case the application regularly runs into
this issue (i.e. the promotion rate is just too high), more frequent
shorter pauses seem "better" to me than less frequent long ones too -
because even a fully parallel full gc takes quite a bit of time on
somewhat larger heaps. Even the 8s ones in your micro seem a bit long.

If full gc is a very uncommon failure mode, doing as suggested seems to
be favorable in any case as it completely avoids any "long" pauses.

So overall this seems to be the "lesser" painful kind of behavior to
me, but ymmv.

Only trying to give pros and cons. I could imagine both have their uses
depending on the application or use case.

Sorry for my probably incoherent ramblings :)

> > - parallelizing the closures used within the "MarkSweep framework"
> > will
> >   result in a parallel full GC that can handle the worst case
> >   from-scratch Full GC better.  I.e. even though this algorithm
> > will
> >   have to redo marking in a STW pause, it will get the most precise
> >   liveness information and so will be able to compact the heap more
> >   densely.  This approach can also handle the case when G1 is
> >   completely out of regions.
> Aha, liveness. In Shenandoah, full GC is almost completely separate
> STW mark-compact. And we don't need region liveness data during the 
> full GC, because full GC does all regions, no need to have liveness
> data to select collection set. Avoiding liveness analysis for full-
> gc-mark improves performance.

Isn't the marking part of a mark-compact algorithm some kind of
liveness analysis? (I.e. the usual reachability?)

I am also a bit confused about what you mean with "region liveness" in this context. The initial suggestion did not mention it at all. Can you clarify what you mean here? Particularly because any result of the marking/LA could be recalculated as "liveness per region/area", but I agree with you, I am not sure how this would be particularly useful for the full GC.
Also note, that a G1 full GC may need to supply some "region liveness" (depending on what you mean) for continued operation anyway.

Anyway, sorry, if we were a little bit imprecise or confusing in the descriptions.

This one is basically about redoing everything from scratch as you suggest.

The pauses may be longer overall, but delays any further (heavy) Full GCs longer if the load/promotion rate is not sustainable.

One could also start with this, and then think of the other.

> I tend to think that for failure mode, "drop everything on the floor,
> and do the accurate mark, linear compaction, and pointer fixups" is
> the best way to do it. In Shenandoah, the code cost is not that high
> to frown about having a separate impl for full GC, and I think in G1
> it would be the same.

Thanks again for your input. Code cost may or may indeed not be an
issue.
At least one goal should probably be to completely remove the existing
copy&paste G1 full GC code :)

Thanks,
  Thomas