RFR: 8140326: G1: Consider putting regions where evacuation failed into next collection set

Thu Jun 1 08:22:10 UTC 2023

On Tue, 30 May 2023 14:00:30 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

> This change adds management of retained regions, i.e. trying to evacuate evacuation failed regions asap.
> 
> The advantage is that evacuation failed regions do not need to wait until the next marking to be cleaned out; as they are often very sparsely occupied (often being eden regions), this occupies a lot of space, potentially causing additional evacuation failures later on.
> Another use of this change will be region pinning: pinned regions are basically evacuation failed regions that can not be reclaimed as long as they are pinned. However, as soon as they are unpinned, they should be reclaimed for the same reasons as well.
> 
> It consists of several behavioral changes:
> 
> During garbage collection:
> 
> ... in the Evacuation phase:
> * always collect dirty cards into evacuation failed regions proactively. In tests, the amount of cards/live objects per evacuation failed region is typically very small. Dirty cards are always put into the global refinement buffer immediately, assuming that we will keep most if not all evacuation failed regions.
> 
> ... during Post Evacuation 2/Free Collection Set phase:
> * determine whether the region will be retained (kept for "immediate" evacuation) or not. Highly occupied regions are assumed to stay (mostly) live at least until the next marking, so do not bother with them.
> 
> These "retained" regions are collected in a new "from retained" set in the collection set candidates and managed separately from "from marking" regions. Keeping the "from retained" and "from marking" sets separate in the collection set candidates is easier to manage than using a single list and then picking regions from it, particularly with respect to making sure that mixed gcs preferentially pick from the "from marking" list first, then from the "from retained" list.
> 
> ... determining the collection set during the pause:
>   * during gc, the collection set is preferentially (first) populated with regions from the "from marking" candidates (these are the important regions to get cleaned out), second from the "from retained" list.
>   * g1 reserves up to 20% of max gc pause time for retained regions as optional candidates (this is a random number) to make sure that these are cleared out asap to free memory. There is also a minimum number of regions to take from the retained regions list.
> 
> During marking
> 
> ... changes to marking proper
> * retained regions will not be marked through during concurrent mark, i.e. they are considered outside of the snapshot. So they are ...

I've been asked why the policy strictly separates from-marking and retained regions: the aim is to keep the current gc cycle policy fairly intact.
I.e. do the same as before, plus some almost optional additional reclamation of regions known to be sparsely populated (compared to existing code).
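
To make the separation concrete, here is a minimal sketch (Python, not HotSpot code) of the selection order described in the quoted text: take "from marking" candidates first, then "from retained" candidates out of a reserved share of the pause budget, with a minimum number of retained regions. All names (`Region`, `pick_collection_set`, `predicted_ms`, `min_retained`) and all numbers are hypothetical; in particular, carving the 20% out of the total pause budget is an assumption about how the reservation works, not something taken from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    predicted_ms: float  # hypothetical predicted evacuation time


def pick_collection_set(from_marking, from_retained, budget_ms,
                        retained_fraction=0.2, min_retained=1):
    """Pick regions from two separate candidate lists.

    Assumption: the retained share is reserved out of the pause budget,
    and the rest is available to the "from marking" candidates.
    """
    retained_budget = budget_ms * retained_fraction
    marking_budget = budget_ms - retained_budget

    # First: "from marking" candidates, in list order, while they fit.
    picked_marking = []
    used = 0.0
    for region in from_marking:
        if used + region.predicted_ms > marking_budget:
            break
        picked_marking.append(region)
        used += region.predicted_ms

    # Second: "from retained" candidates from the reserved share, but
    # always take at least min_retained regions if any are available.
    picked_retained = []
    used = 0.0
    for region in from_retained:
        over_budget = used + region.predicted_ms > retained_budget
        if over_budget and len(picked_retained) >= min_retained:
            break
        picked_retained.append(region)
        used += region.predicted_ms

    return picked_marking, picked_retained


marking = [Region("m0", 30.0), Region("m1", 30.0), Region("m2", 30.0)]
retained = [Region("r0", 5.0), Region("r1", 5.0), Region("r2", 15.0)]
picked_m, picked_r = pick_collection_set(marking, retained, budget_ms=100.0)
print([r.name for r in picked_m], [r.name for r in picked_r])
```

With these made-up numbers the from-marking pick is unaffected by how many retained regions exist, which is the point of keeping the two lists apart.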

The current heuristic of doing young gcs, then a fixed number of mixed gcs that clean out the old gen asap is, as ugly as it is, surprisingly good in the general case. Treating the retained regions the same as from-marking regions would make it necessary to rethink that: retained regions are most of the time prime targets for evacuation (high efficiency), which means that g1 would start concentrating on these regions first even during the mixed phase (which is generally fine...), but due to how mixed gc works (use "smallest young gen", conceptually fixed number of gcs) it ultimately would not clean out old gen *fast* enough (or completely, depending on the situation) as tuned right now. All the low efficiency regions would need to be cleaned out later. However, the prediction isn't good enough to cope well with them: low efficiency regions are typically predicted worse than high efficiency ones, and a misprediction for them costs more than for high efficiency regions, so g1 will not take enough of them.
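
A toy illustration of that effect (Python, hypothetical numbers, not G1's actual policy code): with a fixed number of mixed gcs and a fixed per-gc time budget, a single list sorted by efficiency spends the first gc on retained regions and leaves from-marking regions behind. `Candidate`, `run_mixed_gcs` and the definition of efficiency as reclaimable space per predicted millisecond are assumptions made for this sketch, and the reserved retained share from the actual change is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    reclaimable_mb: float  # made-up reclaimable space
    predicted_ms: float    # made-up predicted evacuation time

    @property
    def efficiency(self):
        # Assumed metric: reclaimable space per predicted millisecond.
        return self.reclaimable_mb / self.predicted_ms


def run_mixed_gcs(candidates, num_gcs, budget_ms):
    """Run a fixed number of gcs; return the candidates never collected."""
    remaining = list(candidates)
    for _ in range(num_gcs):
        used = 0.0
        left = []
        for c in remaining:
            if used + c.predicted_ms <= budget_ms:
                used += c.predicted_ms  # collected in this gc
            else:
                left.append(c)          # does not fit, try next gc
        remaining = left
    return remaining


marking = [Candidate(f"m{i}", 8.0, 10.0) for i in range(4)]   # low efficiency
retained = [Candidate(f"r{i}", 30.0, 4.0) for i in range(5)]  # high efficiency

# Single list, sorted by efficiency: retained regions come first.
merged = sorted(marking + retained, key=lambda c: c.efficiency, reverse=True)
left_merged = [c.name for c in run_mixed_gcs(merged, num_gcs=2, budget_ms=20.0)]

# Separate handling: from-marking regions are always considered first.
left_separate = [c.name for c in
                 run_mixed_gcs(marking + retained, num_gcs=2, budget_ms=20.0)]

print("merged list leaves:  ", left_merged)
print("marking first leaves:", left_separate)
```

In the merged case two from-marking regions survive the mixed phase; with from-marking regions first, only retained regions are left over, and those are cheap enough to be picked up by later young gcs.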

Now one could extend the mixed phase, but in corner cases g1 would then potentially stay in the mixed phase forever (if evacuation failures, and later pinned regions, were commonly encountered but still few), which prohibits marking (leading to full gcs) and degrades performance (as it will be using a small young gen).

In some way, evacuation failed regions can be seen as extra regions/work due to a mismatch between application and VM configuration (compared to current master). Concentrating on that extra work in the phase that's about keeping the gc cycle going without full gc isn't the best thing to do (the current policy just has fairly strong provisions to take low efficiency regions and avoid full gc). In particular, it has less overall impact to stuff high efficiency region collection into the existing young gcs, which can go almost if not at full speed (i.e. it has less impact to be wrong about the prediction of a high efficiency region vs. a low efficiency one).

Basically I think g1 ultimately needs to get away from what mixed gc is now, and from how g1 determines when to start/stop that reclamation phase and how to determine the "right" amount of things to evacuate at what speed. That will certainly have something to do with making sure that the old gen allocation rate is countered, while at the same time being efficient overall (i.e. doing no more than necessary, to do as many young collections as possible within the allowed time goal) without degrading into full gcs.

Fwiw: https://bugs.openjdk.org/browse/JDK-8159697.

I tried and failed to do that in this change (to be better than the current heuristic). Apart from that being a different topic, this change has its own merit: it improves resiliency vs. evacuation failure. In the past, you would often get an evacuation failure because you ran out of memory. Since this produced lots of garbage, there was a very high risk of getting into another evacuation failure: the further decreased free space causes gcs more often (smaller young gen available), which causes more surviving objects, resulting in more serious evacuation failures (with more live objects). Ultimately you end up with a full gc very quickly because there is not enough time to mark through the old gen.
With this change g1 will compact the evacuation failed regions fairly quickly (in the next young gc), allowing better recovery and/or making it easier for g1 not to lose the race with the mutator until the next marking cycle.

Obviously it is also an important stepping stone for handling pinned regions reasonably.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14220#issuecomment-1571581367