RFR: 8140326: G1: Consider putting regions where evacuation failed into next collection set [v4]

Fri Jun 9 13:59:48 UTC 2023

On Tue, 6 Jun 2023 08:13:42 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> This change adds management of retained regions, i.e. trying to evacuate evacuation failed regions asap.
>> 
>> The advantage is that evacuation failed regions do not need to wait until the next marking to be cleaned out; as they are often very sparsely occupied (often being eden regions), they waste a lot of space, potentially causing additional evacuation failures later on.
>> Another use of this change will be region pinning: pinned regions are basically evacuation failed regions that cannot be reclaimed as long as they are pinned; however, as soon as they are unpinned, they should be reclaimed for the same reasons as well.
>> 
>> It consists of several behavioral changes:
>> 
>> During garbage collection:
>> 
>> ... in the Evacuation phase:
>> * always collect dirty cards into evacuation failed regions proactively. In tests, the number of cards/live objects per evacuation failed region is typically very small. Dirty cards are always put into the global refinement buffer immediately, assuming that we will keep most if not all evacuation failed regions.
>> 
>> ... during Post Evacuation 2/Free Collection Set phase:
>> * determine whether the region will be retained (kept for "immediate" evacuation) or not. Highly occupied regions are assumed to stay (mostly) live at least until the next marking, so do not bother with them.
>> 
>> These "retained" regions are collected in a new "from retained" set in the collection set candidates and managed separately from "from marking" regions. Keeping the "from retained" and "from marking" sets separate in the collection set candidates is easier to manage than using a single list and then picking from it, particularly with respect to making sure that mixed gcs pick from the "from marking" list first and from the "from retained" list second.
>> 
>> ... determining the collection set during the pause:
>>   * during gc, the collection set is preferentially (first) populated with regions from the "from marking" candidates (these are the important regions to get cleaned out), second from the "from retained" list.
>>   * g1 reserves up to 20% of max gc pause time for retained regions as optional candidates (an arbitrarily chosen value) to make sure that these are cleared out asap to free memory. There is also a minimum number of regions to take from the retained regions list.
>> 
>> During marking
>> 
>> ... changes to marking proper
>> * retained regions will not be marked through during concurrent mark, i.e. they are considered outside of ...
>
> Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - More debug option removal
>  - Remove debug options for tests

src/hotspot/share/gc/g1/g1CollectionSetCandidates.hpp line 138:

> 136: 
> 137: // Iterator for G1CollectionSetCandidates. Multiplexes across the marking/retained
> 138: // region lists based on gc efficiency.

Why does the iteration require ordering? Going through its use sites, I find that they just perform the same SIMD-like operation on each element, regardless of order.
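
To illustrate the question, here is a minimal sketch of an order-agnostic visit over both lists. `Candidate` and `for_each_candidate` are hypothetical stand-ins for illustration, not the HotSpot types or the PR's API:

```cpp
#include <vector>

// Hypothetical stand-in for a candidate region; not the HotSpot type.
struct Candidate {
  unsigned region_index;
  double gc_efficiency;  // unused here: the visit does not depend on it
};

// Applies f to every candidate in both lists. The marking list is walked
// first, then the retained list; no merging by gc efficiency takes place.
template <typename F>
void for_each_candidate(const std::vector<Candidate>& marking,
                        const std::vector<Candidate>& retained,
                        F f) {
  for (const Candidate& c : marking) {
    f(c);
  }
  for (const Candidate& c : retained) {
    f(c);
  }
}
```

If every use site only applies a per-element operation, a plain walk like this would be enough, and the multiplexing by efficiency could be dropped.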

src/hotspot/share/gc/g1/g1Policy.cpp line 516:

> 514: }
> 515: 
> 516: double G1Policy::predict_retained_regions_evac_time() const {

It seems that the result is used only for log printing in its sole caller, `predict_base_time_ms`. It is also odd that this method is called there at all.

src/hotspot/share/gc/g1/g1Policy.cpp line 655:

> 653: 
> 654:   size_t threshold = G1MixedGCLiveThresholdPercent * HeapRegion::GrainBytes / 100;
> 655:   return live_bytes < threshold;

If the region was Old at gc-start, it would already have passed the same criteria to be in the cset; if the region was Young at gc-start, using a threshold meant only for Old regions is questionable, IMO.

Could retaining be performed unconditionally for now based on the assumption that evac-fail regions are sparse?
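
For concreteness, a small sketch contrasting the two policies. The constants are assumptions for illustration: 85 is the default value of `G1MixedGCLiveThresholdPercent`, the 4 MiB region size is made up, and both function names are hypothetical:

```cpp
#include <cstddef>

// Assumed values for illustration only: 85 is the default of
// G1MixedGCLiveThresholdPercent; the 4 MiB region size is hypothetical.
constexpr std::size_t kLiveThresholdPercent = 85;
constexpr std::size_t kRegionBytes = 4u * 1024 * 1024;

// Mirrors the quoted check: retain only if the live bytes fall below
// the percentage threshold of the region size.
constexpr bool retain_with_threshold(std::size_t live_bytes) {
  return live_bytes < kLiveThresholdPercent * kRegionBytes / 100;
}

// The alternative raised above: retain every evacuation failed region,
// relying on the assumption that such regions are sparse.
constexpr bool retain_unconditionally(std::size_t /* live_bytes */) {
  return true;
}
```

With these numbers the threshold is 3565158 bytes, so the two policies only differ for regions that are at least about 85% live.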

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 357:

> 355:         g1h->clear_bitmap_for_region(r);
> 356:         r->reset_top_at_mark_start();
> 357:         cm->clear_statistics(r);

I don't get why TAMS and the CM-related data structures need to be cleared here; shouldn't they be handled inside `G1ClearBitmapClosure` already? (At a higher level, evac-fail processing shouldn't interact with CM; I am assuming the bitmap is not related to CM.)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1224289903
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1224342203
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1224188917
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1224198102