RFR: 8140326: G1: Consider putting regions where evacuation failed into next collection set [v10]

Thu Aug 3 12:17:41 UTC 2023

On Thu, 3 Aug 2023 07:07:02 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> This change adds management of retained regions, i.e. trying to evacuate evacuation failed regions asap.
>> 
>> The advantage is that evacuation failed regions do not need to wait until the next marking to be cleaned out; as they are often very sparsely occupied (often being eden regions), this occupies a lot of space, potentially causing additional evacuation failures later on.
>> Another use of this change will be region pinning, which are basically evacuation failed regions that can not be reclaimed as long as they are pinned - however as soon as they are unpinned, they should be reclaimed for the same reasons as well.
>> 
>> It consists of several behavioral changes:
>> 
>> During garbage collection:
>> 
>> ... in the Evacuation phase:
>> * always collect dirty cards into evacuation failed regions proactively. In tests, the amount of cards/live objects per evacuation failed region is typically very small. Dirty cards are always put into the global refinement buffer immediately, assuming that we will keep most if not all evacuation failed regions.
>> 
>> ... during Post Evacuation 2/Free Collection Set phase:
>> * determine whether the region will be retained (kept for "immediate" evacuation) or not. Highly occupied regions are assumed to stay (mostly) live at least until the next marking, so do not bother with them.
>> 
>> These "retained" regions are collected in a new "from retained" set in the collection set candidates and managed separately from "from marking" regions. Having the "from retained" and "from marking" sets separate in the collection set candidates is easier to manage than to use a single list and the picking stuff from it. Particularly wrt to making sure that mixed gcs preferentially pick from the "from marking" list first, then second from the "from retained" list.
>> 
>> ... determining the collection set during the pause:
>>   * during gc, the collection set is preferentially (first) populated with regions from the "from marking" candidates (these are the important regions to get cleaned out), second from the "from retained" list.
>>   * g1 reserves up to 20% of max gc pause time for retained regions as optional candidates (this is a random number) to make sure that these are cleared out asap to free memory. There is also a minimum number of regions to take from the retained regions list.
>> 
>> During marking
>> 
>> ... changes to marking proper
>> * retained regions will not be marked through during concurrent mark, i.e. they are considered outside of ...
>
> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update retained region's top/stats at the end only; move pre_concurrent_start to before determining collection set to simplify condition what regions to set tams for

src/hotspot/share/gc/g1/g1HeapVerifier.cpp line 464:

> 462:     G1ConcurrentMark* cm = G1CollectedHeap::heap()->concurrent_mark();
> 463: 
> 464:     bool part_of_marking = !(r->is_collection_set_candidate() || !r->is_old_or_humongous());

`is_old_or_humongous() && !is_collection_set_candidate()` is probably easier to read; just like what is in `HeapRegion::note_start_of_marking`.

src/hotspot/share/gc/g1/g1HeapVerifier.cpp line 506:

> 504:   // To compare liveness recorded in G1ConcurrentMark and actual we need to flush the
> 505:   // cache.
> 506:   G1CollectedHeap::heap()->concurrent_mark()->flush_all_task_caches();

It can be surprising for verification to update global vars (visible side effect). Does it make sense to verify marking related stats when marking is complete?

src/hotspot/share/gc/g1/g1Policy.cpp line 270:

> 268:     desired_eden_length_by_mmu = calculate_desired_eden_length_by_mmu();
> 269: 
> 270:     double base_time_ms = predict_base_time_ms(pending_cards, rs_length, true /* include_retained */);

Can "retained_time" be accounted for inside `calculate_desired_eden_length_by_pause`? It's kind of confusing that  the content of"base time" is context sensitive.

src/hotspot/share/gc/g1/g1Policy.cpp line 527:

> 525:     if (result + predicted_time_ms >= max_time_for_retaining()) {
> 526:       // Over limit. Exit.
> 527:       break;

Using a contrived example, if the first region exceeds the limit, this will say retain-time is zero. However, inside young-gc pause, the min num of retained regions is 1, decided in `G1Policy::select_candidates_from_retained`.

This discrepancy seems a bit unsettling. Can one use the min-value here, like its counterpart `calculate_desired_eden_length_before_mixed` above?

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 372:

> 370: 
> 371: public:
> 372:   ClearRetainedRegionData(G1EvacFailureRegions* evac_failure_regions) :

The class name should be updated, sth like "ProcessRetainedRegion".

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1283019027
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1283017844
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1283099289
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1283116247
PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1283029201