RFR: 8140326: G1: Consider putting regions where evacuation failed into next collection set [v4]

Tue Aug 1 07:36:54 UTC 2023

On Tue, 13 Jun 2023 08:53:03 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 357:
>> 
>>> 355:         g1h->clear_bitmap_for_region(r);
>>> 356:         r->reset_top_at_mark_start();
>>> 357:         cm->clear_statistics(r);
>> 
>> I don't get why tams and CM-related data-structure need to be cleared here; shouldn't they be handled inside `G1ClearBitmapClosure` already? (On a higher reasoning, evac-fail processing shouldn't interact with CM; I am assuming the bitmap is not related to CM.)
>
>> I don't get why tams and CM-related data-structure need to be cleared here; shouldn't they be handled inside G1ClearBitmapClosure already?
> 
> First, evacuation failure already updated region statistics during concurrent start (as well as g1 updated TAMS during concurrent start).
> 
> It is not that evacuation failure and marking are linked, it is that bitmaps and TAMS and statistics are linked. Previous code also assumed that all old regions were marked through, so (as optimizations):
> 
> 1) Bitmap: g1 keeps the marks on the bitmap of evac-fail regions
> 2) TAMS: could be kept to top() for all evac-failed regions during concurrent start
> 3) statistics: evacuation failure did already update them; clearing them in some cases simply been avoided by conditionally updating statistics (maybe remove that `if` now in evacuation failure handling now, what do you think?).
> 
> Generally, if you update one, you need to keep the other in sync for marking to work. However none of the marking data cleanup has been necessary before because of that assumption that all old regions are to be marked through.
> 
> `G1ClearBitmapClosure` doesn't interact with marking proper (as in: marking live objects on the bitmap), the regions that are cleared in that phase are only those that may not have their bitmap cleared yet due to the "clear bitmaps concurrently" optimization. If that were not the case, g1 would not need that phase.
> At that time TAMS and statistics have already been reset by marking, so it does not need to (and both are marking data structures anyway).
> 
> That is, the reason to have `G1ClearBitmapClosure` is different compared to `ClearRetainedRegionBitmaps`.
> 
> Now that not all old regions may be marked through at concurrent start any more, g1 needs to undo the eagerly updated statistics and TAMS to not confuse other code (mainly marking, but there are asserts wrt to TAMS all over the place).
> 
> I do not think there is a way to avoid evacuation failure to interact with concurrent marking (during the concurrent start pause at least) if we want to clear out pinned regions efficiently and quickly.
> 
> Maybe making this explicit with something like (didn't test):
> 
> 
> bool drop_mark_data = g1h->collector_state()->in_concurrent_start_gc() && 
>                             g1h->policy()->retain_evac_failed_region(r);
> if (drop_mark_data) {
>   // clear statistics and tams
> } else {
>   // tams and statistics must be "set"
> }
> bool drop_bitmap = !g1h->collector_state()->in_concurrent_start_gc() ||
>               ...

tams and stats are updated eagerly in conc-start-gc-pause and later cleared if the evac-fail region is sparse enough. Can update to tams and stats be deferred to when the fate of this evac-fail region is decided? IOW, instead of checking is-conc-start-gc-pause in multiple places, do it once only inside evac-fail-processing to restore the tams and stats invariants if really needed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14220#discussion_r1279584827