RFR: 8256265: G1: Improve parallelism in regions that failed evacuation

Wed Jan 26 12:26:33 UTC 2022

On Wed, 12 Jan 2022 09:03:45 GMT, Hamlin Li <mli at openjdk.org> wrote:

> Currently G1 assigns a thread per failed evacuated region. This can in effect serialize the whole process as often (particularly with region pinning) there is only one region to fix up.
> 
> This patch tries to improve parallelism when walking over the regions in chunks
> 
> Latest implementation scans regions in chunks to bring parallelism, it's based on JDK-8278917 which changes to uses prev bitmap to mark evacuation failure objs.
> 
> Here's the summary of performance data based on latest implementation, basically, it brings better and stable performance than baseline at "Post Evacuate Cleanup 1/remove self forwardee" phase. (Although some regression is spotted when calculate the results in geomean, becuase one pause time from baseline is far too small than others.)
> 
> The performance benefit trend is:
>  - pause time (Post Evacuate Cleanup 1) is decreased from 76.79% to 2.28% for average time, from 71.61% to 3.04% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=8)
>  - pause time (Post Evacuate Cleanup 1) is decreased from 63.84% to 15.16% for average time, from 55.41% to 12.45% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=<default=123>)
> ( Other common Evacuation Failure configurations are:
> -XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0 )
> 
> For more detailed performance data, please check the related bug.

Regarding the log messages: We might want to fix up them a bit, I did not look at our recent email discussion on what we came up with, and their level.

Some other initial thoughts worth considering:

*) What I already noticed yesterday on some tests, and can also be seen in your log snippet, is that the "Remove self-forwards in chunks" takes a lot of time, unexpectedly much to me actually. I want to look further into this to understand the reason(s).

*) The other concern I have is whether we really need (or can avoid) the need for the "Wait for Ready In Retained Regions" phase. It looks a bit unfortunate to actually have a busy-loop in there; this should definitely use proper synchronization or something to wait on if it is really needed. What of the retained region preparation do we really need? On a first look, maybe just the BOT reset, which we might be able to put somewhere else (I may be totally wrong). Also, if so, the Prepare Retained regions should probably be split out to be started before all other tasks in this "Post Evacuate Cleanup 1" phase.

I can see that from a timing perspective "Wait For Ready" is not a problem in all of my tests so far.

*) The "Prepared Retained Regions" phase stores the amount of live data into the `HeapRegion`; for this reason the change adds these `G1RegionMarkStats` data gathering via the `G1RegionMarkStatsCache`; I think the same information could be provided while iterating over the chunks (just do an `Atomic::add` here) instead. A single `Atomic::add` per thread per retained region at most seems to be okay. That would also remove the `Evac Fail Merge Live` phase afaict.

*) Not too happy that the `G1HeapRegionChunk` constructor does surprisingly much work, which surprisingly takes very little time.

*) I was wondering whether it would be somewhat more efficient for the `Prepare Chunks` phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7047