RFR: 8256265: G1: Improve parallelism in regions that failed evacuation [v6]
Hamlin Li
mli at openjdk.java.net
Mon Feb 14 08:20:12 UTC 2022
On Mon, 14 Feb 2022 01:24:35 GMT, Hamlin Li <mli at openjdk.org> wrote:
>> Currently G1 assigns a thread per failed evacuated region. This can in effect serialize the whole process as often (particularly with region pinning) there is only one region to fix up.
>>
>> This patch tries to improve parallelism by walking over the regions in chunks.
>>
>> The latest implementation scans regions in chunks to bring parallelism. It's based on JDK-8278917, which changes G1 to use the prev bitmap to mark evacuation failure objects.
>>
>> Here's a summary of the performance data based on the latest implementation. Basically, it brings better and more stable performance than the baseline in the "Post Evacuate Cleanup 1/remove self forwardee" phase. (Although some regression is spotted when calculating the results as a geomean, because one pause time from the baseline is far smaller than the others.)
>>
>> The performance benefit trend is:
>> - pause time (Post Evacuate Cleanup 1) is decreased from 76.79% to 2.28% for average time, from 71.61% to 3.04% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=8)
>> - pause time (Post Evacuate Cleanup 1) is decreased from 63.84% to 15.16% for average time, from 55.41% to 12.45% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=<default=123>)
>> ( Other common Evacuation Failure configurations are:
>> -XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0 )
>>
>> For more detailed performance data, please check the related bug.
>
> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
>
> - Merge branch 'master' into parallelize-evac-failure-in-bm
> - Sync marked words to HeapRegion; Fix compilation error on windows
> - prepare evacuation failure regions explicitly before post 1; fix remset verification crash
> - Remove Prepare-Region and Wait-For-Ready phases
> - Add logging code from Thomas
> - Collect livewords in chunk closure
> - clean vm options
> - sync region preparation before iterate through chunks in a region
> - use const G1CMBitMap
> - adapt to update_bot_if_crossing_boundary changes
> - ... and 3 more: https://git.openjdk.java.net/jdk/compare/eff5dafb...85bb0635
Hi Thomas,
My test (with the latest implementation) shows that when the number of evacuation failure regions is less than the number of parallel GC threads, it brings a stable benefit in the Post Evacuate Cleanup 1 phase; but when the number of evacuation failure regions is more than the number of parallel GC threads, the benefit is not stable and can even turn into a regression in that phase.
I think the test result is reasonable. When there are more evacuation failure regions than parallel GC threads, parallelism at the region level already assigns several regions to every GC thread, i.e. the work is already fairly well parallelized; whether chunk-level parallelism brings any extra benefit then depends on how the evacuation failure objects are distributed across the regions. Conversely, when there are fewer evacuation failure regions than threads, region-level parallelism cannot give every GC thread a failed region to process; in that situation chunk-level parallelism can bring more benefit, and the benefit is stable.
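To make the chunk-level idea concrete, here is a tiny standalone sketch (plain C++ with made-up names, not the actual HotSpot code) of workers claiming fixed-size chunks from a shared counter across all failed regions; even with only two failed regions and eight workers, every worker still gets a share of the work:

// Illustrative only: chunk claiming over all failed regions via one
// atomic counter, so parallelism no longer depends on the region count.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int num_threads        = 8;   // stand-in for ParallelGCThreads
  const int num_failed_regions = 2;   // fewer failed regions than threads
  const int chunks_per_region  = 64;  // each region split into fixed-size chunks
  const int total_chunks       = num_failed_regions * chunks_per_region;

  std::atomic<int> next_chunk{0};                // global claim counter
  std::vector<int> chunks_done(num_threads, 0);  // per-thread work tally

  auto worker = [&](int tid) {
    // Claim the next unprocessed chunk, regardless of which failed
    // region it belongs to, until all chunks are taken.
    for (int c = next_chunk.fetch_add(1); c < total_chunks;
         c = next_chunk.fetch_add(1)) {
      chunks_done[tid]++;                        // "process" the chunk
    }
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; t++) threads.emplace_back(worker, t);
  for (auto& t : threads) t.join();

  for (int t = 0; t < num_threads; t++)
    std::printf("thread %d processed %d chunks\n", t, chunks_done[t]);
  return 0;
}

With region-level parallelism the same setup would keep only two of the eight threads busy.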
A simple heuristic is to switch back to the original implementation, i.e. parallelize only at the region level, when we detect that the number of evacuation failure regions is larger than the number of parallel GC threads. The advantage is that it avoids consuming extra CPU on unnecessary chunk-level parallelism. The drawback is that it keeps two pieces of code around: parallelism over regions and parallelism over chunks.
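Purely as an illustration of that switch (hypothetical names again, not a patch against G1), the decision itself would be something like:

// Illustrative sketch of the heuristic described above: fall back to
// one-region-per-thread when there are at least as many failed regions
// as worker threads, otherwise parallelize at chunk granularity.
#include <cstdio>
#include <initializer_list>

enum class EvacFailureStrategy { PerRegion, PerChunk };

static EvacFailureStrategy choose_strategy(unsigned num_failed_regions,
                                           unsigned parallel_gc_threads) {
  // With more failed regions than threads, every thread already gets whole
  // regions to process, so chunking only adds claiming overhead; with fewer
  // regions than threads, chunking keeps the otherwise idle threads busy.
  return (num_failed_regions >= parallel_gc_threads)
             ? EvacFailureStrategy::PerRegion
             : EvacFailureStrategy::PerChunk;
}

int main() {
  const unsigned parallel_gc_threads = 8;
  for (unsigned regions : {1u, 4u, 8u, 32u}) {
    std::printf("%u failed region(s): %s parallelism\n", regions,
                choose_strategy(regions, parallel_gc_threads) ==
                        EvacFailureStrategy::PerRegion
                    ? "region-level"
                    : "chunk-level");
  }
  return 0;
}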
What do you think about it?
Thanks
-------------
PR: https://git.openjdk.java.net/jdk/pull/7047