RFR: 8256265: G1: Improve parallelism in regions that failed evacuation [v9]

Mon Feb 28 09:58:50 UTC 2022

On Sat, 26 Feb 2022 12:31:32 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Currently G1 assigns a thread per failed evacuated region. This can in effect serialize the whole process as often (particularly with region pinning) there is only one region to fix up.
>> 
>> This patch tries to improve parallelism when walking over the regions in chunks
>> 
>> Latest implementation scans regions in chunks to bring parallelism, it's based on JDK-8278917 which changes to uses prev bitmap to mark evacuation failure objs.
>> 
>> Here's the summary of performance data based on latest implementation, basically, it brings better and stable performance than baseline at "Post Evacuate Cleanup 1/remove self forwardee" phase. (Although some regression is spotted when calculate the results in geomean, becuase one pause time from baseline is far too small than others.)
>> 
>> The performance benefit trend is:
>>  - pause time (Post Evacuate Cleanup 1) is decreased from 76.79% to 2.28% for average time, from 71.61% to 3.04% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=8)
>>  - pause time (Post Evacuate Cleanup 1) is decreased from 63.84% to 15.16% for average time, from 55.41% to 12.45% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=<default=123>)
>> ( Other common Evacuation Failure configurations are:
>> -XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0 )
>> 
>> For more detailed performance data, please check the related bug.
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Thomas & Albert review

Thank you for the revision; some further comments/suggestions/questions:

1. IMO, `par_iterate_chunks_in_regions` is less readable than the explicit for-loop I suggested above.

2. I think the `update_states` method could be more precisely structured to sth like:

struct RegionMarkedWordsCache {
  int _region_idx;
  size_t _marked_words;

  constructor {
    _region_idx = invalid_region_idx;
  }

  void add(int region_idx, size_t marked_words) { // <-- update_states
    if (_region_idx == region_idx) {
      _marked_words += marked_words;
    } else ...
  }

  void flush() { // <-- sync_last_region_data
    ...
  }
};

The benefit of having such kind of a cache is that one can easily enable/disable it to check whether a cache is really needed in this case. (I think it does, but it's better to do some benchmarking.)

3. There seems some dead code in this PR, such as `G1ParRemoveSelfForwardPtrsTask::initialize_chunks`, `G1ScanChunksInHeapRegions::_evac_failure_regions_length`. (I may have missed others.)

4. I don't understand why `RemoveSelfForwardChunksPerRegion` is inside `...WorkItems`. #chunks_per_region should be a constant shared by all workers. Its siblings are worker-local stats, which are sensible to have inside `...WorkItems`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7047