RFR: 8256265: G1: Improve parallelism in regions that failed evacuation [v7]

Thu Feb 17 16:39:05 UTC 2022

On Mon, 14 Feb 2022 12:10:51 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Currently G1 assigns a thread per failed evacuated region. This can in effect serialize the whole process as often (particularly with region pinning) there is only one region to fix up.
>> 
>> This patch tries to improve parallelism when walking over the regions in chunks
>> 
>> Latest implementation scans regions in chunks to bring parallelism, it's based on JDK-8278917 which changes to uses prev bitmap to mark evacuation failure objs.
>> 
>> Here's the summary of performance data based on latest implementation, basically, it brings better and stable performance than baseline at "Post Evacuate Cleanup 1/remove self forwardee" phase. (Although some regression is spotted when calculate the results in geomean, becuase one pause time from baseline is far too small than others.)
>> 
>> The performance benefit trend is:
>>  - pause time (Post Evacuate Cleanup 1) is decreased from 76.79% to 2.28% for average time, from 71.61% to 3.04% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=8)
>>  - pause time (Post Evacuate Cleanup 1) is decreased from 63.84% to 15.16% for average time, from 55.41% to 12.45% for geomean, when G1EvacuationFailureALotCSetPercent is changed from 2 to 90 (-XX:ParallelGCThreads=<default=123>)
>> ( Other common Evacuation Failure configurations are:
>> -XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0 )
>> 
>> For more detailed performance data, please check the related bug.
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Clean code; adapt to new bot implementation; others

Doesn't look too bad. :).

src/hotspot/share/gc/g1/g1EvacFailure.cpp line 203:

> 201:   uint max_regions = _evac_failure_regions->max_regions();
> 202:   size_t* marked_words_in_regions = NEW_C_HEAP_ARRAY(size_t, max_regions, mtGC);
> 203:   memset(marked_words_in_regions, 0, sizeof(size_t) * max_regions);

Given that the chunks are handed out region after region, or at least a given thread after switching the region for a given chunk never (or at most very rarely) gets a chunk of the same region again, it would be much more memory conscious to just cache this value for the current region, and flush only if it changes.
Similar to the flushing of the marking statistics.

src/hotspot/share/gc/g1/g1EvacFailure.cpp line 217:

> 215:       region->note_self_forwarding_removal_end_par(marked_words_in_regions[idx] * BytesPerWord);
> 216:     }
> 217:   }

At least put this code into a helper method if not implementing above suggestion.

src/hotspot/share/gc/g1/g1EvacFailure.cpp line 220:

> 218:   phase_times->record_or_add_time_secs(G1GCPhaseTimes::SyncMarkedWordInRetainedRegions, worker_id, (Ticks::now() - start).seconds());
> 219: 
> 220:   FREE_C_HEAP_ARRAY(size_t, marked_words_in_regions);

Indentation, but may be superfluous.

src/hotspot/share/gc/g1/g1GCPhaseTimes.cpp line 507:

> 505:     debug_phase(_gc_par_phases[RecalculateUsed], 1);
> 506:     debug_phase(_gc_par_phases[RestorePreservedMarks], 1);
> 507:     debug_phase(_gc_par_phases[VerifyAfterSelfForwardingPtrRemoval], 1);

That should probably only be printed when the respective verification is enabled, i.e. when this task is actually run.

src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp line 74:

> 72: 
> 73: bool G1ScanChunksInHeapRegionClosure::do_heap_region(HeapRegion* r) {
> 74:   G1GCPhaseTimes* phase_times = G1CollectedHeap::heap()->phase_times();

Maybe rename `phase_times` to `p` to make the calls shorter. Other code does that too.

src/hotspot/share/gc/g1/g1HeapRegionChunk.cpp line 83:

> 81:   while (true) {
> 82:     if (claimer->claim_chunk(chunk_idx)) {
> 83:       Ticks start2 = Ticks::now();

Please rename this to something better than `start2` - yes, I was lazy... Maybe `Ticks chunk_prepare_start = Ticks::now()`.
Unfortunately since the code does that much in the constructor, and we use `chunk` later, we can't just put it into a new scope (to reuse `start`).

src/hotspot/share/gc/g1/g1HeapRegionChunk.hpp line 62:

> 60:   }
> 61: 
> 62:   HeapWord* first_obj_in_chunk() const {

I would prefer if these trivial getters would be trimmed down by doing the `return` in the same line. E.g.

`HeapRegion* heap_region() const { return _region; }`

src/hotspot/share/gc/g1/g1HeapRegionChunk.inline.hpp line 49:

> 47:   }
> 48: 
> 49:   // assert(next_addr == _limit, "Should stop the scan at the limit.");

Remove.

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 114:

> 112:   double worker_cost() const override {
> 113:     assert(_evac_failure_regions->evacuation_failed(), "Should not call this if not executed");
> 114:     return _evac_failure_regions->num_regions_failed_evacuation() * G1YoungGCEvacFailureInjector::evacuation_failure_worker_cost();

This should not depend on anything in the evacuation failure injector. That seems to be the wrong type of dependency. Why not just use the global here?

src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.hpp line 71:

> 69: #endif
> 70: 
> 71:   class VerifyAfterSelfForwardingPtrRemovalTask;

The list above in the comments needs to be updated.

src/hotspot/share/gc/g1/g1_globals.hpp line 83:

> 81:           "injection will be applied.")                                     \
> 82:           range(1, 100)                                                     \
> 83:   product(uint, G1EvacuationFailureALotWorkerCost, 2,                       \

This has nothing to do with the `EvacuationFailureALot` functionality, but is primarily a weight for the forward pointer removal. This ought to be renamed and moved to the regular flag section.

src/hotspot/share/gc/g1/g1_globals.hpp line 89:

> 87:   product(uint, G1EvacuationFailureHeapRegionChunkNum, 256,                 \
> 88:           "Chunks num per G1 region when process evacuation failure "       \
> 89:           "regions in parallel.")                                           \

Number of chunks per G1 region when processing evacuation failed regions in parallel.

I also believe this should be moved to the regular section for the same reasons as the other flag. Also, if we determine the number of chunks dynamically as suggested in the answer to your question about that, this flag is not required.

-------------

Changes requested by tschatzl (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/7047