RFR: 8256265 G1: Improve parallelism in regions that failed evacuation [v2]

Stefan Johansson sjohanss at openjdk.java.net
Fri Dec 10 16:03:18 UTC 2021


On Thu, 2 Dec 2021 01:51:42 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Summary
>> -------
>> 
>> Currently G1 assigns a thread per failed evacuated region. This can in effect serialize the whole process as often (particularly with region pinning) there is only one region to fix up.
>> Try to improve the parallelism when walking over the regions by:
>> 
>>  - first, splitting a region into tasks (see the sketch after this list);
>>  - then, processing these tasks in parallel and load balancing among the GC threads;
>>  - last, doing the necessary cleanup.
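>> 
>> A minimal sketch of the splitting step (hypothetical names and a fixed chunk size; the real chunk sizing would be a tuning decision):
>> 
>> ```
>> #include <cstddef>
>> #include <vector>
>> 
>> struct ChunkTask {
>>   size_t region_index; // which failed region this task belongs to
>>   size_t chunk_index;  // which chunk of that region to process
>> };
>> 
>> // Split one failed region of region_words words into fixed-size chunk
>> // tasks; any GC worker can later process any of these tasks.
>> static void split_region_into_tasks(size_t region_index,
>>                                     size_t region_words,
>>                                     size_t chunk_words,
>>                                     std::vector<ChunkTask>* out) {
>>   size_t num_chunks = (region_words + chunk_words - 1) / chunk_words;
>>   for (size_t c = 0; c < num_chunks; c++) {
>>     out->push_back(ChunkTask{region_index, c});
>>   }
>> }
>> ```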
>> 
>> NOTE: the load balancing part of the code is almost the same as in G1ParScanThreadState; if necessary and feasible, consider refactoring it into a shared code base (a simplified model follows).
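>> 
>> A simplified model of that load balancing (assumed names; the real G1ParScanThreadState task queues are lock-free, unlike the mutex-guarded deques here): each worker drains its own queue first, then tries to steal from the others.
>> 
>> ```
>> #include <cstddef>
>> #include <deque>
>> #include <mutex>
>> #include <vector>
>> 
>> struct ChunkTask { size_t region_index; size_t chunk_index; };
>> 
>> class TaskQueueSet {
>>   struct Queue {
>>     std::mutex lock;
>>     std::deque<ChunkTask> tasks;
>>   };
>>   std::vector<Queue> _queues; // one queue per GC worker
>> 
>> public:
>>   explicit TaskQueueSet(size_t workers) : _queues(workers) {}
>> 
>>   void push(size_t worker, ChunkTask t) {
>>     std::lock_guard<std::mutex> g(_queues[worker].lock);
>>     _queues[worker].tasks.push_back(t);
>>   }
>> 
>>   // The owner takes tasks from the back of its own queue.
>>   bool pop_local(size_t worker, ChunkTask* t) {
>>     std::lock_guard<std::mutex> g(_queues[worker].lock);
>>     if (_queues[worker].tasks.empty()) return false;
>>     *t = _queues[worker].tasks.back();
>>     _queues[worker].tasks.pop_back();
>>     return true;
>>   }
>> 
>>   // A thief takes tasks from the front of another worker's queue.
>>   bool steal(size_t thief, ChunkTask* t) {
>>     for (size_t v = 0; v < _queues.size(); v++) {
>>       if (v == thief) continue;
>>       std::lock_guard<std::mutex> g(_queues[v].lock);
>>       if (!_queues[v].tasks.empty()) {
>>         *t = _queues[v].tasks.front();
>>         _queues[v].tasks.pop_front();
>>         return true;
>>       }
>>     }
>>     return false;
>>   }
>> };
>> ```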
>> 
>> Performance Test
>> -------
>> 
>> The perf test based on the latest implementation + JDK-8277736 shows that:
>> 
>>  - when `ParallelGCThreads`=32 and `G1EvacuationFailureALotCSetPercent` <= 50, the parallelism brings more benefit than regression;
>>  - when `ParallelGCThreads`=128, the parallelism brings more benefit than regression regardless of `G1EvacuationFailureALotCSetPercent`;
>> 
>> Other related evac failure VM options:
>>  - `G1EvacuationFailureALotInterval`=1
>>  - `G1EvacuationFailureALotCount`=1
>> 
>> For detailed perf test result, please check:
>> 
>>  - https://bugs.openjdk.java.net/secure/attachment/97227/parallel.evac.failure-threads.32.png
>>  - https://bugs.openjdk.java.net/secure/attachment/97228/parallel.evac.failure-threads.128.png
>>  
>> For situations like `G1EvacuationFailureALotCSetPercent` > 50 and `ParallelGCThreads`=32, we could fall back to the current implementation, or further optimize the thread sizing in this phase if necessary.
>> 
>> NOTE: I don't include perf data for `Remove Self Forwards`, because the pause time comparison in this phase alone does not show the improvement of this implementation well. I think the reason is that the original implementation is not load balanced while the new one is; and since `Remove Self Forwards` is part of `Post Evacuate Cleanup 1`, only `Post Evacuate Cleanup 1` shows the improvement of the new implementation well.
>> It could be a potential improvement to refine the pause time data in the `Remove Self Forwards` phase.
>
> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits:
> 
>  - Merge branch 'master' into parallelize-evac-failure
>  - Adjust worker cost by a factor; initialize task queues set and terminator threads by active workers
>  - Fix wrong merge
>  - Merge with master
>  - Remove and merge code of G1ParRemoveSelfForwardPtrsTask into RemoveSelfForwardPtrsTask
>  - Fix crashes in ~G1GCParPhaseTimesTracker(), G1PreRemoveSelfForwardClosure::do_heap_region, G1CollectedHeap::par_iterate_regions_array()=>~StubRoutines::atomic entry points; Refine comments
>  - Fix inconsistent length between task queues and terminator
>  - Fix crash when heap verification; Fix compilation error; Refine comments
>  - Initial commit

Hi Hamlin,

I looked at this patch a bit and got some other ideas, similar to some of your initial ideas around using bitmaps. See a PoC here (I've only done limited testing on it):
https://github.com/openjdk/jdk/compare/master...kstefanj:rnd-evac-failure-in-prev-bm

I think the main problem back then was the additional memory overhead, and my approach avoids this by directly using the prev marking bitmap (we could also investigate using the next bitmap). So instead of first recording the failing objects and then marking them in the closure:
``` 
class RemoveSelfForwardPtrObjClosure: public ObjectClosure {
...
  void do_object(oop obj) {
    ...
    // We consider all objects that we find self-forwarded to be
    // live. What we'll do is that we'll update the prev marking
    // info so that they are all under PTAMS and explicitly marked.
    if (!_cm->is_marked_in_prev_bitmap(obj)) {
      _cm->mark_in_prev_bitmap(obj);
    }
    ...
  }
};
```

I directly mark them in the bitmap (which is already in place). One drawback of this approach is that all failing regions need clean bitmaps ready for use; I solve this by clearing them up front. We can probably make this more efficient, but in the PoC I just clear all old regions added to the collection set and all regions compacted in the Full GC. To see the additional cost of the CSet clearing I added a new phase for it, and it takes significant time in some cases, but the whole approach still seems to be a lot faster than the current one. Your logging additions really showed that the sorting part of the current approach is quite slow when many objects fail evacuation (and that will be the normal case with region pinning).
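
To make the idea concrete, here is a hedged, self-contained model of that flow (MarkBitmap, Region and the function names are simplified stand-ins I made up, not G1's actual classes): clear the bitmaps of the old CSet regions up front, then mark self-forwarded objects directly when evacuation fails.

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the prev marking bitmap: one bit per heap word.
class MarkBitmap {
  std::vector<uint64_t> _bits;
public:
  explicit MarkBitmap(size_t num_bits) : _bits((num_bits + 63) / 64, 0) {}
  void mark(size_t bit)            { _bits[bit / 64] |= (uint64_t(1) << (bit % 64)); }
  bool is_marked(size_t bit) const { return (_bits[bit / 64] >> (bit % 64)) & 1; }
  void clear_range(size_t begin, size_t end) {   // clears [begin, end)
    for (size_t b = begin; b < end; b++) {
      _bits[b / 64] &= ~(uint64_t(1) << (b % 64));
    }
  }
};

struct Region { size_t first_bit; size_t end_bit; bool is_old; };

// Up-front step: give every old region added to the collection set a clean
// bitmap, so a failing region can be marked in directly later.
void clear_cset_bitmaps(MarkBitmap& bm, const std::vector<Region>& cset) {
  for (const Region& r : cset) {
    if (r.is_old) {
      bm.clear_range(r.first_bit, r.end_bit);
    }
  }
}

// On evacuation failure: mark the self-forwarded object directly in the
// bitmap instead of first recording it in a side data structure.
void handle_evacuation_failure(MarkBitmap& bm, size_t obj_bit) {
  if (!bm.is_marked(obj_bit)) {
    bm.mark(obj_bit);
  }
}
```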

I also think this approach will be easier to parallelize: we can just split each region into multiple chunks that are taken care of individually (we just need to make sure the zapping is handled correctly); a rough sketch follows. What do you think about this?
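
A rough sketch of the per-chunk work under that scheme (hypothetical helpers, stubbed out below; MarkBitmap is the simplified stand-in from the sketch above): since the live objects are already marked in the bitmap, any worker can take any chunk, walk its marked bits and zap the dead gaps in between, independently of all other chunks.

```
#include <cstddef>

// Hypothetical helpers: fill a dead range with a dummy object, and restore
// the mark word of a self-forwarded live object.
static void zap_dead_range(size_t begin_bit, size_t end_bit) { /* ... */ }
static void fix_self_forwarded(size_t obj_bit)               { /* ... */ }

// Process one chunk [chunk_begin, chunk_end): walk the bits set in the
// bitmap (the live objects) and zap the unmarked gaps between them. No
// other chunk is touched, so workers can claim chunks in any order.
void process_chunk(const MarkBitmap& bm, size_t chunk_begin, size_t chunk_end) {
  size_t gap_start = chunk_begin;
  for (size_t bit = chunk_begin; bit < chunk_end; bit++) {
    if (bm.is_marked(bit)) {
      if (gap_start < bit) {
        zap_dead_range(gap_start, bit);   // dead space before this object
      }
      fix_self_forwarded(bit);
      gap_start = bit + 1;                // real code would skip by object size
    }
  }
  if (gap_start < chunk_end) {
    zap_dead_range(gap_start, chunk_end); // trailing dead space in the chunk
  }
}
```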

There are more things on the agenda that will affect this: Thomas and I have ideas on how to remove the need for two marking bitmaps. Currently we keep the prev bitmap to be able to parse regions with unloaded classes, but there are other ways to avoid this without keeping the second bitmap. We also need the bitmap for verification, but that can also be managed in other ways.

If we do these things, the approach in my PoC would also need to be revised a bit. So the question is how we should proceed. I'm a bit hesitant about this change because of its complexity, and because there is probably a simpler way to improve parallelism if we take another initial approach.

Please let me know what you think.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6627


