RFR: 8256265 G1: Improve parallelism in regions that failed evacuation [v2]

Thomas Schatzl tschatzl at openjdk.java.net
Thu Dec 2 12:42:25 UTC 2021


On Thu, 2 Dec 2021 01:51:42 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Summary
>> -------
>> 
>> Currently G1 assigns a thread per evacuation-failed region. This can in effect serialize the whole process, as often (particularly with region pinning) there is only one region to fix up.
>> Try to improve the parallelism when walking over the regions by
>> 
>>  - first, splitting a region into tasks (see the sketch below);
>>  - then, processing these tasks in parallel, load balancing among GC threads;
>>  - last, doing the necessary cleanup.
>> 
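>> A rough, standalone C++ sketch of the first step (all names are made up for illustration; this is not the actual patch code): split a failed region's objects into fixed-size chunk tasks that can then be spread over the workers' task queues.
>> 
>> #include <algorithm>
>> #include <cstddef>
>> #include <cstdint>
>> #include <vector>
>> 
>> // One unit of work: a slice of the failed objects of one region.
>> struct EvacFailureChunk {
>>   uint32_t region_index;   // which evacuation-failed region
>>   size_t   first_object;   // index of the first failed object in the chunk
>>   size_t   num_objects;    // number of failed objects in this chunk
>> };
>> 
>> // Split a region's failed objects into chunks of at most chunk_size objects.
>> static void split_region_into_chunks(uint32_t region_index,
>>                                      size_t num_failed_objects,
>>                                      size_t chunk_size,  // assumed tuning knob
>>                                      std::vector<EvacFailureChunk>& out) {
>>   for (size_t first = 0; first < num_failed_objects; first += chunk_size) {
>>     size_t len = std::min(chunk_size, num_failed_objects - first);
>>     out.push_back(EvacFailureChunk{region_index, first, len});
>>   }
>> }
>> 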
>> NOTE: the load-balancing part of the code is almost the same as in G1ParScanThreadState; if necessary and feasible, consider refactoring this part into a shared code base.
>> 
>> Performance Test
>> -------
>> 
>> The perf test, based on the latest implementation + JDK-8277736, shows that:
>> 
>>  - when `ParallelGCThreads`=32 and `G1EvacuationFailureALotCSetPercent` <= 50, the parallelism brings more benefit than regression;
>>  - when `ParallelGCThreads`=128, whatever `G1EvacuationFailureALotCSetPercent` is, the parallelism brings more benefit than regression;
>> 
>> Other related evac failure VM options:
>>  - `G1EvacuationFailureALotInterval`=1
>>  - `G1EvacuationFailureALotCount`=1
>> 
>> For detailed perf test result, please check:
>> 
>>  - https://bugs.openjdk.java.net/secure/attachment/97227/parallel.evac.failure-threads.32.png
>>  - https://bugs.openjdk.java.net/secure/attachment/97228/parallel.evac.failure-threads.128.png
>>  
>> For situations like `G1EvacuationFailureALotCSetPercent` > 50 and `ParallelGCThreads`=32, we could fall back to the current implementation, or further optimize the thread sizing in this phase if necessary.
>> 
>> NOTE: I don't include perf data for `Remove Self Forwards`, because the pause time comparison in this phase does not show the improvement of this implementation well. I think the reason is that the original implementation is not load balanced while the new one is. But since `Remove Self Forwards` is part of `Post Evacuate Cleanup 1`, only `Post Evacuate Cleanup 1` shows the improvement of the new implementation well.
>> Refining the pause time data in the `Remove Self Forwards` phase could be a potential future improvement.
>
> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
> 
>  - Merge branch 'master' into parallelize-evac-failure
>  - Adjust worker cost by a factor; initialize task queues set and terminator threads by active workers
>  - Fix wrong merge
>  - Merge with master
>  - Remove and merge code of G1ParRemoveSelfForwardPtrsTask into RemoveSelfForwardPtrsTask
>  - Fix crashes in ~G1GCParPhaseTimesTracker(), G1PreRemoveSelfForwardClosure::do_heap_region, G1CollectedHeap::par_iterate_regions_array()=>~StubRoutines::atomic entry points; Refine comments
>  - Fix inconsistent length between task queues and terminator
>  - Fix crash when heap verification; Fix compilation error; Refine comments
>  - Initial commit

I am currently doing a few benchmark runs with it, with expected results so far, and looking over the sources. There are a few places I will comment on later after a more thorough look, but here are some initial comments now. They are a bit random, so please bear with me.

General questions/suggestions:

- we reviewers would really appreciate an overview of the design decisions when posting such a significant change or deviation from the current parallelization mechanism. I.e. this change uses special (overflow-)task queues for distribution, pre-filling them and using the existing work stealing mechanism.
No problem with that, but it would have been nice to add this information in addition to the source code dump :)
As a bonus, add this information to the CR when done.
- further, this change implements limiting the number of threads for a particular subphase of a `G1BatchedTask`. That's new functionality, and while coincidentally we in the GC team talked about something like this (including the use of task queues) as a possibility for PR#6612 a few days ago, it would have been nice to discuss such a change - and, as usual, split it out beforehand. :)
- it would be interesting to understand whether you tried other approaches to work distribution here and, if so, what their results were. Naively I would have initially "just" put the `G1EvacFailureObjectsSet`s into an array and provided an atomic claim counter for claiming them (a sketch of what I mean follows right after this list). Not that it is particularly better, but it is interesting that the change immediately goes for the task queue approach. (One problem with that is that the task stealing isn't very scalable to a lot of threads at the moment, particularly on aarch64.)
- the `worker_cost` could certainly be improved a little: if there are a lot of threads (like 128 in your example) and a given task covers around 1000 entries, then with not (much) more than 128000 objects to process (which seems little, but there are no statistics gathered by this change) there is no point in using all those 128 threads (and you do not want to start 128 threads for 128 tasks in total; currently the heuristic is 5 threads per region). So if it is not too much work (that information might be provided anyway) I would prefer a more exact estimate; see the second sketch after this list for the kind of thing I mean.
- you mentioned a possible optimization for 32 threads: I am not sure this is the target we should aim for with this change - I would actually count a 32-thread machine as already fairly "large" by server/VM standards, but you can convince me otherwise :) So while the change should scale up, smaller sizes are likely more interesting.
- not sure if there is a problem with always stuffing all possible tasks into the task queues at the beginning: that seems fairly expensive to do, and the overflow mechanism should be avoided as it allows unbounded memory use (consider all regions with all objects on the heap failing - the code is stuffing billions / 1000 tasks into the queues then). Of course this needs an evaluation of how many tasks there typically are - the `TASKQUEUE_STATS` mentioned below can give an idea...
I'm also not sure if it is useful to provide separate task queues for this mechanism; maybe the existing ones can be reused. I am at least not sure that allocating 128k entries per queue at startup for every thread is a good idea (did you measure how big they are? These tasks are significantly larger than others). Again, statistics can help.
- the results posted here are not very useful given that we do not know the environment and the actual application and settings. E.g. something like `dacapo h2` with 1 GB -Xms and -Xmx would be much better. I can only guess that you were using some ARM servers for testing, but it may as well be something like a dual-socket Epyc.
All that helps categorize the results quite a bit - see the question from Kirk. E.g. mention that at higher values of `G1EvacuationFailureALotCSetPercent`, particularly with the other settings, the entire young gen will fail, so it depends on luck (at least after a few GCs) which objects each GC is not going to evacuate, making the measurements quite useless without a confidence interval; I would largely guess that they are equal in the end.
I.e. this change needs something like [JDK-8140326](https://bugs.openjdk.java.net/browse/JDK-8140326) for example.
But there may be other causes for the problem, and without providing statistics you can't tell.
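
To make the "array plus atomic claim counter" alternative from above concrete, here is a minimal, purely illustrative C++ sketch (names and types are made up; this is not HotSpot code): workers claim per-region work items through a shared atomic counter, with no task queues and no stealing.

#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for the per-region failed-objects data (think G1EvacFailureObjectsSet).
struct FailedRegionWork { /* payload omitted */ };

class FailedRegionClaimer {
  std::vector<FailedRegionWork*> _work;   // one entry per evacuation-failed region
  std::atomic<size_t>            _next;   // shared claim counter

public:
  explicit FailedRegionClaimer(std::vector<FailedRegionWork*> work)
    : _work(std::move(work)), _next(0) {}

  // Each GC worker calls claim() in a loop until it returns nullptr.
  FailedRegionWork* claim() {
    size_t idx = _next.fetch_add(1, std::memory_order_relaxed);
    return idx < _work.size() ? _work[idx] : nullptr;
  }
};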
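
Similarly, for the `worker_cost` point, this is the kind of more exact estimate I have in mind, again only as a hedged sketch (the names and the tasks-per-worker constant are assumptions, not the patch's heuristics): derive the worker count from the number of queued tasks rather than from a fixed number of threads per failed region.

#include <algorithm>
#include <cstddef>

// Rough estimate: one worker per tasks_per_worker queued tasks,
// clamped to at most max_workers (e.g. the active ParallelGCThreads).
// tasks_per_worker must be > 0.
static size_t estimate_workers(size_t num_tasks,
                               size_t tasks_per_worker,
                               size_t max_workers) {
  if (num_tasks == 0 || max_workers == 0) {
    return 1;
  }
  size_t wanted = (num_tasks + tasks_per_worker - 1) / tasks_per_worker;
  return std::max<size_t>(1, std::min(wanted, max_workers));
}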

Other requests:
- traceability: the change needs log messages at different levels, first some summary information (at debug) and second some detailed information (at trace).
For the debug-level message I'd suggest extending the gc+heap=debug messages to also show the number of failed regions in that category.
E.g. for these messages

[1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56)
[1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8)
[1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56
[1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2
[1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5

to add something like

[1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56) [0]
[1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8) [3]
[1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56 [2]
[1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2 [0]
[1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5 [2]

and maybe even add the distinction between actually evac-failed and pinned regions.
I'm not suggesting my idea is perfect; it is to be discussed with others.

Also, maybe some additional `gc+evacfail` (or similar) category could track, at detail level, per-region statistics on which regions failed (region #), how many objects/bytes failed, and the reasons for the failure (pinned and/or "real" evacuation failure). Note that some (or most) of this information should already be provided in an extra CR before doing this change imho.

- maybe also the "Remove Self Forwards" phase could get some additional information about how many regions/tasks/bytes/objects it worked on (just throwing out some ideas here; some may be redundant)
- also, termination information for any phase that uses a `TaskTerminator` is a must; and the `TASKQUEUE_STATS` mechanism should be supported in a useful way.

Thanks for your patience,
  Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/6627


