RFR: 8256265 G1: Improve parallelism in regions that failed evacuation [v2]

Hamlin Li mli at openjdk.java.net
Fri Dec 3 03:36:16 UTC 2021


On Thu, 2 Dec 2021 12:39:45 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

> I am currently doing a few benchmark runs with it, with so far expected results, and looking over the sources. There are a few places I will comment about later after having a more thorough look, but some initial comments now. They are a bit random, so please bear with me.

Thanks a lot. :)

> 
> General questions/suggestions:
> 
> * we reviewers would really appreciate an overview of the design decisions when posting such a significant change or deviation from the current parallelization mechanism. I.e. this change uses special (overflow-)task queues for distribution, pre-filling them and using the existing work stealing mechanism.
>   No problem with that, but it would have been nice to add this information in addition to the source code dump :)
>   As a bonus, add this information to the CR when done.

Sure, will add this information later. :) (TODO: add info)
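
To make that overview a bit more concrete in the meantime, here is a minimal standalone sketch of the distribution pattern this change follows (pre-filled per-worker queues plus stealing). All names are hypothetical; the real change uses the existing `OverflowTaskQueue` and work-stealing/termination infrastructure rather than plain C++:

```c++
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// One chunk of self-forward-removal work inside a single failed region.
struct ScanChunk {
  size_t region_idx;
  size_t first_obj;
  size_t last_obj;
};

// Stand-in for a per-worker queue; the real code uses OverflowTaskQueue
// and the existing stealing machinery instead of a mutex-protected deque.
struct WorkerQueue {
  std::mutex lock;
  std::deque<ScanChunk> chunks;

  bool pop(ScanChunk& out) {            // owner takes from one end...
    std::lock_guard<std::mutex> g(lock);
    if (chunks.empty()) return false;
    out = chunks.back();
    chunks.pop_back();
    return true;
  }
  bool steal(ScanChunk& out) {          // ...thieves take from the other
    std::lock_guard<std::mutex> g(lock);
    if (chunks.empty()) return false;
    out = chunks.front();
    chunks.pop_front();
    return true;
  }
};

// Phase setup pre-fills the queues with chunks for the failed regions;
// each worker drains its own queue first and then falls back to stealing.
void worker_loop(size_t my_id,
                 std::vector<WorkerQueue>& queues,
                 std::atomic<size_t>& chunks_left,
                 void (*process)(const ScanChunk&)) {
  ScanChunk c;
  while (chunks_left.load(std::memory_order_acquire) > 0) {
    bool got = queues[my_id].pop(c);
    for (size_t v = 0; !got && v < queues.size(); v++) {
      if (v != my_id) got = queues[v].steal(c);
    }
    if (got) {
      process(c);
      chunks_left.fetch_sub(1, std::memory_order_release);
    }
    // The real code offers termination via a TaskTerminator here
    // instead of spinning on a counter.
  }
}
```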

> * further, this change implements limiting the number of threads for a particular subphase of a `G1BatchedTask`. That's new functionality, and while coincidentally we in the gc team talked about something like this (including the use of task queues) as a possibility for PR#6612 a few days ago, it would have been nice to discuss such a change - and, as usual, split it out beforehand. :)

Not sure I get the point; the current implementation calculates the thread sizing to set the terminator threads and the task queue size at the end of the `G1PostEvacuateCollectionSetCleanupTask1` constructor. I think there is a better refactoring solution to get this done.
I will check PR #6612 further to see what we can do first. :) (TODO: double check)

> * it would be interesting to understand whether you tried other approaches on work distribution here and what their results were (if so). Naively I would have initially "just" put the `G1EvacFailureObjectsSet`s into an array of those, and provided an atomic claim counter for those. Not that it is particularly better, but it is interesting that the change immediately goes for the task queue approach. (One problem with that is that the task stealing isn't very scalable to a lot of threads at the moment, particularly on aarch64).

Thanks, yes, providing an atomic claimer is another way to do it; let me measure it later. (TODO: measure)
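
Just to make sure I read the suggestion correctly, something like the following standalone sketch (hypothetical names, not the actual HotSpot classes)?

```c++
#include <atomic>
#include <cstddef>
#include <vector>

// Stand-in for G1EvacFailureObjectsSet: the objects recorded for one
// evacuation-failed region.
struct FailedRegionObjects { /* ... */ };

// Alternative distribution: a flat array of per-region sets plus a single
// atomic claim counter; each worker loops on claim_next() until nullptr,
// with no task queues and no stealing.
class FailedRegionClaimer {
  std::vector<FailedRegionObjects*> _sets;
  std::atomic<size_t>               _next{0};

public:
  explicit FailedRegionClaimer(std::vector<FailedRegionObjects*> sets)
    : _sets(std::move(sets)) {}

  FailedRegionObjects* claim_next() {
    size_t idx = _next.fetch_add(1, std::memory_order_relaxed);
    return idx < _sets.size() ? _sets[idx] : nullptr;
  }
};
```

The obvious question with that variant is granularity: one claim is a whole region, so a single densely failed region could still leave one thread with most of the work. That is something the measurements should show.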

> * the `worker_cost` could certainly be improved a little: if there are a lot of threads (like 128 in your example), and a given task does around 1000 entries, and there are not (much) more than 128000 objects to process (which seems little, but there are no statistics gathered by this change), there is no point in using all those 128 threads (and you do not want to start 128 threads for in-total 128 tasks; currently the heuristic is 5 threads per region). So if it is not too much work (that information might be provided anyway), I would prefer a more exact estimate.

Yes, this part of the thread sizing is only a rough draft; we will definitely need to refine the estimate. (TODO)
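
As a note for that refinement, the shape of the estimate I have in mind is roughly the following (a standalone sketch; `objects_per_worker` is a hypothetical tuning knob, and the statistics to feed it still need to be collected):

```c++
#include <algorithm>
#include <cstddef>

// Derive the worker count from the number of recorded failed objects
// instead of a fixed 5-threads-per-region factor.
static unsigned estimate_workers(size_t num_failed_objects,
                                 unsigned max_workers,
                                 size_t objects_per_worker = 4096) {
  if (num_failed_objects == 0) {
    return 0;
  }
  // Ceiling division: enough workers to give each roughly one batch.
  size_t wanted = (num_failed_objects + objects_per_worker - 1) / objects_per_worker;
  return (unsigned)std::min<size_t>(wanted, max_workers);
}

// Example: 128000 failed objects on a 128-thread machine would request
// about 32 workers rather than all 128.
```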

> * you mentioned about a possible optimization for 32 threads: not sure this is the target we should aim for with this change

Agree.
>  - I would actually count a 32 thread machine already as fairly "large" by server/VM standards but you can convince me otherwise :) So while the change should scale up, smaller sizes are likely more interesting.

Thanks for the suggestion. Let me measure this later. (TODO: measure)

>  * not sure if there is a problem with always stuffing all possible tasks into the task queues at the beginning: that seems fairly expensive to do, and the overflow mechanism should be avoided as it allows unbounded memory use (consider all regions with all objects on the heap failing - the code is stuffing billions / 1000 tasks into the queues then). Of course this needs evaluation of how many tasks there typically are - the `TASKQUEUE_STATS` mentioned below can give an idea...

Agree. This is another part that needs to be refined. (TODO)

>   I'm also not sure if it is useful to provide separate task queues for this mechanism, maybe the existing one can be reused.

Do you mean the `G1EvacFailureParScanTasksQueue* _task_queue` in `G1EvacFailureParScanState`? It's just an `OverflowTaskQueue`, not a new structure. Would you mind clarifying further? :)

>  At least not sure if 128k entries for each queue allocated at startup for every thread is a good idea (Did you measure how big they are? These tasks are significantly larger than others). Again, statistics can help.

I don't quite get your point; would you mind clarifying further what you mean by "128k entries for each queue allocated at startup for every thread"? :)

> * the results posted here are not very useful given that we do not know the environment and the actual application and settings. E.g. something like `dacapo h2` with 1gb -Xms and -Xmx would be much better. I can only guess that you were using some ARM servers for testing, but may as well be something like a dual-socket Epyc.

Sure, I will try to use a more standard benchmark to measure, and will update the results and environment information later too.

>   All that helps categorizing the results quite a bit - see the question from Kirk. E.g. mention that at higher values of `G1EvacuationFailureALotCSetPercent`, particularly with the other settings, the entire young gen will fail, so it depends on luck, at least after a few gcs, which objects each gc is going to not evacuate, making the measurements quite useless without a confidence interval; I would largely guess that they are equal in the end.

I think so too.

>   I.e. this change needs something like [JDK-8140326](https://bugs.openjdk.java.net/browse/JDK-8140326) for example.

Yes, this is to be done.

>   But there may be other causes for the problem, and without providing statistics you can't tell.
> 
> Other requests:
> 
> * traceability: the change needs log messages at different levels, first some summary information (at debug) and second some detailed information (at trace).
>   For the debug level message I'd suggest an extension of the gc+heap=debug messages to also show the number of failed regions of that category.
>   E.g. for these messages
> 
> ```
> [1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56)
> [1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8)
> [1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56
> [1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2
> [1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5
> ```
> 
> to add something like
> 
> ```
> [1.912s][info ][gc,heap      ] GC(0) Eden regions: 64->0(56) [0]
> [1.912s][info ][gc,heap      ] GC(0) Survivor regions: 0->8(8) [3]
> [1.912s][info ][gc,heap      ] GC(0) Old regions: 0->56 [2]
> [1.912s][info ][gc,heap      ] GC(0) Archive regions: 2->2 [0]
> [1.912s][info ][gc,heap      ] GC(0) Humongous regions: 5->5 [2]
> ```
> 
> and even also add the distinction between actual evac failed and pinned regions. I'm not suggesting my idea is perfect, to be discussed with others.
> 
> Also maybe some additional `gc+evacfail` or something category could track more information on detail level per region statistics on which regions failed (region#), how many objects/bytes failed, and the reasons for the failure (pinned and/or "real" evacuation failure). Note that some (or most) of this information should already be provided in an extra CR before doing this change imho.
> 
> * maybe also the "Remove Self Forwards" phase could get some additional information about how many regions/tasks/bytes/objects it worked on (just throwing some ideas here, some may be redundant)
> * also Termination information for any phase that uses a `TaskTerminator` is a must; also, the `TASKQUEUE_STATS` mechanism should be supported in a useful way.
> 

Couldn't agree more. I will first collect more statistics by adding logging; I will do that in another PR first.

> Thanks for your patience, Thomas

Thanks a lot for the detailed initial comments.
-Hamlin

-------------

PR: https://git.openjdk.java.net/jdk/pull/6627


