RFR: 8256265: G1: Improve parallelism in regions that failed evacuation

Thu Jan 27 10:21:34 UTC 2022

On Thu, 27 Jan 2022 09:23:27 GMT, Hamlin Li <mli at openjdk.org> wrote:

> >Some other initial thoughts worth considering:
> >
> > *) What I already noticed yesterday on some tests, and can also be seen in your log snippet, is that the "Remove self-forwards in chunks" takes a lot of time, unexpectedly much to me actually. I want to look further into this to understand the reason(s).
> 
> In fact, normally most of time of "Post Evacuate Cleanup 1" is spent on "Restore Retained Regions" in baseline version. In parallel version, the proportion of "Restore Retained Regions" in "Post Evacuate Cleanup 1" is reduced. [...]

I agree. This has only been a general remark about its performance, not meant to belittle the usefulness this change and in general all the changes in this series have, which are quite substantial. 👍  I compared the throughput (bytes/ms) between `Object Copy` and this phase, and at least without JDK-8280374 the `remove self forwards` is only like 2x the throughput of `Object Copy`, which seemed quite bad compared to what they do. With JDK-8280374 the results are much better (~4.5x) afaict on a single benchmark I tried though. It's a bit hard to reproduce the exact situation/heap though...

Another (future) optimization that may be worthwhile here may be to get some occupancy statistics of the chunks and switch between walking the bitmap and walking the objects; another one that might be "simpler" to implement (but fairly messy probably) is to simply check if the object after the current one is also forwarded, and if so, do not switch back to the bitmap walking but immediately process that one as well.
This might help somewhat because given typical avg. object sizes (~40 bytes), the mark word after the current one might be already in the cache anyway, so a read access practically free.

These are only ideas though.

> > *) The other concern I have is whether we really need (or can avoid) the need for the "Wait for Ready In Retained Regions" phase. It looks a bit unfortunate to actually have a busy-loop in there; this should definitely use proper synchronization or something to wait on if it is really needed. What of the retained region preparation do we really need? On a first look, maybe just the BOT reset, which we might be able to put somewhere else (I may be totally wrong). Also, if so, the Prepare Retained regions should probably be split out to be started before all other tasks in this "Post Evacuate Cleanup 1" phase.
> 
>> I can see that from a timing perspective "Wait For Ready" is not a problem in all of my tests so far.
> 
> Yes, currently seems "Wait For Ready" does not cost much time, as "Prepared Retained Regions" is quick, not sure if synchronization will help any more.
> But I will investigate if we can omit "Prepared Retained Regions" and "Wait For Ready" subphases totally to simplify the logic. [TODO]

The point of "proper synchronization" isn't that it's faster, but it does not burn cpu cycles unnecessarily which potentially keeps the one thread that others are waiting on do the work. If we can remove the dependencies between the "Prepare Retained Regions" and the remaining phases, which only seems to be the BOT. One idea is that maybe all that prepare stuff could be placed where G1 adds that region to the list of retained regions. This does not work for the liveness count obviously - but that can be recreated by the actual self forwarding removal as suggested earlier 😸).

Then none of that is required which is even better.

> 
> >*) The "Prepared Retained Regions" phase stores the amount of live data into the HeapRegion; for this reason the change adds these G1RegionMarkStats data gathering via the G1RegionMarkStatsCache; I think the same information could be provided while iterating over the chunks (just do an Atomic::add here) instead. A single Atomic::add per thread per retained region at most seems to be okay. That would also remove the Evac Fail Merge Live phase afaict.
> 
> I will do this refactor soon.

Thanks!

> 
> >*) Not too happy that the G1HeapRegionChunk constructor does surprisingly much work, which surprisingly takes very little time.
> >
>  >*) I was wondering whether it would be somewhat more efficient for the Prepare Chunks phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.
> 
> I will put it on backlog to see if it can be simplified. [TODO]
> 

Not necessarily simplified: one option is to make that work explicit (we tend to try to not do too much work in constructors - but maybe this just fits here), another is to pre-calculate some of these values during evacuation failure somehow.

We can maybe also postpone the optimization part of that suggestion given that currently that phase takes almost no time if it seems to be too much work.

Thanks for your hard work,
  Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/7047