RFR: 8256265: G1: Improve parallelism in regions that failed evacuation

Hamlin Li mli at openjdk.java.net
Fri Jan 28 08:04:18 UTC 2022


On Thu, 27 Jan 2022 10:17:18 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

> I agree. This has only been a general remark about its performance, not meant to belittle the usefulness this change and in general all the changes in this series have, which are quite substantial. 👍 I compared the throughput (bytes/ms) between `Object Copy` and this phase, and at least without JDK-8280374 the `remove self forwards` is only like 2x the throughput of `Object Copy`, which seemed quite bad compared to what they do. With JDK-8280374 the results are much better (~4.5x) afaict on a single benchmark I tried though. It's a bit hard to reproduce the exact situation/heap though...
> 
> Another (future) optimization that may be worthwhile here may be to get some occupancy statistics of the chunks and switch between walking the bitmap and walking the objects; 

Yes, we have a similar task on our backlog, to fall back to walking the objects if the statistics tell us so.
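
For illustration, the fallback could look roughly like the sketch below. chunk_live_ratio(), walk_all_objects() and walk_marked_via_bitmap() are hypothetical helpers and the threshold is made up; this only sketches the idea, it is not the planned implementation.

  // Hypothetical sketch: pick the cheaper walking strategy per chunk based on
  // how much of the chunk is live (from the gathered statistics).
  void process_chunk(HeapWord* start, HeapWord* limit) {
    double live_ratio = chunk_live_ratio(start, limit);  // hypothetical statistic
    if (live_ratio > 0.75) {                 // threshold made up, needs tuning
      walk_all_objects(start, limit);        // dense chunk: linear object walk
    } else {
      walk_marked_via_bitmap(start, limit);  // sparse chunk: jump between marked bits
    }
  }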

> another one that might be "simpler" to implement (but probably fairly messy) is to simply check if the object after the current one is also forwarded, and if so, do not switch back to the bitmap walking but immediately process that one as well. This might help somewhat because, given typical avg. object sizes (~40 bytes), the mark word after the current one might already be in the cache anyway, so a read access is practically free.

I'm not sure how much this will help. Currently the code looks like the snippet below: if the next object is also marked, the closure will be applied to it in the next loop iteration anyway, so it should get the same cache hit as the approach you suggest above; the difference is only the overhead of the apply(current) method invocation. But I will put it on the backlog too (a hedged sketch of the suggested variant follows the snippet). [TODO]

  // Walk [next_addr, _limit): apply the closure to each marked (self-forwarded)
  // object, otherwise search the bitmap for the next marked address.
  while (next_addr < _limit) {
    Prefetch::write(next_addr, PrefetchScanIntervalInBytes);
    if (_bitmap->is_marked(next_addr)) {
      oop current = cast_to_oop(next_addr);
      next_addr += closure->apply(current);  // apply() returns the object size in words
    } else {
      next_addr = _bitmap->get_next_marked_addr(next_addr, _limit);
    }
  }
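
For comparison, a rough sketch of the suggested variant is below; it peeks at the following object's mark word instead of re-consulting the bitmap, approximating "self-forwarded" as forwardee() == the object itself. This is untested illustration only, not code from the PR.

  while (next_addr < _limit) {
    Prefetch::write(next_addr, PrefetchScanIntervalInBytes);
    if (_bitmap->is_marked(next_addr)) {
      oop current = cast_to_oop(next_addr);
      next_addr += closure->apply(current);
      // Greedily process directly following self-forwarded objects; their mark
      // words are likely already in the cache, so the check is nearly free.
      while (next_addr < _limit) {
        oop next = cast_to_oop(next_addr);
        if (!next->is_forwarded() || next->forwardee() != next) {
          break;
        }
        next_addr += closure->apply(next);
      }
    } else {
      next_addr = _bitmap->get_next_marked_addr(next_addr, _limit);
    }
  }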


> 
> These are only ideas though.
> 
> > [...]
> > But I will investigate if we can omit "Prepared Retained Regions" and "Wait For Ready" subphases totally to simplify the logic. [TODO]
> 
> The point of "proper synchronization" isn't that it's faster, but that it does not burn cpu cycles unnecessarily, which potentially keeps the one thread that others are waiting on from doing the work. We may be able to remove the dependencies between the "Prepare Retained Regions" phase and the remaining phases, which only seem to be the BOT. One idea is that maybe all that prepare stuff could be placed where G1 adds that region to the list of retained regions. This does not work for the liveness count obviously - but that can be recreated by the actual self forwarding removal as suggested earlier 😸.
> 
> Then none of that is required, which is even better.

I have just deleted the code related to "Prepared Retained Regions" and "Wait For Ready", and moved that logic into G1EvacFailureRegions::record(...), SampleCollectionSetCandidatesTask and VerifyAfterSelfForwardingPtrRemovalTask.
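
For reference, the record(...) side of that change has roughly the shape sketched below; the field names are approximations and prepare_region() is a hypothetical placeholder for the preparation work, so treat this as an illustration rather than the actual patch.

  void G1EvacFailureRegions::record(uint region_idx) {
    bool success = _regions_failed_evacuation.par_set_bit(region_idx,
                                                          memory_order_relaxed);
    if (success) {
      // Exactly one thread wins the race for a given region, so it can do the
      // once-per-region setup here without a separate prepare phase.
      size_t offset = Atomic::fetch_and_add(&_evac_failure_regions_cur_length, 1u);
      _evac_failure_regions[offset] = region_idx;
      prepare_region(region_idx);  // hypothetical: BOT / bookkeeping setup
    }
  }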

> 
> > > *) The "Prepared Retained Regions" phase stores the amount of live data into the HeapRegion; for this reason the change adds these G1RegionMarkStats data gathering via the G1RegionMarkStatsCache; I think the same information could be provided while iterating over the chunks (just do an Atomic::add here) instead. A single Atomic::add per thread per retained region at most seems to be okay. That would also remove the Evac Fail Merge Live phase afaict.
> > 
> > 
> > I will do this refactor soon.
> 
> Thanks!

This one is also done.
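
The shape of that change is roughly the sketch below: accumulate the live bytes locally while walking a chunk and publish them with a single Atomic::add, so neither the G1RegionMarkStatsCache nor a separate "Merge Live" step is needed. The _live_bytes counter name is hypothetical and the loop is simplified.

  size_t local_live_bytes = 0;
  while (next_addr < _limit) {
    if (_bitmap->is_marked(next_addr)) {
      oop current = cast_to_oop(next_addr);
      size_t size_in_words = closure->apply(current);
      local_live_bytes += size_in_words * HeapWordSize;
      next_addr += size_in_words;
    } else {
      next_addr = _bitmap->get_next_marked_addr(next_addr, _limit);
    }
  }
  // One atomic update per chunk into the region's (hypothetical) live counter.
  Atomic::add(&_live_bytes, local_live_bytes);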

> 
> > > *) Not too happy that the G1HeapRegionChunk constructor does surprisingly much work, which surprisingly takes very little time.
> > > *) I was wondering whether it would be somewhat more efficient for the Prepare Chunks phase to collect some of the information needed there somehow else. Something is bubbling up in my mind, but nothing specific yet, and as mentioned, it might not be worth doing given its (lack of) cost.
> > 
> > 
> > I will put it on backlog to see if it can be simplified. [TODO]
> 
> Not necessarily simplified: one option is to make that work explicit (we tend to try to not do too much work in constructors - but maybe this just fits here), another is to pre-calculate some of these values during evacuation failure somehow.
> 
> We can maybe also postpone the optimization part of that suggestion if it seems to be too much work, given that currently that phase takes almost no time.

OK, let's get back to this if that phase ever starts to take up a significant amount of time.
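
If we do get back to it, the "make the work explicit" option is just the usual split sketched below, with made-up member names; the real G1HeapRegionChunk constructor takes more parameters.

  // Sketch only: keep the constructor trivial and move the expensive scanning
  // into an explicit initialize() step that callers (and timing code) can see.
  class G1HeapRegionChunk {
    HeapWord* _first_obj_in_chunk;  // hypothetical member
  public:
    G1HeapRegionChunk() : _first_obj_in_chunk(nullptr) { }
    void initialize() {
      // ... do the bitmap scanning currently done in the constructor ...
    }
  };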

> 
> Thanks for your hard work, Thomas

Thanks a lot for the detailed discussion and valuable suggestions, it helps a lot :)

-------------

PR: https://git.openjdk.java.net/jdk/pull/7047


