G1 RSet improvement ideas

Thomas Schatzl thomas.schatzl at oracle.com
Thu Feb 6 10:22:21 UTC 2014


On Wed, 2014-02-05 at 17:22 -0500, Tony Printezis wrote:
> Thomas,
> 
> On 2/4/14, 5:34 AM, Thomas Schatzl wrote:
> > Hi,
> >
> > On Mon, 2014-02-03 at 12:13 -0500, Tony Printezis wrote:
> >> Thomas,

> >>> In that direction one could experiment with another idea, a "sticky young
> >>> gen" value in the card table that can also be used to filter out
> >>> references between old gen regions originating from these cards (they
> >>> will be scanned anyway). I think I had that implemented
> >>> somewhere/sometime; it should work.
> >>>
> >>> I.e. during refinement, if we find a to-young reference, instead of
> >>> clearing the card at the end, mark it with some "sticky young" value so
> >>> that the barrier will not redirty that card. Maybe just leaving it dirty
> >>> would also be sufficient.
> >>>
> >>> Note that this idea dirties the card table, so requires clearing.
> >> Note that G1 does try to avoid "dirty / refine / clean / re-dirty /
> >> re-refine / re-clean / etc." patterns on cards (whether they point to
> >> the young gen or not) using the hot card cache (unless you've recently
> >> removed it! haven't looked at that code for a while... but it still
> >> seems to be there).
> > In this idea you would just keep cards having any young reference dirty on
> > the card table, i.e. automatically filtering out duplicates across *all*
> > RSets. Except for the additional data structure maintenance, that is
> > equal to having a single young RSet.
> 
> But isn't this exactly what the hot card cache does? It keeps the card 
> dirty so that it's not enqueued again by the post-barrier and it's not 
> added to multiple RSets. I can see how keeping a card dirty that points 
> into the young gen (to generalize this: keeping a card dirty that points 
> into the next collection set) can reduce the concurrent refinement 
> overhead. But I believe the hot card cache should have a similar effect.

Measurements years ago on the old hcc implementation, on large machines
with tens of GB of heap running OpenDS - some database with its index
stored in memory and the remaining memory managed as an LRU cache (and
some other basic ad-hoc apps that try to replicate that) - indicated
that the hot card cache just slowed down the application.

(Iirc) the issue was that the hot card cache got really large with
such big heaps (~100M?), and since it is basically a hash map, it
produced lots of random accesses and cache misses.

This may have changed with the new implementation, I do not know. It is
also not something that can be generalized.
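
To illustrate the concern, here is a minimal sketch of a hash-indexed
hot card cache; the names, sizing and hot threshold are all invented,
this is not the actual HotSpot implementation:

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  typedef uint8_t* CardPtr;

  // Illustrative direct-mapped hot card cache. Each refined card hashes
  // into one flat table; sized relative to a tens-of-GB heap the table
  // becomes huge, so every lookup is effectively a random memory access
  // and therefore a likely cache miss.
  class HotCardCache {
    struct Entry {
      CardPtr  card;   // card whose refinement we may defer
      uint32_t count;  // how often this card was dirtied recently
    };
    Entry* _table;
    size_t _num_entries;

   public:
    explicit HotCardCache(size_t num_entries)
        : _table(new Entry[num_entries]), _num_entries(num_entries) {
      std::memset(_table, 0, num_entries * sizeof(Entry));
    }
    ~HotCardCache() { delete[] _table; }

    // Returns true if the card counts as "hot", i.e. its refinement
    // should be deferred until it is processed wholesale at the GC.
    bool is_hot(CardPtr card) {
      size_t idx = (reinterpret_cast<uintptr_t>(card) >> 3) % _num_entries;
      Entry& e = _table[idx];     // random access: the likely cache miss
      if (e.card != card) {       // direct-mapped: evict on collision
        e.card = card;
        e.count = 1;
        return false;
      }
      return ++e.count > 4;       // arbitrary hot threshold
    }
  };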

By far the best option was to delay refinement entirely until the GC -
which is (in part) just what these sticky cards would achieve, by
avoiding repeated refinement of the same cards.
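
As a rough sketch of the sticky-young idea (the card values and helper
names are invented for illustration, not the real card table constants):

  #include <cstdint>

  // Hypothetical card values; the real card table uses other constants.
  const uint8_t clean_card        = 0xff;
  const uint8_t dirty_card        = 0x00;
  const uint8_t sticky_young_card = 0x01; // the proposed marker

  // Placeholder stubs standing in for the real scanning/queueing logic.
  static bool card_has_young_reference(volatile uint8_t*) { return false; }
  static void add_to_rsets(volatile uint8_t*) {}
  static void enqueue_for_refinement(volatile uint8_t*) {}

  // Sketch of concurrently refining one card.
  void refine_card(volatile uint8_t* card) {
    bool to_young = card_has_young_reference(card);
    add_to_rsets(card);            // record old-to-old references as usual
    if (to_young) {
      // Leave a marker instead of cleaning: the post-write barrier below
      // then never re-dirties or re-enqueues this card, and the card is
      // scanned at the next GC anyway.
      *card = sticky_young_card;
    } else {
      *card = clean_card;
    }
  }

  // Matching post-write barrier filter (simplified).
  void post_barrier(volatile uint8_t* card) {
    uint8_t v = *card;
    if (v == dirty_card || v == sticky_young_card) {
      return;                      // already pending, or sticky young
    }
    *card = dirty_card;
    enqueue_for_refinement(card);
  }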

> >> FWIW: apart from avoiding card duplication, using a single RSet for all
> >> young regions will also have additional advantages: lower footprint
> >> (only one table instead of one per region), faster to reclaim, fewer
> >> malloc / frees, etc.
> > I am not so sure about footprint, see below.
> 
> If the RSets are reasonably sparse, having one fuller table instead of
> hundreds of very sparse tables should be a win (IMHO at least).

Sure.


> >> I don't completely agree with this I have to say. Let's assume you have
> >> some card duplication across all young RSets (for the sake of argument).
> >> Having each thread visit each RSet (imagine you have a few
> >> hundred), check whether it's been claimed, try to claim it, then try to
> >> claim cards and contend on the card table is not cheap. Just looking at
> >> a single RSet, seeing quickly that its entries have been claimed by
> >> other GC threads, and going on to the next phase could be much more
> >> efficient.
> > You certainly have more experience here. I do not know the exact
> > implementation of the claim mechanism, but I will look at it. However, I
> > assume the claiming is coarse enough that its overhead is small - and so
> > is the potential gain.
> 
> As a general rule: Don't assume anything in G1. ;-)

Good idea :)
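
For illustration, the kind of coarse claiming I would assume - purely a
sketch with invented types, not the actual G1 mechanism:

  #include <atomic>
  #include <cstddef>

  // Hypothetical claim record: workers claim whole chunks of an RSet at
  // a time, so the atomic fetch_add cost is amortized over many cards.
  struct RSetClaim {
    std::atomic<size_t> next_chunk{0};
    size_t num_chunks;
  };

  // Each worker grabs the next unclaimed chunk; once the counter passes
  // num_chunks the RSet is exhausted and the worker can quickly move on
  // to the next RSet (or the next phase).
  bool claim_next_chunk(RSetClaim& c, size_t& chunk_out) {
    size_t idx = c.next_chunk.fetch_add(1, std::memory_order_relaxed);
    if (idx >= c.num_chunks) return false;
    chunk_out = idx;
    return true;
  }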

> >>> (G1 typically takes the easy approach by distributing work on a per-region
> >>> basis. That's more an artifact of a quick solution than a real solution.)
> >> Are you talking about distributing RSet scanning work? Or in general?
> > Both and in general.
> 
> I think in general, the region heap subdivision is at least a good 
> starting point for load balancing. In some cases, that's enough 

It is, yes, I agree.

> > Others are, off the top of my head, generic parallelization problems,
> > either not tied to regions or not yet parallelized (code root migration
> > (no CR), free cset (JDK-8027295),
> 
> I've always wanted to do that concurrently, after the pause is done. 
> There's no real reason to need to do that during a pause. (I'll add a 
> couple comments on that CR.)

I saw them, thanks. I added (a little) measurement data from the current
prototype. It works slightly differently; I will look at your
suggestion.

> > class loader data graph (JDK-8030144), card redirtying
> > (JDK-8019342)).
> 
> If the pause didn't use the card table for GC threads to claim cards, 
> you might have been able to do that during the pause.

Okay.

> >> interesting trade-off here. I can totally see the attraction of building
> >> the RSets piggy-backed on marking. However, you might end up building
> >> several RSets for regions that you won't end up collecting (because
> >> they're very dense with live objects); so that'd be wasted work. I kinda
> >> like the approach of doing it as a separate step as it allows you to
> >> decide exactly which RSets to build. Also, maybe the idea I mentioned
> > There's a simple workaround: if the RSet grows too complicated, drop the
> > rset of that region, given some complexity (e.g. memory) bound. Also,
> > you might use previous markings to determine which regions to prefer.
> 
> Absolutely. However, that could mean a lot of wasted work.

I agree. It's the same problem the Shenandoah GC has.
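
For concreteness, a sketch of that bound-and-drop workaround; all types
and the bound itself are hypothetical:

  #include <cstddef>

  // Hypothetical region/RSet types, for illustration only.
  struct RememberedSet {
    size_t num_entries = 0;
    size_t mem_size() const { return num_entries * 4; } // ~4 bytes/entry
    void clear() { num_entries = 0; }
  };

  struct HeapRegion {
    RememberedSet rset;
    bool rset_is_tracked = true; // still a candidate for the next mixed GC?
  };

  // While (re)building RSets during marking: if a region's RSet exceeds
  // the complexity (here: memory) bound, drop it and stop considering
  // the region this cycle - bounded memory at the price of wasted work.
  void maybe_drop_rset(HeapRegion& r, size_t mem_bound) {
    if (r.rset.mem_size() > mem_bound) {
      r.rset.clear();
      r.rset_is_tracked = false;
    }
  }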

> > Just thinking loud, with no particular purpose, and not refuting anything
> > you mentioned above:
> >
> > There is a big downside to your suggestion of doing this (blindly): memory
> > consumption. Currently cards are indices within a base region, taking
> > only 4 bytes per entry.
> 
> Yes (and the card indexes used to be 2-byte shorts FWIW).
> 
> > Without the heap region as the base offset for the card index, you
> > would need double the size per entry.
> 
> Not sure why. You can still organize the RSets in a similar manner: Each
> RSet component today corresponds to an address range [from,to) of size
> L. You can still structure it as having the from address plus indexes
> into said address range. This is not really different from what we
> have now. I'm just proposing that L should not necessarily be the heap
> region size and that we don't get the [from, to) data from HeapRegion
> objects.

I understood. Just throwing comments your way :)
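
For concreteness, a sketch of that encoding (types invented; L is the
area length and deliberately not tied to the region size):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  const size_t card_size = 512; // bytes covered by one card

  // One sub-RSet component covering the address range [from, from + len).
  // Entries are 4-byte card indices relative to 'from', which is half the
  // cost of storing full 8-byte card addresses on a 64-bit heap.
  struct SubRSet {
    uintptr_t from;               // base address of the covered area
    size_t    len;                // area length L, independent of region size
    std::vector<uint32_t> cards;  // card indices within [from, from + len)

    void add_reference(uintptr_t addr) {
      // precondition: from <= addr < from + len
      cards.push_back((uint32_t)((addr - from) / card_size));
    }
    uintptr_t card_address(uint32_t idx) const {
      return from + (uintptr_t)idx * card_size;
    }
  };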

> > however this will take away from storage efficiency
> > again, as you are probably going to store on a per-region (or per-area)
> > basis to keep memory footprint low. E.g. assuming that you allocate the
> > per-area storage in X-entry-sized chunks, there will be some unused
> > space, which more or less gives you the same effective behavior as
> > before (with the same basic sub-rset implementations).
> >
> > In the case with a single rset for multiple regions, instead of having
> > multiple RSets each having their own per-source region data structure,
> > you have one RSet with per-source "area" data structures. I can see
> > that there is a certain improvement in management overhead. However, you
> > might want more areas per region, which increases memory overhead again.
> 
> I'm not quite following this, I have to say. The RSet data structure

Sorry for being cryptic sometimes...

> encodes where in the heap there are references into a region or set of
> regions. You don't need to encode in the RSet which regions those are.
> You only need to encode where the references are coming from. And you 
> just point to that RSet from all the regions it corresponds to.

You may want to. It helps you to quickly find the parts of the
remembered set that need to be removed after reclaiming some regions.
In this thread we already mentioned that this could be done concurrently
and/or in parallel; this information could help decrease its actual
cost dramatically, so it can be done more regularly.

Currently the scrubbing seems to iterate over all sub-rsets of a given
RSet, which appears to be the main cost.
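
A sketch of the indexing idea (types hypothetical): keep the merged
RSet's entries bucketed by source region, so scrubbing after reclaiming
regions can jump straight to the affected parts instead of walking every
sub-rset:

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  typedef uint32_t RegionIdx;

  // Hypothetical merged RSet: one bucket of card entries per source
  // region, keyed by region number.
  struct MergedRSet {
    std::unordered_map<RegionIdx, std::vector<uint32_t>> per_source;

    // Remove all entries originating from freed regions. With the index
    // this touches only the affected buckets, instead of iterating every
    // sub-rset as the current scrubbing does.
    void scrub(const std::vector<RegionIdx>& freed_regions) {
      for (RegionIdx r : freed_regions) {
        per_source.erase(r); // average O(1) per freed region
      }
    }
  };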

Making this faster might allow idling worker threads to make lots of
progress during the STW pause... (that's just a wacky idea of mine).

Thanks,
  Thomas