G1 RSet improvement ideas

Mon Feb 3 09:58:29 UTC 2014

Hi,

  let me have a stab at this long mail with a few half-baked
thoughts... ;)

On Thu, 2014-01-30 at 12:33 -0500, Tony Printezis wrote:
> Hi all,
> 
> I had a recent e-mail exchange with Bengt and Thomas and we touched a 
> bit on G1 RSet work. RSets can be a major bottleneck in G1 right now, 
> but I don't believe it's fundamental issue with the concept. IMHO, the 
> implementation can be substantially improved to be much more efficient. 
> I recently created a few issues in our (Twitter) bug system with some 
> ideas on how to do that. I pasted all the text below (but replaced the 
> internal issue #s with (1), (2), etc.).
> 
> Let me know if you have any ideas, thoughts, etc. on this.
> 
> Tony
> 
> ----------
> 
> (1) Use a single RSet for all young regions
> 
> In G1 each region has its own Remembered Set (RSet). During a 
> collection, all RSets of the Collection Set (CSet) regions need to be 
> scanned. Given that all young regions are always collected together 
> during the same collection, it will be beneficial to actually make all 
> young regions point to the same physical RSet. This will have several 
> advantages:
> 
> - lower overall RSet footprint (better to have one fuller RSet, than 
> many mostly empty ones; this will also quickly eliminate duplicate RSet 
> entries across all young regions)

That is true, but in my experience with a similar system (not on G1),
the savings are not that big - of course there are, but not huge iirc.

In that direction one could experiment with another idea, a "sticky young
gen" value in the card table, that can be used to also filter out references
between old gen regions originating from these cards (they will be
scanned anyway) too. I think I had that implemented somewhere/sometime, it
should work.

I.e. during refinement, if we find a to-young reference, instead of
clearing the card at the end, mark it with some "sticky young value" so
that the barrier will not redirty that card. Maybe also just leaving it
dirty may be sufficient.

Note that this idea dirties the card table, so requires clearing.

> - more efficient RSet scanning (the number of RSets scanned during a 
> collection will not depend on the number of young regions in the CSet)

The main contributing factor for rset scanning is the number of entries,
not the number of rsets. You can always improve the distribution across
threads if that is what is your concern.

(G1 typically takes the easy approach by distributing work on a per-region
basis. That's more an artifact of a quick solution than a real solution.
There are already a few CRs and prototypes that improve this situation).

> Minus any issues and assumptions that will need to be ironed out in the 
> RSet code, doing this should be straightforward. During a GC, a new RSet 

I do not see huge issues with that either.

> is created and all Survivor regions created during that GC will be made 
> to point to it. Then, as new Eden regions are allocated, they will also 
> point to the same RSet. At the start of the next GC, all Eden / Survivor 
> regions will be pointing to the same RSet. During young GCs, we'll only 
> need to scan that RSet. During mixed GCs, we'll need to scan the young 
> RSet and all the RSets of the old regions. But this will be an 
> improvement over what we do now. We'll also need to decouple the young 
> RSet freeing code from the region reclamation code so that we don't 
> attempt to reclaim the same RSet multiple times.

Another minus, that might be implementation dependent: possibly lots of
contention on a single RSet.
Consider machines with a few hundred threads out there. Not that I think
anyone ever made measurements regarding that with the current RSets on
such a system either...

> (2) G1 RSets: Use a single RSet for old regions of a single CSet
> 
> (This is a follow-up to (1))
> 
> For the same reasons that it's a good idea to use a single Remembered 
> Set (RSet) for all young regions, it'd be also helpful to use a single 
> RSet for all old regions that are added to a Collection Set (CSet). In 
> combination with (1), this will result in only having either one (young 
> GCs) or two (mixed GCs) RSets to scan during GC.
> 
> This will be quite a lot more work to do, compared to (1), given that we 
> only choose the old regions to add to a CSet at the start of a GC. To 
> achieve this we'll have to do a bit more advance preparation: choose 
> groups of old regions to collect together (maybe do that concurrently to 
> minimize STW work) and combine their RSets. So, at each mixed GC, we'll 
> pick an old region group, and its associated RSet, and collect them 
> along with the young regions.

A similar effect could be achieved by increasing region size, it's not
as flexible though.

It, and all follow-up ideas in that direction, assume that the regions that
we want to pick and their order can be precalculated. That may be a good
assumption now, as G1 tries to do the cleanup asap, it may not hold in the
future.

Eg. there are already minor complaints that the mixed gc phase impacts 
throughput too much during that time, so spreading this phase out may be
considered in the future (e.g. just reclaim enough to keep up with promotion
rate and then a little).

Also, we might need the flexibility to e.g. clear up particular regions to
allow allocation of large objects without full GC. That's a major concern
in some cases, you may have noticed that there are already some CRs trying
to improve G1 behavior with many large objects.

> (3) G1 RSets: Use one RSet for each GC
> 
> (follow-up to (1) and (2))
> 
> If we manage to have one young RSet and one old RSet for each mixed GC, 
> it'd be straightforward to combine the two to always end up having to 
> scan a single RSet at every GC.

>From a contention point of view this seems even worse than (1), but a
straightforward extension.

> (2) proposes to create old region groups, each with its own combined 
> RSet. We could take this further and grab this combined old RSet at the 
> start of a GC and use that as the young RSet (for Survivor regions 
> allocated during that GC and Eden regions allocated after that GC). So, 
> at the start of the next GC, all regions in the CSet (Eden, Survivors, 
> old region group) will have the same RSet.

Afair the order in the collection candidate list is only ever evaluated
once after mark end at the moment. You could create the groups at that
opportunity already.

Also probably needs retuning of the different thresholds for switching
between the coarseness levels of the remembered set.

> (4) G1 RSets: Filter old RSets before a collection
> 
> Old RSets tend to grow quite large and sometimes they contain a large 
> amount of stale entries that need to be scanned for nothing during a GC. 
> After concurrent marking has completed, but maybe before each mixed GC, 
> we could introduce a concurrent process that scans RSets of regions that 
> will be collected and filters out any stale entries (by scanning the 
> cards that the RSet points to and removing any cards that don't point to 
> the region any more). This should reduce the amount of RSet scanning 
> work that's done during mixed GCs.
> 
> The easiest way to do this, to avoid any nasty races / MT issues, is to 
> give each old region a new RSet (during a safepoint, say), store the old 

This safepoint need not be an extra safepoint, could be piggybacked to
existing ones. That actually sounds quite easy to implement, however,
any suggestions how to find the RSets with the largest amount of stale
entries?

Apart from the terms "tend to grow quite large" and "sometimes they
contain a large amount of stale entries", do you have some numbers from
experience here?

> one somewhere, scan it (concurrently), and add any entries that need to 
> be retained to the new one. At the end, we can just reclaim the old one. 

Also, when checking for existing entries, probably only look into the
new one: if it does not exist, add it, and any duplicates from the old
will have no effect. (Instead of always checking both RSets).

> This way, any new updates to the new RSet will automatically happen 
> through the normal path. Explicitly removing an entry, instead of 
> recreating the RSet as described, will be much harder, given that we'd 
> have to synchronize with other threads potentially trying to add the 
> same entry.
> 
> This can be done in addition to (2). In fact, the two operations could 
> also be combined (i.e. combined old region group's RSets and also filter 
> them at the same time).
> 
> ----------
> 
> (5) G1 RSets: Don't use the card table during RSet scanning to reduce 
> duplication
> 
> Currently, while scanning Remembered Sets (RSets) during a GC we use the 
> card table to mark whether a particular card has been scanned or not 
> (since the same card might appear on multiple RSets). This is bad for a 
> couple of reasons (the first one is the important one, the second one we 
> could address reasonably easily):
> 
> - First, we have to clear the card table at the end of the GC, which 
> requires a non-trivial amount of work.

Do you have numbers here? It definitely is a valid concern (with large
heaps), but in the logs I got so far, this has not been a really big issue
yet because the other phases are so much longer...

A more coarse second-level card table (512k/1M card size?) remembering
approximately where these marks are should already go a long way towards
reducing the overhead?

Do you have some numbers on how far these card marks are spread out? I would
assume that particularly on large heaps, only a fraction of the card table
is ever dirtied per collection.

> - Second, the card marks during RSet scanning are not done atomically 
> (atomics were deemed too expensive). This is safe (rescanning a card is 
> not a correctness issue and it shouldn't happen very often anyway). 
> However, this prevents some subtle optimizations (when one thread comes 
> across a reference on a card that points into the CSet and pushes its 
> location on the stack, it has to recheck that it still points into the 
> CSet when it eventually pops it from the stack, given that another 
> thread might also scan that card and update the reference meanwhile).

:)

> A better solution will be to somehow use the RSet data structure for 
> each thread to claim RSet chunks. It can even be done in a destructive 
> way, given that the RSet will be reclaimed at the end of the GC anyway.

That would be something nice (that is not slower in the end than just
clearing the card table).

> This depends on (3), given we can only ensure that each card is claimed 
> exactly once if it appears on a single RSet.
> 
> ----------
> 
> (6) G1 RSets: Only create old region RSets when they are needed
> 
> Currently, each region has its own RSet from the beginning and that's 
> maintained throughout the region's lifetime. For very long regions, it 
> turns out that long-lived old regions end up with a lot of stale entries 
> in their RSets (as ref fields are updated over time to point to 
> different objects) which make them very bloated. An important 
> observation is that the old region RSets are only really needed during 
> mixed GCs after a concurrent cycle. Sometimes, concurrent cycles happen 
> very infrequently (every few hours, maybe) so we constantly maintain the 
> old region RSets and only take advantage of them very infrequently.
> 
> A better approach would be to eliminate RSets from old regions (say, set 
> the RSet field to NULL for those regions so that any refinement 
> operation will just do nothing for refs that point to such regions) and 
> only recreate them when needed.

Totally agree.

> 
> One approach, which has been done by other GCs, is to recreate the RSets 
> during marking (as the marking phase will visit all live objects).

Adding that to the marking phase is simple.

> Maybe a better approach will be to finish the marking phase, do some 
> quick analysis on the old regions to decide which to collect and which 
> not to collect, and do a sweep phase to populate the RSets of the 
> regions that will be collected. Only when that operation is done, mixed 
> GCs will be allowed.

Do you think sweeping the entire heap is faster than piggy-backing to the
marking, even if done concurrently?

> This is a (better but more involved) alternative to (4) as the recreated 
> RSets will not be very long-lived, so they will hopefully not need 
> further filtering.
> 
> This can be combined with the approach in (2), i.e., find the old region 
> groups we'll collect, give all regions in each group a new RSet, and 
> only then do the sweeping phase to create directly RSets per group 
> instead of per region.
> 
> ----------
> 
> (7) G1 RSets: Redesign for remembered set data structures
> 
> This is somewhat orthogonal to the other G1 RSet issues, but there 
> should be a lot of gains to be made by rethinking / redesigning the RSet 
> data structures. The concept of having multiple RSet granularities is 
> quite interesting. Unfortunately, the implementation is quite 

The idea seems good.

> inefficient. In particular, there are two hash tables (for fine and 
> sparse entries) which is completely wasteful (requires two look-ups and 
> extra footprint for the second table). Also, several operations require 
> locks which affects scalability. Finally, the implementation is too 
> dependent on the region abstraction (i,e., it's baked in the code that 
> each RSet component represents a region-to-region relationship) which 
> makes them very inflexible (RSset components can get very large when 
> very large regions are used).

I think just assigning the same RSet to multiple regions works already.
There is some overhead that will be incurred due to filters assuming a
one-to-one relationship, but that should be minimal, and benign as in the
end it will be coalesced with the existing entry anyway. (Did not really
look at the code).

> The goals of a redesign should be the following:
> 
> - At a high-level, the new structure should just be a card address -> 
> bool map (false -> card not included in the RSet, true -> card included 
> in the RSet).

I am not sure what this means: this suggests an interface, not an implementation.

> - A subdivision of the heap, so that each such heap chunk is represented 
> by a different structure, can still be used. However, chunks should not 
> be tied to regions. I.e., a region might be covered by several chunks 
> (helpful, when regions are large).

Okay.

> - The fine / medium / coarse granularity approach will still be useful 
> to represent each chunk as efficiently as possible. However, there 
> should be a single hash table that will keep track of the entries of all 
> granularities.
> - The RSet structure should have a facility for threads to claim entries 
> to work on them. This is to facilitate (5).

:)

> - Addition operations should be entirely lock-free.
> - RSet iterations should be done using iterators.
> - A facility to efficiently do a union of various RSets will also be 
> helpful (to facilitate (2)).

Thomas