G1 RSet improvement ideas

Bengt Rutisson bengt.rutisson at oracle.com
Mon Feb 3 20:27:41 UTC 2014


Hi Tony and Thomas,

Just a question for my understanding regarding rebuilding RSets. You 
both say it is straightforward to do that during marking. How does that 
work with SATB since that is not guaranteed to visit all edges in the 
object graph?

Thanks,
Bengt



On 2/3/14 6:13 PM, Tony Printezis wrote:
> Thomas,
>
> Thanks for actually reading my brain dump and for the comments. :-) 
> See inline.
>
> On 2/3/14, 4:58 AM, Thomas Schatzl wrote:
>> Hi,
>>
>>    let me have a stab at this long mail with a few half-baked
>> thoughts... ;)
>>
>> On Thu, 2014-01-30 at 12:33 -0500, Tony Printezis wrote:
>>> Hi all,
>>>
>>> I had a recent e-mail exchange with Bengt and Thomas and we touched a
>>> bit on G1 RSet work. RSets can be a major bottleneck in G1 right now,
>>> but I don't believe it's a fundamental issue with the concept. IMHO, the
>>> implementation can be substantially improved to be much more efficient.
>>> I recently created a few issues in our (Twitter) bug system with some
>>> ideas on how to do that. I pasted all the text below (but replaced the
>>> internal issue #s with (1), (2), etc.).
>>>
>>> Let me know if you have any ideas, thoughts, etc. on this.
>>>
>>> Tony
>>>
>>> ----------
>>>
>>> (1) Use a single RSet for all young regions
>>>
>>> In G1 each region has its own Remembered Set (RSet). During a
>>> collection, all RSets of the Collection Set (CSet) regions need to be
>>> scanned. Given that all young regions are always collected together
>>> during the same collection, it will be beneficial to actually make all
>>> young regions point to the same physical RSet. This will have several
>>> advantages:
>>>
>>> - lower overall RSet footprint (better to have one fuller RSet than
>>> many mostly empty ones; this will also quickly eliminate duplicate RSet
>>> entries across all young regions)
>> That is true, but in my experience with a similar system (not on G1),
>> the savings are not that big - of course there are some, but not huge IIRC.
>
> A quick study to measure this would be interesting. I would be really 
> shocked if we don't see quite a lot of duplication, I have to say 
> (especially when a very large eden is used).
>
>> In that direction one could experiment with another idea: a "sticky
>> young gen" value in the card table that can also be used to filter out
>> references between old gen regions originating from these cards (they
>> will be scanned anyway). I think I had that implemented
>> somewhere/sometime; it should work.
>>
>> I.e. during refinement, if we find a to-young reference, instead of
>> clearing the card at the end, mark it with some "sticky young value" so
>> that the barrier will not redirty that card. Maybe just leaving it
>> dirty would be sufficient.
>>
>> Note that this idea dirties the card table, so requires clearing.
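>>
>> To sketch the refinement end of it (made-up names, not actual HotSpot
>> code):
>>
>> const unsigned char clean_card        = 0xff;
>> const unsigned char sticky_young_card = 0x01;  // "known to point into young"
>>
>> void after_refining_card(unsigned char* card, bool saw_to_young_ref) {
>>   if (saw_to_young_ref) {
>>     // Leave a sticky mark instead of cleaning the card, so the post-write
>>     // barrier's "already marked" filter keeps it from being re-dirtied
>>     // and re-enqueued for refinement.
>>     *card = sticky_young_card;
>>   } else {
>>     *card = clean_card;   // normal path; this is what later needs clearing
>>   }
>> }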
>
> Note that G1 does try to avoid "dirty / refine / clean / re-dirty / 
> re-refine / re-clean / etc." patterns on cards (whether they point to 
> the young gen or not) using the hot card cache (unless you've recently 
> removed it! haven't looked at that code for a while... but it still 
> seems to be there).
>
> FWIW: apart from avoiding card duplication, using a single RSet for 
> all young regions will also have additional advantages: lower 
> footprint (only one table instead of one per region), faster to 
> reclaim, fewer malloc / frees, etc.
>
>>> - more efficient RSet scanning (the number of RSets scanned during a
>>> collection will not depend on the number of young regions in the CSet)
>> The main contributing factor for rset scanning is the number of entries,
>> not the number of rsets.
>
> I don't completely agree with this, I have to say. Let's assume you 
> have some card duplication across all young RSets (for the sake of 
> argument). Having each thread visit each RSet (imagine you have a few 
> hundred), check whether it's been claimed, try to claim it, then try 
> to claim cards, and contend on the card table is not cheap. 
> Just looking at a single RSet, seeing quickly that its entries have 
> been claimed by other GC threads, and going on to the next phase could 
> be much more efficient.
>
>>   You can always improve the distribution across
>> threads if that is your concern.
>>
>> (G1 typically takes the easy approach by distributing work on a 
>> per-region
>> basis. That's more an artifact of a quick solution than a real solution.
>
> Are you talking about distributing RSet scanning work? Or in general?
>
>> There are already a few CRs and prototypes that improve this situation).
>
> Can you point me to a couple?
>
>>> Minus any issues and assumptions that will need to be ironed out in the
>>> RSet code, doing this should be straightforward. During a GC, a new 
>>> RSet
>> I do not see huge issues with that either.
>
> (see towards the end of the e-mail for a bit more info on this)
>
>>> is created and all Survivor regions created during that GC will be made
>>> to point to it. Then, as new Eden regions are allocated, they will also
>>> point to the same RSet. At the start of the next GC, all Eden / 
>>> Survivor
>>> regions will be pointing to the same RSet. During young GCs, we'll only
>>> need to scan that RSet. During mixed GCs, we'll need to scan the young
>>> RSet and all the RSets of the old regions. But this will be an
>>> improvement over what we do now. We'll also need to decouple the young
>>> RSet freeing code from the region reclamation code so that we don't
>>> attempt to reclaim the same RSet multiple times.
>> Another minus, that might be implementation dependent: possibly lots of
>> contention on a single RSet.
>
> That's definitely a good point. On the other hand, threads trying to 
> claim individual RSets should not be too different to threads trying 
> to claim sub-components of a single RSet. Also, having one RSet per 
> region is an arbitrary way to cut down contention. If it's possible to 
> have a single RSet for all the young regions, it will be easy to 
> expand that and maybe have N RSets (again, each covering all young 
> regions) and easily map individual threads (whether they are writing 
> or reading) to one of them. Here, N can be decided based on the amount 
> of concurrency available. One per region is completely arbitrary in that 
> respect.
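>
> A minimal sketch of what I mean (hypothetical names; N = 1 degenerates 
> to the single shared young RSet from (1)):
>
> struct YoungRSet;                        // stands in for the real RSet type
>
> class SharedYoungRSets {
>   static const unsigned N = 8;           // sized to the available concurrency
>   YoungRSet* _rsets[N];
> public:
>   // Every RSet covers all young regions; a writer or scanner just maps
>   // itself to one of them, instead of one-RSet-per-region.
>   YoungRSet* rset_for(unsigned thread_id) { return _rsets[thread_id % N]; }
> };
>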
>> Consider machines with a few hundred threads out there.
>
> (just to clarify) Do you mean:
>
> - Hundreds of HW threads?
> - Hundreds of Java threads?
> - Hundreds of GC threads?
>
>> Not that I think
>> anyone ever made measurements regarding that with the current RSets on
>> such a system either...
>
> Not as far as I know...
>
>>> (2) G1 RSets: Use a single RSet for old regions of a single CSet
>>>
>>> (This is a follow-up to (1))
>>>
>>> For the same reasons that it's a good idea to use a single Remembered
>>> Set (RSet) for all young regions, it'd be also helpful to use a single
>>> RSet for all old regions that are added to a Collection Set (CSet). In
>>> combination with (1), this will result in only having either one (young
>>> GCs) or two (mixed GCs) RSets to scan during GC.
>>>
>>> This will be quite a lot more work to do, compared to (1), given 
>>> that we
>>> only choose the old regions to add to a CSet at the start of a GC. To
>>> achieve this we'll have to do a bit more advance preparation: choose
>>> groups of old regions to collect together (maybe do that 
>>> concurrently to
>>> minimize STW work) and combine their RSets. So, at each mixed GC, we'll
>>> pick an old region group, and its associated RSet, and collect them
>>> along with the young regions.
>> A similar effect could be achieved by increasing region size, though
>> it's not as flexible.
>
> That's kind of a big hammer approach. And it might make other parts of 
> G1 less efficient (you might find fewer completely empty regions at 
> the end of a marking cycle, which will put more pressure on the mixed 
> GCs).
>
>> It, and all follow-up ideas in that direction, assume that the regions
>> we want to pick, and their order, can be precalculated. That may be a
>> good assumption now, as G1 tries to do the cleanup ASAP, but it may not
>> hold in the future.
>>
>> E.g. there are already minor complaints that the mixed GC phase impacts
>> throughput too much during that time, so spreading this phase out may be
>> considered in the future (e.g. just reclaim enough to keep up with the
>> promotion rate, and then a little more).
>
> Two things:
>
> a) The reason I think it's a good idea to do the mixed GCs 
> back-to-back is to clean up the heap as quickly as possible and give 
> the GC and the app breathing room sooner. Yes, I had also 
> thought of spreading them out more. However, my intuition is that, 
> given you know there's garbage in the heap, you might as well try to 
> reclaim it ASAP.
>
> b) Putting together groups of regions that share an RSet and can be 
> collected together is not incompatible with spreading out the mixed 
> GCs. You just have to maintain the RSet for that group for longer. 
> However, if you do that, you'll end up with more and more stale 
> entries in it and you might have to do some extra filtering before you 
> decide to do the collection.
>
>> Also, we might need the flexibility to e.g. clear up particular 
>> regions to
>> allow allocation of large objects without full GC. That's a major 
>> concern
>> in some cases; you may have noticed that there are already some CRs 
>> trying
>> to improve G1 behavior with many large objects.
>
> (Actually, I haven't looked at all the open GC / G1 CRs)
>
> Again, this is a very good observation. However, maintaining complete 
> RSets for all regions in order to deal with an infrequent edge case 
> sounds like the wrong trade-off. Maybe you can do something 
> in-between: Always keep very coarse RSets per region, i.e., which 
> region, or whatever subdivision you decide on, has references into 
> that region (but not at the card level; something that you can 
> do with a fixed data structure like a bitmap / byte map). Then, if you 
> unexpectedly want to move some regions to make space for a new large 
> one, you can still move a few without having to scan the entire heap, 
> but maybe having to scan more than individual cards.
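>
> Roughly something like this (sketch only; names and sizes are invented):
>
> class CoarseInRSet {
>   static const unsigned MaxRegions = 2048;   // assumed upper bound on regions
>   // one byte per possible source region: set if that region may contain
>   // references into this region (no card-level detail kept at all)
>   unsigned char _from_region[MaxRegions];
> public:
>   void add_from(unsigned from_idx)            { _from_region[from_idx] = 1; }
>   bool may_contain_from(unsigned idx) const   { return _from_region[idx] != 0; }
>   // To move this region unexpectedly, scan only the regions whose entry is
>   // set; it only degrades to a full old-gen scan if every entry is set.
> };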
>
>>> (3) G1 RSets: Use one RSet for each GC
>>>
>>> (follow-up to (1) and (2))
>>>
>>> If we manage to have one young RSet and one old RSet for each mixed GC,
>>> it'd be straightforward to combine the two to always end up having to
>>> scan a single RSet at every GC.
>>  From a contention point of view this seems even worse than (1), but a
>> straightforward extension.
>
> I'd agree. But, I still don't think contention on a single RSet has to 
> be much worse than having to scan / claim a bunch of RSets. (And see 
> note about having multiple RSets to match the available concurrency.)
>
>>> (2) proposes to create old region groups, each with its own combined
>>> RSet. We could take this further and grab this combined old RSet at the
>>> start of a GC and use that as the young RSet (for Survivor regions
>>> allocated during that GC and Eden regions allocated after that GC). So,
>>> at the start of the next GC, all regions in the CSet (Eden, Survivors,
>>> old region group) will have the same RSet.
>> AFAIR, the order in the collection candidate list is only ever evaluated
>> once after mark end at the moment. You could create the groups at that
>> opportunity already.
>
> Absolutely. I was thinking that maybe after a marking cycle has 
> completed, a concurrent thread (probably, the marking thread itself) 
> can do a lot of the mixed GC preparation work (examples: sort the old 
> regions, find which ones should be collected, put together the region 
> groups, combine / scrub the RSets, etc.; whatever needs to be done). 
> Then, as each region group is prepared, it is made available on a 
> queue. Each evacuation pause bases the mixed / young GC decision on 
> whether a new group has been made available. This way we move the 
> expensive preparation work to the concurrent domain and the "when to 
> do a mixed GC" decision becomes very straightforward.
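>
> As a rough sketch of that handoff (hypothetical types; synchronization 
> and details elided):
>
> #include <queue>
>
> struct OldRegionGroup;   // regions to collect together + their combined RSet
>
> class PreparedGroups {
>   std::queue<OldRegionGroup*> _groups;    // needs real synchronization
> public:
>   void publish(OldRegionGroup* g) { _groups.push(g); }   // concurrent prep thread
>   OldRegionGroup* try_claim() {                           // at the start of a pause
>     if (_groups.empty()) return NULL;     // nothing ready -> plain young GC
>     OldRegionGroup* g = _groups.front();
>     _groups.pop();
>     return g;                             // group ready -> do a mixed GC with it
>   }
> };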
>
>> Also probably needs retuning of the different thresholds for switching
>> between the coarseness levels of the remembered set.
>>
>>> (4) G1 RSets: Filter old RSets before a collection
>>>
>>> Old RSets tend to grow quite large and sometimes they contain a large
>>> amount of stale entries that need to be scanned for nothing during a 
>>> GC.
>>> After concurrent marking has completed, but maybe before each mixed GC,
>>> we could introduce a concurrent process that scans RSets of regions 
>>> that
>>> will be collected and filters out any stale entries (by scanning the
>>> cards that the RSet points to and removing any cards that don't 
>>> point to
>>> the region any more). This should reduce the amount of RSet scanning
>>> work that's done during mixed GCs.
>>>
>>> The easiest way to do this, to avoid any nasty races / MT issues, is to
>>> give each old region a new RSet (during a safepoint, say), store the 
>>> old
>> This safepoint need not be an extra safepoint, could be piggybacked to
>> existing ones.
>
> I didn't say anything about having an extra safepoint. I agree, this 
> can be piggy-backed on an existing safepoint.
>
>> That actually sounds quite easy to implement, however,
>> any suggestions how to find the RSets with the largest amount of stale
>> entries?
>
> Nope. I don't believe it is very easy to decide that. I'd just scrub all 
> RSets for the old regions to be collected.
>
>> Apart from the terms "tend to grow quite large" and "sometimes they
>> contain a large amount of stale entries", do you have some numbers from
>> experience here?
>
> I had done some quick studies, but I didn't keep the numbers after I 
> left Oracle. But I had definitely seen RSets for particular regions 
> keep growing in size when they really shouldn't.
>
>>> one somewhere, scan it (concurrently), and add any entries that need to
>>> be retained to the new one. At the end, we can just reclaim the old 
>>> one.
>> Also, when checking for existing entries, probably only look into the
>> new one:
>
> Of course. The RSet of the region will be the new one. The old one 
> will be treated similarly to a bunch of dirty cards (as Igor said in a 
> separate e-mail).
>
>> if it does not exist, add it, and any duplicates from the old
>> will have no effect. (Instead of always checking both RSets).
>
> You should never have to check both RSets.
>
>>> This way, any new updates to the new RSet will automatically happen
>>> through the normal path. Explicitly removing an entry, instead of
>>> recreating the RSet as described, will be much harder, given that we'd
>>> have to synchronize with other threads potentially trying to add the
>>> same entry.
>>>
>>> This can be done in addition to (2). In fact, the two operations could
>>> also be combined (i.e. combined old region group's RSets and also 
>>> filter
>>> them at the same time).
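>>>
>>> As a sketch of the scrub itself (stand-in types and an assumed helper;
>>> the real RSet API will differ):
>>>
>>> #include <set>
>>> #include <cstddef>
>>>
>>> struct CardSetStandIn {                        // stands in for a region's RSet
>>>   std::set<size_t> cards;                      // card indices recorded so far
>>>   void add_card(size_t c) { cards.insert(c); }
>>> };
>>>
>>> // Assumed helper: rescan the 512-byte card and report whether it still
>>> // contains a reference into the region being scrubbed.
>>> bool card_still_points_into(size_t card, const void* region);
>>>
>>> // old_rs was detached from the region at a safepoint; new_rs is what the
>>> // region now points to, so refinement keeps adding to it concurrently.
>>> void scrub(CardSetStandIn* old_rs, CardSetStandIn* new_rs, const void* region) {
>>>   for (size_t card : old_rs->cards) {
>>>     if (card_still_points_into(card, region)) {   // still points into the region
>>>       new_rs->add_card(card);                     // duplicates just coalesce
>>>     }
>>>   }
>>>   delete old_rs;                                  // reclaim the stale structure
>>> }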
>>>
>>> ----------
>>>
>>> (5) G1 RSets: Don't use the card table during RSet scanning to reduce
>>> duplication
>>>
>>> Currently, while scanning Remembered Sets (RSets) during a GC we use 
>>> the
>>> card table to mark whether a particular card has been scanned or not
>>> (since the same card might appear on multiple RSets). This is bad for a
>>> couple of reasons (the first one is the important one, the second 
>>> one we
>>> could address reasonably easily):
>>>
>>> - First, we have to clear the card table at the end of the GC, which
>>> requires a non-trivial amount of work.
>> Do you have numbers here? It definitely is a valid concern (with large
>> heaps), but in the logs I got so far, this has not been a really big 
>> issue
>> yet because the other phases are so much longer...
>
> Last time I checked, G1 kept track of which regions would potentially 
> have dirty cards (IIRC: the collection set regions and any regions 
> that appeared on the RSet of the collection set regions; I could be 
> wrong). And it only clears the CT for those regions. The Clean CT 
> phase is on the GC log. It might not be very long, but if you can 
> eliminate it, it can only be a good thing. Also, there might be more 
> regions whose CT needs to be cleared after mixed GCs (i.e., don't base 
> your decision only on data from young GCs). Finally, being able to 
> remove the code that keeps track of which regions might contain dirty 
> cards post-GC would also be a good thing.
>
>> A more coarse second-level card table (512k/1M card size?) remembering
>> approximately where these marks are should already go a long way towards
>> reducing the overhead?
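>>
>> Something along these lines, maybe (sketch with invented names; assuming
>> 512-byte cards and 1M second-level blocks):
>>
>> #include <cstring>   // memset
>> #include <cstddef>   // size_t
>>
>> const size_t CardShift     = 9;                              // 512-byte cards
>> const size_t BlockShift    = 20;                             // 1M blocks
>> const size_t CardsPerBlock = 1 << (BlockShift - CardShift);  // 2048
>> const size_t MaxBlocks     = 16 * 1024;                      // e.g. a 16G heap
>> unsigned char _block_has_marks[MaxBlocks];   // one byte per 1M block of heap
>>
>> void note_card_marked(size_t card_index) {   // cheap extra store per card mark
>>   _block_has_marks[card_index / CardsPerBlock] = 1;
>> }
>>
>> void clear_card_table(unsigned char* ct, size_t num_blocks) {
>>   for (size_t b = 0; b < num_blocks; b++) {
>>     if (_block_has_marks[b]) {               // skip blocks never touched
>>       std::memset(ct + b * CardsPerBlock, 0xff /* clean */, CardsPerBlock);
>>       _block_has_marks[b] = 0;
>>     }
>>   }
>> }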
>
> As I said earlier, only part of the CT is cleared.
>
>> Do you have some numbers on how far these card marks are spread out? 
>> I would
>> assume that particularly on large heaps, only a fraction of the card 
>> table
>> is ever dirtied per collection.
>
> For young GCs, I agree there probably won't be too many cards that get 
> dirty during a GC. I'm sure there will be more during mixed GCs.
>
>>> - Second, the card marks during RSet scanning are not done atomically
>>> (atomics were deemed too expensive). This is safe (rescanning a card is
>>> not a correctness issue and it shouldn't happen very often anyway).
>>> However, this prevents some subtle optimizations (when one thread comes
>>> across a reference on a card that points into the CSet and pushes its
>>> location on the stack, it has to recheck that it still points into the
>>> CSet when it eventually pops it from the stack, given that another
>>> thread might also scan that card and update the reference meanwhile).
>> :)
>>
>>> A better solution will be to somehow use the RSet data structure for
>>> each thread to claim RSet chunks. It can even be done in a destructive
>>> way, given that the RSet will be reclaimed at the end of the GC anyway.
>> That would be something nice (that is not slower in the end than just
>> clearing the card table).
>
> Well, after a collection the RSet(s) of the CSet regions is/are 
> reclaimed given that the CSet regions are reclaimed. So, if you have a 
> couple of extra fields on the RSet to help with the claiming algorithm 
> (or on each RSet component, etc.), it shouldn't have any extra overhead at 
> the end.
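>
> The claiming itself could be as simple as this (sketch; the fields are 
> hypothetical and would live in the RSet or each of its components):
>
> #include <atomic>
> #include <algorithm>
> #include <cstddef>
>
> struct RSetClaimer {
>   static const size_t ChunkSize = 256;   // entries handed out per claim
>   std::atomic<size_t> _cursor;           // lives in the RSet, dies with it
>   size_t _num_entries;
>
>   // Each worker grabs the next unclaimed chunk of entries; no card table
>   // involved, so there is nothing to clean up afterwards.
>   bool claim(size_t& begin, size_t& end) {
>     begin = _cursor.fetch_add(ChunkSize);
>     if (begin >= _num_entries) return false;
>     end = std::min(begin + ChunkSize, _num_entries);
>     return true;
>   }
> };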
>
>>> This depends on (3), given we can only ensure that each card is claimed
>>> exactly once if it appears on a single RSet.
>>>
>>> ----------
>>>
>>> (6) G1 RSets: Only create old region RSets when they are needed
>>>
>>> Currently, each region has its own RSet from the beginning and that's
>>> maintained throughout the region's lifetime. It turns out that
>>> long-lived old regions end up with a lot of stale entries in their
>>> RSets (as ref fields are updated over time to point to
>>> different objects) which make them very bloated. An important
>>> observation is that the old region RSets are only really needed during
>>> mixed GCs after a concurrent cycle. Sometimes, concurrent cycles happen
>>> very infrequently (every few hours, maybe) so we constantly maintain 
>>> the
>>> old region RSets and only take advantage of them very infrequently.
>>>
>>> A better approach would be to eliminate RSets from old regions (say, 
>>> set
>>> the RSet field to NULL for those regions so that any refinement
>>> operation will just do nothing for refs that point to such regions) and
>>> only recreate them when needed.
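>>>
>>> A sketch of what the refinement path would then look like (stand-in
>>> types, not the real accessors):
>>>
>>> struct RSetStandIn   { void add_reference(void* from); };  // stands in for the RSet
>>> struct RegionStandIn { RSetStandIn* rem_set(); };          // stands in for HeapRegion
>>>
>>> void refine_reference(void* from, RegionStandIn* to_region) {
>>>   RSetStandIn* rs = to_region->rem_set();
>>>   if (rs == NULL) {
>>>     return;                  // old region whose RSet is not being maintained
>>>   }
>>>   rs->add_reference(from);   // regions that do have an RSet behave as today
>>> }
>>>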
>> Totally agree.
>>> One approach, which has been done by other GCs, is to recreate the 
>>> RSets
>>> during marking (as the marking phase will visit all live objects).
>> Adding that to the marking phase is simple.
>>> Maybe a better approach will be to finish the marking phase, do some
>>> quick analysis on the old regions to decide which to collect and which
>>> not to collect, and do a sweep phase to populate the RSets of the
>>> regions that will be collected. Only when that operation is done, mixed
>>> GCs will be allowed.
>> Do you think sweeping the entire heap is faster than piggy-backing on
>> the marking, even if done concurrently?
>
> (clarification: not the entire heap, but only old regions) Yeah, 
> interesting trade-off here. I can totally see the attraction of 
> building the RSets piggy-backed on marking. However, you might end up 
> building several RSets for regions that you won't end up collecting 
> (because they're very dense with live objects); so that'd be wasted 
> work. I kinda like the approach of doing it as a separate step as 
> it allows you to decide exactly which RSets to build. Also, maybe the 
> idea I mentioned earlier about always maintaining very coarse-grain 
> RSets per region, could help in reducing how much of the old gen you'd 
> have to sweep (this can of course degrade to scanning the entire old 
> gen after all).
>
>>> This is a (better but more involved) alternative to (4) as the 
>>> recreated
>>> RSets will not be very long-lived, so they will hopefully not need
>>> further filtering.
>>>
>>> This can be combined with the approach in (2), i.e., find the old 
>>> region
>>> groups we'll collect, give all regions in each group a new RSet, and
>>> only then do the sweeping phase to create directly RSets per group
>>> instead of per region.
>>>
>>> ----------
>>>
>>> (7) G1 RSets: Redesign for remembered set data structures
>>>
>>> This is somewhat orthogonal to the other G1 RSet issues, but there
>>> should be a lot of gains to be made by rethinking / redesigning the 
>>> RSet
>>> data structures. The concept of having multiple RSet granularities is
>>> quite interesting. Unfortunately, the implementation is quite
>> The idea seems good.
>>
>>> inefficient. In particular, there are two hash tables (for fine and
>>> sparse entries) which is completely wasteful (requires two look-ups and
>>> extra footprint for the second table). Also, several operations require
>>> locks which affects scalability. Finally, the implementation is too
>>> dependent on the region abstraction (i.e., it's baked into the code that
>>> each RSet component represents a region-to-region relationship) which
>>> makes them very inflexible (RSet components can get very large when
>>> very large regions are used).
>> I think just assigning the same RSet to multiple regions works already.
>
> Have you changed this recently?
>
>> There is some overhead that will be incurred due to filters assuming a
>> one-to-one relationship, but that should be minimal and benign, as in 
>> the
>> end it will be coalesced with the existing entry anyway. (Did not really
>> look at the code).
>
> IIRC, the way the RSet mapped entries to card addresses was to get the 
> region address from the HeapRegion data structure.
>
>>> The goals of a redesign should be the following:
>>>
>>> - At a high-level, the new structure should just be a card address ->
>>> bool map (false -> card not included in the RSet, true -> card included
>>> in the RSet).
>> I am not sure what this means: this suggests an interface, not an 
>> implementation.
>
> What I tried to say is that the implementation shouldn't deal with the 
> HeapRegion data structure at all:
>
> void OtherRegionsTable::add_reference(OopOrNarrowOopStar from, int tid) {
>   ...
>   // Note that this may be a continued H region.
>   HeapRegion* from_hr = _g1h->heap_region_containing_raw(from);
>   RegionIdx_t from_hrs_ind = (RegionIdx_t) from_hr->hrs_index();
>   ...
> }
>
> PerRegionTable*
> OtherRegionsTable::find_region_table(size_t ind, HeapRegion* hr) const {
>   assert(0 <= ind && ind < _max_fine_entries, "Preconditions.");
>   PerRegionTable* prt = _fine_grain_regions[ind];
>   while (prt != NULL && prt->hr() != hr) {
>     prt = prt->collision_list_next();
>   }
>   // Loop postcondition is the method postcondition.
>   return prt;
> }
>
> etc.
>
> There are many problems with this: First, there's a complete lack of 
> flexibility if we want to use one RSet for many regions (or if we want 
> to change the granularity of the RSet's components). Second, there 
> could be stale entries on the RSet that point to free regions that are 
> being re-allocated concurrently with the refinement code that's 
> actually using hr() (we had some very horrible races in this code). 
> This is why I said the RSet should just be a card address -> bool map 
> and know nothing about heap regions.
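>
> Something with roughly this shape (just a sketch of the interface, not 
> a design):
>
> class CardSet {            // a concurrent card-address -> bool map, nothing more
> public:
>   // Record the card containing 'from'; lock-free, a no-op if already present.
>   virtual void add_card(const void* from) = 0;
>   // True iff the card containing 'addr' has been recorded.
>   virtual bool contains_card(const void* addr) const = 0;
>   virtual ~CardSet() {}
>   // Note: no HeapRegion anywhere in the interface; which regions (and how
>   // many) the set covers is the caller's business, not the data structure's.
> };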
>
> Tony
>
>>> - A subdivision of the heap, so that each such heap chunk is 
>>> represented
>>> by a different structure, can still be used. However, chunks should not
>>> be tied to regions. I.e., a region might be covered by several chunks
>>> (helpful, when regions are large).
>> Okay.
>>
>>> - The fine / medium / coarse granularity approach will still be useful
>>> to represent each chunk as efficiently as possible. However, there
>>> should be a single hash table that will keep track of the entries of 
>>> all
>>> granularities.
>>> - The RSet structure should have a facility for threads to claim 
>>> entries
>>> to work on them. This is to facilitate (5).
>> :)
>>
>>> - Addition operations should be entirely lock-free.
>>> - RSet iterations should be done using iterators.
>>> - A facility to efficiently do a union of various RSets will also be
>>> helpful (to facilitate (2)).
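>>>
>>> For the claiming, iteration and union facilities, that might look
>>> roughly like this (sketch; names and signatures are illustrative only):
>>>
>>> #include <cstddef>
>>>
>>> class RSetFacilities {
>>> public:
>>>   // (5): workers destructively claim chunks of recorded cards during GC,
>>>   // so no card-table marks (and no Clean CT phase) are needed.
>>>   struct Chunk { const void* const* cards; size_t len; };
>>>   virtual bool claim_chunk(Chunk& out) = 0;
>>>
>>>   // Iteration without exposing the fine / medium / coarse representation.
>>>   virtual void iterate_cards(void (*cl)(const void* card)) = 0;
>>>
>>>   // (2): cheap union when combining the RSets of an old region group.
>>>   virtual void merge_from(const RSetFacilities& other) = 0;
>>>
>>>   virtual ~RSetFacilities() {}
>>> };
>>>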
>> Thomas
>>
>>
>



