G1 RSet improvement ideas

Tue Feb 4 01:41:27 UTC 2014

Hi Bengt,

Very well spotted! :-) You're completely right that SATB marking will 
not visit all the edges in the object graph. However, in G1, the extra 
post-barrier should ensure that all the edges are visited.

Tony

On 2/3/14, 3:27 PM, Bengt Rutisson wrote:
>
> Hi Tony and Thomas,
>
> Just a question for my understanding regarding rebuilding RSets. You 
> both say it is straight forward to do that during marking. How does 
> that work with SATB since that is not guaranteed to visit all edges in 
> the object graph?
>
> Thanks,
> Bengt
>
>
>
> On 2/3/14 6:13 PM, Tony Printezis wrote:
>> Thomas,
>>
>> Thanks for actually reading my brain dump and for the comments. :-) 
>> See inline.
>>
>> On 2/3/14, 4:58 AM, Thomas Schatzl wrote:
>>> Hi,
>>>
>>>    let me have a stab at this long mail with a few half-baked
>>> thoughts... ;)
>>>
>>> On Thu, 2014-01-30 at 12:33 -0500, Tony Printezis wrote:
>>>> Hi all,
>>>>
>>>> I had a recent e-mail exchange with Bengt and Thomas and we touched a
>>>> bit on G1 RSet work. RSets can be a major bottleneck in G1 right now,
>>>> but I don't believe it's fundamental issue with the concept. IMHO, the
>>>> implementation can be substantially improved to be much more 
>>>> efficient.
>>>> I recently created a few issues in our (Twitter) bug system with some
>>>> ideas on how to do that. I pasted all the text below (but replaced the
>>>> internal issue #s with (1), (2), etc.).
>>>>
>>>> Let me know if you have any ideas, thoughts, etc. on this.
>>>>
>>>> Tony
>>>>
>>>> ----------
>>>>
>>>> (1) Use a single RSet for all young regions
>>>>
>>>> In G1 each region has its own Remembered Set (RSet). During a
>>>> collection, all RSets of the Collection Set (CSet) regions need to be
>>>> scanned. Given that all young regions are always collected together
>>>> during the same collection, it will be beneficial to actually make all
>>>> young regions point to the same physical RSet. This will have several
>>>> advantages:
>>>>
>>>> - lower overall RSet footprint (better to have one fuller RSet, than
>>>> many mostly empty ones; this will also quickly eliminate duplicate 
>>>> RSet
>>>> entries across all young regions)
>>> That is true, but in my experience with a similar system (not on G1),
>>> the savings are not that big - of course there are, but not huge iirc.
>>
>> A quick study to measure this would be interesting. I would be really 
>> shocked if we don't see quite a lot of duplication, I have to say 
>> (especially when a very large eden is used).
>>
>>> In that direction one could experiment with another idea, a "sticky 
>>> young
>>> gen" value in the card table, that can be used to also filter out 
>>> references
>>> between old gen regions originating from these cards (they will be
>>> scanned anyway) too. I think I had that implemented 
>>> somewhere/sometime, it
>>> should work.
>>>
>>> I.e. during refinement, if we find a to-young reference, instead of
>>> clearing the card at the end, mark it with some "sticky young value" so
>>> that the barrier will not redirty that card. Maybe also just leaving it
>>> dirty may be sufficient.
>>>
>>> Note that this idea dirties the card table, so requires clearing.
>>
>> Note that G1 does try to avoid "dirty / refine / clean / re-dirty / 
>> re-refine / re-clean / etc." patterns on cards (whether they point to 
>> the young gen or not) using the hot card cache (unless you've 
>> recently removed it! haven't looked at that code for a while... but 
>> it still seems to be there).
>>
>> FWIW: apart from avoiding card duplication, using a single RSet for 
>> all young regions will also have additional advantages: lower 
>> footprint (only one table instead of one per region), faster to 
>> reclaim, fewer malloc / frees, etc.
>>
>>>> - more efficient RSet scanning (the number of RSets scanned during a
>>>> collection will not depend on the number of young regions in the CSet)
>>> The main contributing factor for rset scanning is the number of 
>>> entries,
>>> not the number of rsets.
>>
>> I don't completely agree with this I have to say. Let's assume you 
>> have some card duplication across all young RSets (for the sake of 
>> argument). Having each thread having to visit each RSet (imagine you 
>> have a few 100s), check whether it's been claimed, try to claim it, 
>> then try to claim cards and contend on the card table is not cheap. 
>> Just looking at a single RSet, seeing quickly that its entries have 
>> been claimed by other GC threads, and going on to the next phase 
>> could be much more efficient.
>>
>>>   You can always improve the distribution across
>>> threads if that is what is your concern.
>>>
>>> (G1 typically takes the easy approach by distributing work on a 
>>> per-region
>>> basis. That's more an artifact of a quick solution than a real 
>>> solution.
>>
>> Are you talking about distributing RSet scanning work? Or in general?
>>
>>> There are already a few CRs and prototypes that improve this 
>>> situation).
>>
>> Can you point me to a couple?
>>
>>>> Minus any issues and assumptions that will need to be ironed out in 
>>>> the
>>>> RSet code, doing this should be straightforward. During a GC, a new 
>>>> RSet
>>> I do not see huge issues with that either.
>>
>> (see towards the e-mail for a bit more info on this)
>>
>>>> is created and all Survivor regions created during that GC will be 
>>>> made
>>>> to point to it. Then, as new Eden regions are allocated, they will 
>>>> also
>>>> point to the same RSet. At the start of the next GC, all Eden / 
>>>> Survivor
>>>> regions will be pointing to the same RSet. During young GCs, we'll 
>>>> only
>>>> need to scan that RSet. During mixed GCs, we'll need to scan the young
>>>> RSet and all the RSets of the old regions. But this will be an
>>>> improvement over what we do now. We'll also need to decouple the young
>>>> RSet freeing code from the region reclamation code so that we don't
>>>> attempt to reclaim the same RSet multiple times.
>>> Another minus, that might be implementation dependent: possibly lots of
>>> contention on a single RSet.
>>
>> That's definitely a good point. On the other hand, threads trying to 
>> claim individual RSets should not be too different to threads trying 
>> to claim sub-components of a single RSet. Also, having one RSet per 
>> region is an arbitrary way to cut down contention. If it's possible 
>> to have a single RSet for all the young regions, it will be easy to 
>> expand that and maybe have N RSets (again, each covering all young 
>> regions) and easily map individual threads (whether they are writing 
>> or reading) to one of them. Here, N can be decided on the amount of 
>> concurrency available. One per region is completely arbitrary in that 
>> respect.
>>> Consider machines with a few hundred threads out there.
>>
>> (just to clarify) Do you mean:
>>
>> - Hundreds of HW threads?
>> - Hundreds of Java threads?
>> - Hundreds of GC threads?
>>
>>> Not that I think
>>> anyone ever made measurements regarding that with the current RSets on
>>> such a system either...
>>
>> Not as far as I know...
>>
>>>> (2) G1 RSets: Use a single RSet for old regions of a single CSet
>>>>
>>>> (This is a follow-up to (1))
>>>>
>>>> For the same reasons that it's a good idea to use a single Remembered
>>>> Set (RSet) for all young regions, it'd be also helpful to use a single
>>>> RSet for all old regions that are added to a Collection Set (CSet). In
>>>> combination with (1), this will result in only having either one 
>>>> (young
>>>> GCs) or two (mixed GCs) RSets to scan during GC.
>>>>
>>>> This will be quite a lot more work to do, compared to (1), given 
>>>> that we
>>>> only choose the old regions to add to a CSet at the start of a GC. To
>>>> achieve this we'll have to do a bit more advance preparation: choose
>>>> groups of old regions to collect together (maybe do that 
>>>> concurrently to
>>>> minimize STW work) and combine their RSets. So, at each mixed GC, 
>>>> we'll
>>>> pick an old region group, and its associated RSet, and collect them
>>>> along with the young regions.
>>> A similar effect could be achieved by increasing region size, it's not
>>> as flexible though.
>>
>> That's kind of a big hammer approach. And it might make other parts 
>> of G1 less efficient (you might find fewer completely empty regions 
>> at the end of a marking cycle, which will put more pressure on the 
>> mixed GCs).
>>
>>> It, and all follow-up ideas in that direction, assume that the 
>>> regions that
>>> we want to pick and their order can be precalculated. That may be a 
>>> good
>>> assumption now, as G1 tries to do the cleanup asap, it may not hold 
>>> in the
>>> future.
>>>
>>> Eg. there are already minor complaints that the mixed gc phase impacts
>>> throughput too much during that time, so spreading this phase out 
>>> may be
>>> considered in the future (e.g. just reclaim enough to keep up with 
>>> promotion
>>> rate and then a little).
>>
>> Two things:
>>
>> a) The reason I think it's a good idea to do the mixed GCs 
>> back-to-back is to clean up the heap as quickly as possible to give 
>> the GC and app breathing room as quickly as possible. Yes, I had also 
>> thought of spreading them out more. However, my intuition is that, 
>> given you know there's garbage in heap, might as well try to reclaim 
>> it asap.
>>
>> b) Putting together groups of regions that share an RSet and can be 
>> collected together is not incompatible with spreading out the mixed 
>> GCs. You just have to maintain the RSet for that group for longer. 
>> However, if you do that, you'll end up with more and more stale 
>> entries in it and you might have to do some extra filtering before 
>> you decide to do the collection.
>>
>>> Also, we might need the flexibility to e.g. clear up particular 
>>> regions to
>>> allow allocation of large objects without full GC. That's a major 
>>> concern
>>> in some cases, you may have noticed that there are already some CRs 
>>> trying
>>> to improve G1 behavior with many large objects.
>>
>> (Actually, I haven't looked at all the open GC / G1 CRs)
>>
>> Again, this is a very good observation. However, maintaining complete 
>> RSets for all regions in order to deal with an infrequent edge case 
>> sounds like the wrong trade-off. Maybe you can do something 
>> in-between: Always keep very coarse RSets per region, i.e., which 
>> region, or whatever subdivision you decide on, has references into 
>> that region (i.e., but not at the card level; something that you can 
>> do with a fixed data structure like a bitmap / byte map). Then, if 
>> you unexpectedly want to move some regions to make space for a new 
>> large one, you can still move a few without having to scan the entire 
>> heap, but maybe having to scan more than individual cards.
>>
>>>> (3) G1 RSets: Use one RSet for each GC
>>>>
>>>> (follow-up to (1) and (2))
>>>>
>>>> If we manage to have one young RSet and one old RSet for each mixed 
>>>> GC,
>>>> it'd be straightforward to combine the two to always end up having to
>>>> scan a single RSet at every GC.
>>>  From a contention point of view this seems even worse than (1), but a
>>> straightforward extension.
>>
>> I'd agree. But, I still don't think contention on a single RSet has 
>> to be much worse than having to scan / claim a bunch of RSets. (And 
>> see note about having multiple RSets to match the available 
>> concurrency.)
>>
>>>> (2) proposes to create old region groups, each with its own combined
>>>> RSet. We could take this further and grab this combined old RSet at 
>>>> the
>>>> start of a GC and use that as the young RSet (for Survivor regions
>>>> allocated during that GC and Eden regions allocated after that GC). 
>>>> So,
>>>> at the start of the next GC, all regions in the CSet (Eden, Survivors,
>>>> old region group) will have the same RSet.
>>> Afair the order in the collection candidate list is only ever evaluated
>>> once after mark end at the moment. You could create the groups at that
>>> opportunity already.
>>
>> Absolutely. I was thinking that maybe after a marking cycle has 
>> completed, a concurrent thread (probably, the marking thread itself) 
>> can do a lot of the mixed GC preparation work (examples: sort the old 
>> regions, find which ones should be collected, put together the region 
>> groups, combine / scrub the RSets, etc. whatever needs to be done). 
>> Then, as each region group has been prepared is made available on a 
>> queue. Each evacuation pause bases the mixed / young GC decision on 
>> whether a new group has been made available. This way we move the 
>> expensive preparation work to the concurrent domain and the "when to 
>> do a mixed GC" decision becomes very straightforward.
>>
>>> Also probably needs retuning of the different thresholds for switching
>>> between the coarseness levels of the remembered set.
>>>
>>>> (4) G1 RSets: Filter old RSets before a collection
>>>>
>>>> Old RSets tend to grow quite large and sometimes they contain a large
>>>> amount of stale entries that need to be scanned for nothing during 
>>>> a GC.
>>>> After concurrent marking has completed, but maybe before each mixed 
>>>> GC,
>>>> we could introduce a concurrent process that scans RSets of regions 
>>>> that
>>>> will be collected and filters out any stale entries (by scanning the
>>>> cards that the RSet points to and removing any cards that don't 
>>>> point to
>>>> the region any more). This should reduce the amount of RSet scanning
>>>> work that's done during mixed GCs.
>>>>
>>>> The easiest way to do this, to avoid any nasty races / MT issues, 
>>>> is to
>>>> give each old region a new RSet (during a safepoint, say), store 
>>>> the old
>>> This safepoint need not be an extra safepoint, could be piggybacked to
>>> existing ones.
>>
>> I didn't say anything about having an extra safepoint. I agree, this 
>> can be piggy-backed on an existing safepoint.
>>
>>> That actually sounds quite easy to implement, however,
>>> any suggestions how to find the RSets with the largest amount of stale
>>> entries?
>>
>> Nope. I don't believe is very easy to decide that. I'd just scrub all 
>> RSets for the old regions to be collected.
>>
>>> Apart from the terms "tend to grow quite large" and "sometimes they
>>> contain a large amount of stale entries", do you have some numbers from
>>> experience here?
>>
>> I had had some quick studies but I didn't keep the numbers after I 
>> left Oracle. But I had definitely seen RSets for particular regions 
>> keep growing in size and they really shouldn't.
>>
>>>> one somewhere, scan it (concurrently), and add any entries that 
>>>> need to
>>>> be retained to the new one. At the end, we can just reclaim the old 
>>>> one.
>>> Also, when checking for existing entries, probably only look into the
>>> new one:
>>
>> Of course. The RSet of the region will be the new one. The only one 
>> will be treated similarly to a bunch of dirty cards (as Igor said in 
>> a separate e-mail).
>>
>>> if it does not exist, add it, and any duplicates from the old
>>> will have no effect. (Instead of always checking both RSets).
>>
>> You should never have to check both RSets.
>>
>>>> This way, any new updates to the new RSet will automatically happen
>>>> through the normal path. Explicitly removing an entry, instead of
>>>> recreating the RSet as described, will be much harder, given that we'd
>>>> have to synchronize with other threads potentially trying to add the
>>>> same entry.
>>>>
>>>> This can be done in addition to (2). In fact, the two operations could
>>>> also be combined (i.e. combined old region group's RSets and also 
>>>> filter
>>>> them at the same time).
>>>>
>>>> ----------
>>>>
>>>> (5) G1 RSets: Don't use the card table during RSet scanning to reduce
>>>> duplication
>>>>
>>>> Currently, while scanning Remembered Sets (RSets) during a GC we 
>>>> use the
>>>> card table to mark whether a particular card has been scanned or not
>>>> (since the same card might appear on multiple RSets). This is bad 
>>>> for a
>>>> couple of reasons (the first one is the important one, the second 
>>>> one we
>>>> could address reasonably easily):
>>>>
>>>> - First, we have to clear the card table at the end of the GC, which
>>>> requires a non-trivial amount of work.
>>> Do you have numbers here? It definitely is a valid concern (with large
>>> heaps), but in the logs I got so far, this has not been a really big 
>>> issue
>>> yet because the other phases are so much longer...
>>
>> Last time I checked, G1 kept track of which regions would potentially 
>> have dirty cards (IIRC: the collection set regions and any regions 
>> that appeared on the RSet of the collection set regions; I could be 
>> wrong). And it only clears the CT for those regions. The Clean CT 
>> phase is on the GC log. It might not be very long, but if you can 
>> eliminate it, it can only be a good thing. Also, there might be more 
>> regions whose CT needs to be cleared after mixed GCs (i.e., don't 
>> base your decision only on data from young GCs). Finally, if you can 
>> also remove the code that keeps track of which regions might contain 
>> dirty cards post GC can also be a good thing.
>>
>>> A more coarse second-level card table (512k/1M card size?) remembering
>>> approximately where these marks are should already go a long way 
>>> towards
>>> reducing the overhead?
>>
>> As I said earlier, only part of the CT is cleared.
>>
>>> Do you have some numbers on how far these card marks are spread out? 
>>> I would
>>> assume that particularly on large heaps, only a fraction of the card 
>>> table
>>> is ever dirtied per collection.
>>
>> For young GCs, I agree there probably won't be too many cards that 
>> get dirty during a GC. I'm sure there will be more during mixed GCs.
>>
>>>> - Second, the card marks during RSet scanning are not done atomically
>>>> (atomics were deemed too expensive). This is safe (rescanning a 
>>>> card is
>>>> not a correctness issue and it shouldn't happen very often anyway).
>>>> However, this prevents some subtle optimizations (when one thread 
>>>> comes
>>>> across a reference on a card that points into the CSet and pushes its
>>>> location on the stack, it has to recheck that it still points into the
>>>> CSet when it eventually pops it from the stack, given that another
>>>> thread might also scan that card and update the reference meanwhile).
>>> :)
>>>
>>>> A better solution will be to somehow use the RSet data structure for
>>>> each thread to claim RSet chunks. It can even be done in a destructive
>>>> way, given that the RSet will be reclaimed at the end of the GC 
>>>> anyway.
>>> That would be something nice (that is not slower in the end than just
>>> clearing the card table).
>>
>> Well, after a collection the RSet(s) of the CSet regions is/are 
>> reclaimed given that the CSet regions are reclaimed. So, if you have 
>> a couple of extra fields on the RSet to help with the claiming 
>> algorithm (or each RSet component, etc.) it shouldn't have any extra 
>> overhead at the end.
>>
>>>> This depends on (3), given we can only ensure that each card is 
>>>> claimed
>>>> exactly once if it appears on a single RSet.
>>>>
>>>> ----------
>>>>
>>>> (6) G1 RSets: Only create old region RSets when they are needed
>>>>
>>>> Currently, each region has its own RSet from the beginning and that's
>>>> maintained throughout the region's lifetime. For very long regions, it
>>>> turns out that long-lived old regions end up with a lot of stale 
>>>> entries
>>>> in their RSets (as ref fields are updated over time to point to
>>>> different objects) which make them very bloated. An important
>>>> observation is that the old region RSets are only really needed during
>>>> mixed GCs after a concurrent cycle. Sometimes, concurrent cycles 
>>>> happen
>>>> very infrequently (every few hours, maybe) so we constantly 
>>>> maintain the
>>>> old region RSets and only take advantage of them very infrequently.
>>>>
>>>> A better approach would be to eliminate RSets from old regions 
>>>> (say, set
>>>> the RSet field to NULL for those regions so that any refinement
>>>> operation will just do nothing for refs that point to such regions) 
>>>> and
>>>> only recreate them when needed.
>>> Totally agree.
>>>> One approach, which has been done by other GCs, is to recreate the 
>>>> RSets
>>>> during marking (as the marking phase will visit all live objects).
>>> Adding that to the marking phase is simple.
>>>> Maybe a better approach will be to finish the marking phase, do some
>>>> quick analysis on the old regions to decide which to collect and which
>>>> not to collect, and do a sweep phase to populate the RSets of the
>>>> regions that will be collected. Only when that operation is done, 
>>>> mixed
>>>> GCs will be allowed.
>>> Do you think sweeping the entire heap is faster than piggy-backing 
>>> to the
>>> marking, even if done concurrently?
>>
>> (clarification : not the entire heap, but only old regions) Yeah, 
>> interesting trade-off here. I can totally see the attraction of 
>> building the RSets piggy-backed on marking. However, you might end up 
>> building several RSets for regions that you won't end up collecting 
>> (because they're very dense with live objects); so that'd be wasted 
>> work. I kinda like the the approach of doing it as a separate step as 
>> it allows you to decide exactly which RSets to build. Also, maybe the 
>> idea I mentioned earlier about always maintaining very coarse-grain 
>> RSets per region, could help in reducing how much of the old gen 
>> you'd have to sweep (this can of course degrade to scanning the 
>> entire old gen after all).
>>
>>>> This is a (better but more involved) alternative to (4) as the 
>>>> recreated
>>>> RSets will not be very long-lived, so they will hopefully not need
>>>> further filtering.
>>>>
>>>> This can be combined with the approach in (2), i.e., find the old 
>>>> region
>>>> groups we'll collect, give all regions in each group a new RSet, and
>>>> only then do the sweeping phase to create directly RSets per group
>>>> instead of per region.
>>>>
>>>> ----------
>>>>
>>>> (7) G1 RSets: Redesign for remembered set data structures
>>>>
>>>> This is somewhat orthogonal to the other G1 RSet issues, but there
>>>> should be a lot of gains to be made by rethinking / redesigning the 
>>>> RSet
>>>> data structures. The concept of having multiple RSet granularities is
>>>> quite interesting. Unfortunately, the implementation is quite
>>> The idea seems good.
>>>
>>>> inefficient. In particular, there are two hash tables (for fine and
>>>> sparse entries) which is completely wasteful (requires two look-ups 
>>>> and
>>>> extra footprint for the second table). Also, several operations 
>>>> require
>>>> locks which affects scalability. Finally, the implementation is too
>>>> dependent on the region abstraction (i,e., it's baked in the code that
>>>> each RSet component represents a region-to-region relationship) which
>>>> makes them very inflexible (RSset components can get very large when
>>>> very large regions are used).
>>> I think just assigning the same RSet to multiple regions works already.
>>
>> Have you changed this recently?
>>
>>> There is some overhead that will be incurred due to filters assuming a
>>> one-to-one relationship, but that should be minimal, and benign as 
>>> in the
>>> end it will be coalesced with the existing entry anyway. (Did not 
>>> really
>>> look at the code).
>>
>> IIRC, the way the RSet mapped entries to card addresses was to get 
>> the region address from the HeapRegion data structure.
>>
>>>> The goals of a redesign should be the following:
>>>>
>>>> - At a high-level, the new structure should just be a card address ->
>>>> bool map (false -> card not included in the RSet, true -> card 
>>>> included
>>>> in the RSet).
>>> I am not sure what this means: this suggests an interface, not an 
>>> implementation.
>>
>> What I tried to say is that the implementation shouldn't deal with 
>> the HeapRegion data structure at all:
>>
>> void OtherRegionsTable::add_reference(OopOrNarrowOopStar from, int 
>> tid) {
>>   ...
>>   // Note that this may be a continued H region.
>>   HeapRegion* from_hr = _g1h->heap_region_containing_raw(from);
>>   RegionIdx_t from_hrs_ind = (RegionIdx_t) from_hr->hrs_index();
>>   ...
>> }
>>
>> PerRegionTable*
>> OtherRegionsTable::find_region_table(size_t ind, HeapRegion* hr) const {
>>   assert(0 <= ind && ind < _max_fine_entries, "Preconditions.");
>>   PerRegionTable* prt = _fine_grain_regions[ind];
>>   while (prt != NULL && prt->hr() != hr) {
>>     prt = prt->collision_list_next();
>>   }
>>   // Loop postcondition is the method postcondition.
>>   return prt;
>> }
>>
>> etc.
>>
>> The are many problems with this: First, there's complete lack of 
>> flexibility if we want to use one RSet for many regions (or if we 
>> want to change the granularity of the RSet's components). Second, 
>> there could be stale entries on the RSet that point to free regions 
>> that are being re-allocated concurrently with the refinement code 
>> that's actually using hr() (we had some very horrible races in this 
>> code). This is why I said the RSet should just be a card address -> 
>> bool map and know nothing about heap regions.
>>
>> Tony
>>
>>>> - A subdivision of the heap, so that each such heap chunk is 
>>>> represented
>>>> by a different structure, can still be used. However, chunks should 
>>>> not
>>>> be tied to regions. I.e., a region might be covered by several chunks
>>>> (helpful, when regions are large).
>>> Okay.
>>>
>>>> - The fine / medium / coarse granularity approach will still be useful
>>>> to represent each chunk as efficiently as possible. However, there
>>>> should be a single hash table that will keep track of the entries 
>>>> of all
>>>> granularities.
>>>> - The RSet structure should have a facility for threads to claim 
>>>> entries
>>>> to work on them. This is to facilitate (5).
>>> :)
>>>
>>>> - Addition operations should be entirely lock-free.
>>>> - RSet iterations should be done using iterators.
>>>> - A facility to efficiently do a union of various RSets will also be
>>>> helpful (to facilitate (2)).
>>> Thomas
>>>
>>>
>>
>

-- 
Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprintezis at twitter.com