G1 RSet improvement ideas

Tony Printezis tprintezis at twitter.com
Wed Feb 5 22:22:26 UTC 2014


Thomas,

On 2/4/14, 5:34 AM, Thomas Schatzl wrote:
> Hi,
>
> On Mon, 2014-02-03 at 12:13 -0500, Tony Printezis wrote:
>> Thomas,
>>
>> Thanks for actually reading my brain dump and for the comments. :-) See
>> inline.
>>
> If not clear already, my comments were just that too (at least mine are): full of assumptions, errors, etc. :)

Of course, mine too. This is a good discussion, though.

>> On 2/3/14, 4:58 AM, Thomas Schatzl wrote:
>>> Hi,
>>>
>>>     let me have a stab at this long mail with a few half-baked
>>> thoughts... ;)
>>>
>>> On Thu, 2014-01-30 at 12:33 -0500, Tony Printezis wrote:
>>>> Hi all,
>>>>
>>>> I had a recent e-mail exchange with Bengt and Thomas and we touched a
>>>> bit on G1 RSet work. RSets can be a major bottleneck in G1 right now,
>>>> but I don't believe it's a fundamental issue with the concept. IMHO, the
>>>> implementation can be substantially improved to be much more efficient.
>>>> I recently created a few issues in our (Twitter) bug system with some
>>>> ideas on how to do that. I pasted all the text below (but replaced the
>>>> internal issue #s with (1), (2), etc.).
>>>>
>>>> Let me know if you have any ideas, thoughts, etc. on this.
>>>>
>>>> Tony
>>>>
>>>> ----------
>>>>
>>>> (1) Use a single RSet for all young regions
>>>>
>>>> In G1 each region has its own Remembered Set (RSet). During a
>>>> collection, all RSets of the Collection Set (CSet) regions need to be
>>>> scanned. Given that all young regions are always collected together
>>>> during the same collection, it will be beneficial to actually make all
>>>> young regions point to the same physical RSet. This will have several
>>>> advantages:
>>>>
>>>> - lower overall RSet footprint (better to have one fuller RSet than
>>>> many mostly empty ones; this will also quickly eliminate duplicate RSet
>>>> entries across all young regions)
>>> That is true, but in my experience with a similar system (not on G1),
>>> the savings are not that big - of course there are some, but not huge, IIRC.
>> A quick study to measure this would be interesting. I would be really
>> shocked if we don't see quite a lot of duplication, I have to say
>> (especially when a very large eden is used).
> Possibly our definition of "quite a lot" differs :) I agree.

:-)
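
To make this concrete, here's a rough sketch of what sharing one 
physical RSet could look like (the names here are made up for 
illustration; they're not the actual HeapRegion API):

// Illustrative only: G1SharedYoungRemSet and set_rem_set() are
// hypothetical names, not the current HeapRegion interface.
void G1CollectedHeap::register_young_region(HeapRegion* hr) {
  // Every young region points to the same physical RSet, so a card
  // inserted on behalf of one young region is automatically
  // de-duplicated across all of them.
  hr->set_rem_set(_shared_young_rset);
}

void G1CollectedHeap::reclaim_young_rsets_after_gc() {
  // One structure to reclaim, instead of one per young region.
  _shared_young_rset->clear();
}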

>>> In that direction one could experiment with another idea: a "sticky young
>>> gen" value in the card table, which can also be used to filter out references
>>> between old gen regions originating from these cards (they will be
>>> scanned anyway). I think I had that implemented somewhere/sometime; it
>>> should work.
>>>
>>> I.e. during refinement, if we find a to-young reference, instead of
>>> clearing the card at the end, mark it with some "sticky young value" so
>>> that the barrier will not redirty that card. Maybe also just leaving it
>>> dirty may be sufficient.
>>>
>>> Note that this idea dirties the card table, so requires clearing.
>> Note that G1 does try to avoid "dirty / refine / clean / re-dirty /
>> re-refine / re-clean / etc." patterns on cards (whether they point to
>> the young gen or not) using the hot card cache (unless you've recently
>> removed it! haven't looked at that code for a while... but it still
>> seems to be there).
> In this idea you would just keep cards having any young reference dirty on the
> card table.
> I.e. automatically filtering out duplicates across *all* rsets. Except for
> the additional data structure maintenance, equal to having a single young
> RSet.

But isn't this exactly what the hot card cache does? It keeps the card 
dirty so that it's not enqueued again by the post-barrier and it's not 
added to multiple RSets. I can see how keeping a card dirty that points 
into the young gen (to generalize this: keeping a card dirty that points 
into the next collection set) can reduce the concurrent refinement 
overhead. But I believe the hot card cache should have a similar effect.
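
For reference, the filtering effect we're both describing boils down 
to something like the sketch below (card_for() and enqueue() are 
stand-ins; the real barrier and refinement code differ in detail):

void post_barrier(void* field) {
  jbyte* card = card_for(field);       // hypothetical helper
  if (*card == dirty_card_val) {
    return;  // already dirty (or "sticky young"): nothing to enqueue
  }
  *card = dirty_card_val;
  enqueue(card);                       // hand the card to refinement
}

// The "sticky young" variant: refinement leaves a card dirty when it
// holds a reference into the young gen (more generally: into the next
// CSet), so the check above keeps filtering it out, much like the hot
// card cache does.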

> There is the guarantee that at least one young rset (or the big single
> one) contains that card anyway.
>
> [I am also not completely convinced about the usefulness of the hot card cache
> on large systems, especially the old implementation (which you probably
> remember. :) It has changed somewhat since, and its overhead reduced I hope.]
>
>> FWIW: apart from avoiding card duplication, using a single RSet for all
>> young regions will also have additional advantages: lower footprint
>> (only one table instead of one per region), faster to reclaim, fewer
>> malloc / frees, etc.
> I am not so sure about footprint, see below.

If the RSets are reasonably sparse, having one fuller table instead of 
100s of very sparse tables should be a win (IMHO at least).
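
Back-of-the-envelope, with purely illustrative numbers: 500 young 
regions, each with a mostly-empty sparse table sized for 128 4-byte 
entries, add up to ~250 KB of tables whose contents largely duplicate 
each other; a single shared table holding the union of those entries 
could plausibly be an order of magnitude smaller.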

> The other improvements are
> certainly true. However a better RSet implementation should do away with
> (hopefully) large parts of this overhead.
>
>>>> - more efficient RSet scanning (the number of RSets scanned during a
>>>> collection will not depend on the number of young regions in the CSet)
>>> The main contributing factor for rset scanning is the number of entries,
>>> not the number of rsets.
>> I don't completely agree with this I have to say. Let's assume you have
>> some card duplication across all young RSets (for the sake of argument).
>> Having each thread visit each RSet (imagine you have a few
>> 100s), check whether it's been claimed, try to claim it, then try to
>> claim cards and contend on the card table is not cheap. Just looking at
>> a single RSet, seeing quickly that its entries have been claimed by
>> other GC threads, and going on to the next phase could be much more
>> efficient.
> You certainly have more experience here. I do not know the exact
> implementation of the claim mechanism, but I will look at it. However I
> assume the claiming is coarse enough that its overhead is small - and so,
> presumably, is the potential gain.

As a general rule: Don't assume anything in G1. ;-)

>  From my measurements (on "small" machines/benchmarks) actual heap
> walking and reference processing were the most time consuming during GC
> pause by far.
> (I am not claiming that there is no overhead. Just that it's not that
> big as you seem to suggest. I am all for looking at the RSets again).
>
>>>    You can always improve the distribution across
>>> threads if that is your concern.
>>>
>>> (G1 typically takes the easy approach by distributing work on a per-region
>>> basis. That's more an artifact of a quick solution than a real solution.
>> Are you talking about distributing RSet scanning work? Or in general?
> Both and in general.

I think in general, the region heap subdivision is at least a good 
starting point for load balancing. In some cases, that's enough 
(parallel verification). In other cases, finer-grained load 
balancing is necessary (parallel marking).

>>> There are already a few CRs and prototypes that improve this situation).
>> Can you point me to a couple?
> The main area where this has been a problem is code root marking
> (JDK-8025813). I remember that sometimes RSet scan is also too
> unbalanced, keeping other threads waiting, but no CR yet, as it's not too
> bad most of the time and there are other, more important problems in the
> area of parallelization.
>
> Others are, off the top of my head, generic parallelization problems, either
> not tied to regions or not yet (code root migration (no CR), free cset
> (JDK-8027295),

I've always wanted to do that concurrently, after the pause is done. 
There's no real reason to do that during a pause. (I'll add a 
couple of comments on that CR.)

> class loader data graph (JDK-8030144), card redirtying
> (JDK-8019342)).

If the pause didn't use the card table for GC threads to claim cards, 
you might have been able to do that during the pause.

> The CRs are likely not too informative about any proposed solution. Sorry.

(thanks for pointing me to these BTW)

[snip]

>>> It, and all follow-up ideas in that direction, assume that the regions that
>>> we want to pick and their order can be precalculated. That may be a good
>>> assumption now, as G1 tries to do the cleanup ASAP, but it may not hold in
>>> the future.
>>>
>>> E.g. there are already minor complaints that the mixed GC phase impacts
>>> throughput too much during that time, so spreading this phase out may be
>>> considered in the future (e.g. just reclaim enough to keep up with promotion
>>> rate and then a little).
>> Two things:
>>
>> a) The reason I think it's a good idea to do the mixed GCs back-to-back
>> is to clean up the heap, and so give the GC and app breathing room, as
>> quickly as possible. Yes, I had also thought of
>> spreading them out more. However, my intuition is that, given you know
>> there's garbage in the heap, you might as well try to reclaim it ASAP.
>  From a throughput POV it is certainly best because you keep the remsets
> for the least amount of time. Also there is a non-negligible overhead in
> just picking a single old gen region for evacuation - although G1 tends
> to overestimate the impact a lot (imo).
>
> Then, typically keeping unreclaimed memory decreases the amount of memory
> available for the young gen too.
>
> There're a lot of tradeoffs here.

Agreed.

>> b) Putting together groups of regions that share an RSet and can be
>> collected together is not incompatible with spreading out the mixed GCs.
>> You just have to maintain the RSet for that group for longer. However,
>> if you do that, you'll end up with more and more stale entries in it and
>> you might have to do some extra filtering before you decide to do the
>> collection.
> Sure. This was just a remark that in some applications the gain may not
> be that large because you are forced to keep the (separate) RSets longer
> for one reason or another.
>
>>> Also, we might need the flexibility to e.g. clear up particular regions to
>>> allow allocation of large objects without full GC. That's a major concern
>>> in some cases, you may have noticed that there are already some CRs trying
>>> to improve G1 behavior with many large objects.
>> (Actually, I haven't looked at all the open GC / G1 CRs)
> You should be able to find all of them by searching for the "gc-g1"
> label.
>
>> Again, this is a very good observation. However, maintaining complete
>> RSets for all regions in order to deal with an infrequent edge case
>> sounds like the wrong trade-off. Maybe you can do something in-between:
> It does not seem too uncommon for some class of applications (web
> application servers) to have large, short-lived per-request buffers.
>
> While it may not be too common for you after all, unfortunately the impact
> seems too high to ignore in some cases. See e.g. JDK-8031381 or JDK-8027959
> for (some) examples. We see completely unacceptable behavior with G1 in
> these tests due to too many full GCs caused by large objects.

(I posted a comment on one of those.) BTW, you can always maintain RSets 
just for the H regions; I don't think that would be a huge overhead.

[snip]

>>>> - Second, the card marks during RSet scanning are not done atomically
>>>> (atomics were deemed too expensive). This is safe (rescanning a card is
>>>> not a correctness issue and it shouldn't happen very often anyway).
>>>> However, this prevents some subtle optimizations (when one thread comes
>>>> across a reference on a card that points into the CSet and pushes its
>>>> location on the stack, it has to recheck that it still points into the
>>>> CSet when it eventually pops it from the stack, given that another
>>>> thread might also scan that card and update the reference meanwhile).
>>> :)
>>>
>>>> A better solution will be to somehow use the RSet data structure for
>>>> each thread to claim RSet chunks. It can even be done in a destructive
>>>> way, given that the RSet will be reclaimed at the end of the GC anyway.
>>> That would be something nice (that is not slower in the end than just
>>> clearing the card table).
>> Well, after a collection the RSet(s) of the CSet regions is/are
>> reclaimed given that the CSet regions are reclaimed. So, if you have a
>> couple of extra fields on the RSet to help with the claiming algorithm
>> (or each RSet component, etc.) it shouldn't have any extra overhead at
>> the end.
> That's a tough tradeoff to make: the RSet is already too large, but I
> would need to think about it. Reusing the RSet memory of already
> processed RSet data for that might be doable.
>
> I am just worried that this takes too much time: compared to a simple
> card table lookup, any lookup in a more complicated data structure seems
> expensive. Needs to be investigated.

I'm not quite sure I understand this. For the sake of argument, imagine 
that the RSet only has sparse tables. Each thread can iterate over the 
RSet and try to claim each sparse table (by atomically setting a field 
on the sparse table; it can maybe re-use one of the existing fields for 
that). Once it has claimed said table, it can go and scan the cards 
represented in it. I can't see why this would be worse than trying to 
claim every card in case there's duplication.
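
Roughly what I have in mind (a sketch only; SparseTable, _claimed, and 
scan_cards() are illustrative names, not the existing RSet classes):

// One CAS per sparse table, instead of one card-table write per card.
bool SparseTable::try_claim() {
  // _claimed could reuse an existing field, since the RSet is
  // reclaimed along with its CSet region at the end of the GC anyway.
  return Atomic::cmpxchg(1, &_claimed, 0) == 0;
}

void scan_rset(RSet* rset, G1ParScanThreadState* pss) {
  for (SparseTable* t = rset->first(); t != NULL; t = t->next()) {
    if (t->try_claim()) {
      t->scan_cards(pss);  // this thread now owns the whole table
    }
    // otherwise skip quickly: another thread claimed it
  }
}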

>>> Do you think sweeping the entire heap is faster than piggy-backing on the
>>> marking, even if done concurrently?
>> (clarification: not the entire heap, but only old regions) Yeah,
> Sure :)
>
>> interesting trade-off here. I can totally see the attraction of building
>> the RSets piggy-backed on marking. However, you might end up building
>> several RSets for regions that you won't end up collecting (because
>> they're very dense with live objects); so that'd be wasted work. I kinda
>> like the approach of doing it as a separate step as it allows you to
>> decide exactly which RSets to build. Also, maybe the idea I mentioned
> There's a simple workaround: if the RSet grows too complicated, drop the
> rset of that region, given some complexity (e.g. memory) bound. Also,
> you might use previous markings to determine which regions to prefer.

Absolutely. However, that could mean a lot of wasted work.

> Things like large objects make it a little complicated, but I think that
> can be managed (by e.g. maintaining coarse level RSets for regions that
> break large contiguous areas).
>
> Even scanning only the old gen might require scanning tens of GBs of
> memory, and iterating over all references in them isn't exactly cheap.

Absolutely. :-)

> Might as well do another marking round instead.

FWIW, sweeping has quite different performance trade-offs compared to 
marking (it can be a lot more efficient because it probably has much 
better cache behavior).

> I do not know, likely
> scanning is faster, but is it worth the extra code?

Writing code to iterate over the heap to rebuild RSets should not be 
that much work, and it keeps the marking code more straightforward. Again, 
like you, I can make the argument both ways (piggy-back on marking or do 
an additional phase), depending on the shape of the heap.
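
Something along these lines (a sketch; rebuild_requested() and the 
closure name are made up, but add_reference() is the existing RSet 
entry point):

class RebuildRSetOopClosure : public OopClosure {
  G1CollectedHeap* _g1h;
public:
  virtual void do_oop(oop* p) {
    oop obj = *p;
    if (obj != NULL) {
      HeapRegion* to = _g1h->heap_region_containing(obj);
      if (to->rebuild_requested()) {   // hypothetical predicate
        to->rem_set()->add_reference(p, 0 /* tid */);
      }
    }
  }
  virtual void do_oop(narrowOop* p) { /* same, after decoding */ }
};

// Then sweep only the selected old regions linearly, applying this
// closure to the references in each live object; the linear sweep is
// where the better cache behavior should come from.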

[snip]

>>> There is some overhead that will be incurred due to filters assuming a
>>> one-to-one relationship, but that should be minimal, and benign as in the
>>> end it will be coalesced with the existing entry anyway. (Did not really
>>> look at the code).
>> IIRC, the way the RSet mapped entries to card addresses was to get the
>> region address from the HeapRegion data structure.
>>
>>>> The goals of a redesign should be the following:
>>>>
>>>> - At a high-level, the new structure should just be a card address ->
>>>> bool map (false -> card not included in the RSet, true -> card included
>>>> in the RSet).
>>> I am not sure what this means: this suggests an interface, not an implementation.
>> What I tried to say is that the implementation shouldn't deal with the
>> HeapRegion data structure at all:
>>
>> void OtherRegionsTable::add_reference(OopOrNarrowOopStar from, int tid) {
>>     ...
>>     // Note that this may be a continued H region.
>>     HeapRegion* from_hr = _g1h->heap_region_containing_raw(from);
>>     RegionIdx_t from_hrs_ind = (RegionIdx_t) from_hr->hrs_index();
>>     ...
>> }
>>
>> PerRegionTable*
>> OtherRegionsTable::find_region_table(size_t ind, HeapRegion* hr) const {
>>     assert(0 <= ind && ind < _max_fine_entries, "Preconditions.");
>>     PerRegionTable* prt = _fine_grain_regions[ind];
>>     while (prt != NULL && prt->hr() != hr) {
>>       prt = prt->collision_list_next();
>>     }
>>     // Loop postcondition is the method postcondition.
>>     return prt;
>> }
>>
>> etc.
>>
>> There are many problems with this: First, there's a complete lack of
>> flexibility if we want to use one RSet for many regions (or if we want
>> to change the granularity of the RSet's components). Second, there could
>> be stale entries on the RSet that point to free regions that are being
>> re-allocated concurrently with the refinement code that's actually using
>> hr() (we had some very horrible races in this code). This is why I said
>> the RSet should just be a card address -> bool map and know nothing
>> about heap regions.
> Just thinking aloud, with no particular purpose, and not refuting anything
> you mentioned above:
>
> There is a big downside of your suggestion doing this (blindly): memory
> consumption. Currently cards are indices within a base region, taking
> only 4 bytes per entry.

Yes (and the card indexes used to be 2-byte shorts FWIW).

> Without the heap region card index as base
> offset, you would need double the size per entry.

Not sure why. You can still organize the RSets in a similar manner: each 
RSet component today corresponds to an address range [from, to) of size 
L. You can still structure it as having the from address plus indexes 
into said address range. This is not really different from what we 
have now. I'm just proposing that L should not necessarily be the heap 
region size and that we don't get the [from, to) data from a HeapRegion 
object.
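
For example, a component could be as simple as this (sketch only; the 
names and the fixed-size bitmap are for illustration):

// A component is just a card-address -> bool map over an arbitrary
// aligned range [base, base + L); it knows nothing about HeapRegions.
class RSetComponent {
  HeapWord* _base;                          // start of covered range
  uint64_t  _bits[CardsPerComponent / 64];  // one bit per card

public:
  void add_card(const void* addr) {
    size_t byte_offset = (const char*)addr - (const char*)_base;
    size_t card = byte_offset >> CardShift;  // card index within range
    _bits[card >> 6] |= (uint64_t)1 << (card & 63);
  }
};

// L (= CardsPerComponent << CardShift) need not be the region size,
// and _base need not come from a HeapRegion object.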

> I am not sure if perfect duplicate avoidance will cancel that out.
>
> So you probably want to have some sort of "base card index" anyway for
> each of the sub-rsets,

Absolutely, I never said we shouldn't.

> however this will take away from storage efficiency
> again, as you are probably going to store on a per-region (or per-area)
> basis to keep memory footprint low. E.g. assuming that you allocate the
> per-area storage in X entry sized chunks, there will be some unused space.
> Which more or less gives you the same effective behavior as before.
> (With the same basic sub-rset implementations).
>
> In the case with a single rset for multiple regions, instead of having
> multiple RSets each having their own per-source region data structure,
> you have one RSet having per-source "area" data structures. I can see
> that there is a certain improvement on management overhead. However you
> might want more areas per region, which increases memory overhead again.

I'm not quite following this, I have to say. The RSet data structure 
encodes where in the heap there are references into a region or set of 
regions. You don't need to encode on the RSet which regions those are. 
You only need to encode where the references are coming from. And you 
just point to that RSet from all the regions it corresponds to.

> Also, for scrubbing efficiency you might not want the memory area that a
> sub-rset covers to cross a region boundary either (or at least not to span a lot of them).

I absolutely agree.

> Using a HeapRegion* directly as such "base index" is an unfortunate
> design decision though, and I guess that's what you want to avoid?
> (Finally got your point, I believe.)

Absolutely. We should not only avoid having HeapRegion* provide the base 
index for each RSet component, but also avoid having HeapRegion::GrainBytes 
dictate each component's granularity.

Tony

> (I haven't thought that through as much as you did, as you might notice;
> however, I definitely assume we are talking about the same thing
> and we agree about that. :)
>
> Thanks,
> Thomas
>
>

-- 
Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprintezis at twitter.com



