RFR(M): 6921087: G1: remove per-GC-thread expansion tables from the fine-grain remembered sets

Mon Jun 25 21:43:42 UTC 2012

Hi Brandon, John,

On Mon, 2012-06-25 at 12:51 -0700, John Cuthbertson wrote: 
> Hi Brandon,
> 
> Thanks for the suggestion. The set of changes sent out were just the 
> cleanup changes. Thomas actually has an optimization where he maintains 
> a list of PRTs for each RSet. When he comes to free them - it's a single 
> list concatenation on to the free list which should have the same 
> benefit as you describe. In the meantime I'll forward your email to 
> Thomas to see if he has any comments.
> 
> On 06/19/12 13:15, Brandon Mitchell wrote:
> > @@ -992,10 +714,10 @@
> >  void OtherRegionsTable::clear() {
> >    MutexLockerEx x(&_m, Mutex::_no_safepoint_check_flag);
> >    for (size_t i = 0; i < _max_fine_entries; i++) {
> > -    PosParPRT* cur = _fine_grain_regions[i];
> > +    PerRegionTable* cur = _fine_grain_regions[i];
> >      while (cur != NULL) {
> > -      PosParPRT* nxt = cur->next();
> > -      PosParPRT::free(cur);
> > +      PerRegionTable* nxt = cur->next();
> > +      PerRegionTable::free(cur);
> >        cur = nxt;
> >      }
> >
> > Linking the PerRegionTables into a temporary list in the loop and
> > calling PerRegionTable::free() with that list should reduce contention
> > potential on the global freelist. You could also tighten up the
> > MutexLockerEx around the loop, and link to the free list without holding
> > that Mutex around the freelist CAS/spin cycle unnecessarily.
> 

As John mentioned, the actual patch is still pending; we split the
cleanup from the actual change. The pending change uses the same
technique, among other things, to improve performance.

The initial reason for looking at this was very long collection set
clear times, often taking more time than the "actual" gc (on large
heaps, with a smallish pause time goal).

The (pending) changes can be summarized as follows:

  - links together all PerRegionTables in a doubly linked list on a per
region remembered set basis as they are allocated.

  - doubly linked because it is easy and fast to remove single elements
from this list when the particular PerRegionTable remembered set is
coarsened. The patch adds two additional fields for the doubly linked
list, but this cleanup patch removed one or two member variables, so we
should be even again.

  - the patch also walks this list of all PerRegionTables to count the
number of remembered set entries - this is a lot faster than walking the
hash table array. (Which is another performance issue not mentioned yet)

  - the changes to improve clearing the fine remembered set themselves
are as follows:
    * first, if there are no PerRegionTables at all (which is not
uncommon btw), do not do anything: neither walk nor zero out (again)
the hash table array,
    * second, you can link the whole set of now free PerRegionTables
into the free list in a single CAS operation. I.e. similar to your
idea, but since all PerRegionTables are already linked together there
is no need to do it again.
Consider that there may literally be thousands of regions to clear...
each with a lot of PerRegionTables to free - e.g. 16G young gen at 2M
regions -> 8k regions, so actually even "only" linking them together
right away will probably just take too long too.
    * and third, do a memset() to clear the hash table, not NULL'ing out
per element as done previously.
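
To make the above a bit more concrete, here is a rough standalone
sketch of the idea. It is not the pending patch: Prt, FineTable,
g_free_list and all field names are made up for illustration, and it
uses std::atomic instead of the HotSpot Atomic/Mutex primitives. It only
shows the per-rset doubly linked list, the O(1) unlink, the early exit,
the splice into the free list with one CAS (retried only if another
thread raced), and the memset() of the bucket array.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for PerRegionTable; all names are assumptions.
struct Prt {
  std::size_t bucket = 0;    // hash bucket this table lives in
  Prt* hash_next = nullptr;  // collision chain inside the bucket
  Prt* prev = nullptr;       // doubly linked list of all tables of one
  Prt* next = nullptr;       // rset; 'next' is reused by the free list
};

// Global free list, threaded through Prt::next (push-only here).
static std::atomic<Prt*> g_free_list{nullptr};

struct FineTable {
  static constexpr std::size_t kBuckets = 8;
  Prt* buckets[kBuckets];
  Prt* first = nullptr;
  Prt* last = nullptr;
  std::size_t count = 0;     // number of tables, kept up to date

  FineTable() { std::memset(buckets, 0, sizeof(buckets)); }

  // Link a new table into its bucket and into the all-tables list.
  void add(Prt* p, std::size_t bucket) {
    assert(p != nullptr && bucket < kBuckets);
    p->bucket = bucket;
    p->hash_next = buckets[bucket];
    buckets[bucket] = p;
    p->prev = nullptr;
    p->next = first;
    if (first != nullptr) first->prev = p; else last = p;
    first = p;
    ++count;
  }

  // Remove one table, e.g. when it is coarsened. The all-tables list
  // removal is O(1) because the list is doubly linked.
  void unlink(Prt* p) {
    Prt** link = &buckets[p->bucket];
    while (*link != p) link = &(*link)->hash_next;
    *link = p->hash_next;
    if (p->prev != nullptr) p->prev->next = p->next; else first = p->next;
    if (p->next != nullptr) p->next->prev = p->prev; else last = p->prev;
    p->hash_next = p->prev = p->next = nullptr;
    --count;
  }

  void clear() {
    // No tables at all: the bucket array is already zero, do nothing.
    if (first == nullptr) return;

    // Splice the whole pre-linked chain onto the free list. One CAS in
    // the uncontended case; the loop only repeats if the head changed.
    Prt* head = g_free_list.load(std::memory_order_relaxed);
    do {
      last->next = head;
    } while (!g_free_list.compare_exchange_weak(
                 head, first,
                 std::memory_order_release, std::memory_order_relaxed));

    first = last = nullptr;
    count = 0;
    // Zero the bucket array in one go instead of NULL'ing per element
    // (assumes all-zero bits is a null pointer, as on usual platforms).
    std::memset(buckets, 0, sizeof(buckets));
  }
};
```

Note that clear() never touches the individual tables except for the
last one in the chain, so its cost no longer depends on how many tables
the remembered set holds (apart from the memset() over the bucket
array).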

Regarding contention, I do not think that after these changes taking the
mutex and the CASes are points of contention any more. I did not really
check, but these clear operations are typically done (atm) serially, or
distributed across threads on a per-region basis, so there will be no
contention on the mutex (and there never has been: you typically clear
only if you know nobody else is currently processing the region's
rset...); the CAS is basically taken care of too, since there is now
only one per cleared remembered set.

Tests showed that after applying the coming patch, the remaining time
is mostly spent clearing the sparse remembered sets. Imho the sparse
remembered set essentially needs a redesign/rewrite because it was not
written with fast clearing in mind at all.
Performance runs on larger heaps indicate that the above changes reduce
the average total time spent clearing the remembered sets significantly
(50-70% in my measurements, depending on remembered set composition).

Hth,
  Thomas