G1GC Full GCs

Fri Jul 30 13:47:31 PDT 2010

> I consistently am seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, these estimates are I think excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short.

(The following is wannabe speculation based on limited understanding
of the code, please take it with a grain of salt.)

My first thought here is swapping. My reading is that "other" time is
(or at least is intended to be) the collection set selection time plus
the time to free the collection set. I think (am I wrong?) that this
should be very low under normal circumstances, since no real "bulk"
work is done; in particular, the *per-region* cost should be low.

If the cost of these operations *per region* ended up being predicted
to be > 40ms, I wonder whether that was due to swapping?

Additionally: as far as I can tell, the estimated "other" cost is based
on a history of the cost from previous GCs and is completely
independent of the particular region being evaluated.
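
To make that concrete, here is a minimal sketch of a purely
history-based predictor. The class name OtherTimePredictor and the
plain moving average are my own simplifications for illustration, not
the actual HotSpot code; the point is only that the prediction takes
no region as input.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: predicts "other" time from recent history only. */
final class OtherTimePredictor {
    private final Deque<Double> samplesMs = new ArrayDeque<>();
    private final int window;

    OtherTimePredictor(int window) {
        this.window = window;
    }

    /** Record the measured "other" time of a completed collection. */
    void record(double ms) {
        if (samplesMs.size() == window) {
            samplesMs.removeFirst(); // drop the oldest sample
        }
        samplesMs.addLast(ms);
    }

    /** Mean of the recorded samples; note there is no region parameter. */
    double predictMs() {
        if (samplesMs.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : samplesMs) {
            sum += s;
        }
        return sum / samplesMs.size();
    }
}
```

With a scheme like this, a few slow samples raise the prediction for
every region alike until newer samples push them out of the window.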

Anyways, I suspect you've already confirmed that the system is not
actively swapping at the time of the fallback to full GC. But here is
one low-confidence hypothesis (it would be really great to hear from
one of the GC devs whether it is even remotely plausible):

* At some point in time, there was swapping happening that affected GC
operations such that the work done to gather stats and select regions
was slow (this makes some sense, since that work should touch lots of
distinct regions, and it does not take many of those memory accesses
hitting swap to accumulate quite a bit of time).

* This screwed up the 'other' cost history and thus the prediction,
possibly for both young and non-young regions.

* I believe young collections would never be entirely prevented due to
pause time goals, so here the cost history and thus predictions would
always have time to recover and you would not notice any effect
looking at the behavior of the system down the line.

* Non-young "other" cost was so high that non-young regions were never
selected. This in turn meant that additional cost history for the
"other" category was never recorded, preventing recovery from the
temporary swap storm.

* The end result is that no non-young regions are ever collected, and
you end up falling back to full GC once the young collections have
"leaked" enough garbage.
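
The hypothesized feedback loop above can be sketched as a toy
simulation. Everything here is an assumption made for illustration:
the class StarvationSketch, the 3-sample moving average, the 1ms and
45ms costs, and the rule that non-young regions are skipped whenever
the predicted "other" cost exceeds the pause goal. It is not a model
of the real G1 policy code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the hypothesis: a skipped category never records new samples. */
final class StarvationSketch {
    static final double PAUSE_GOAL_MS = 20.0; // pause goal from the quoted mail
    static final int WINDOW = 3;              // assumed history length

    static double mean(Deque<Double> history) {
        if (history.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : history) {
            sum += s;
        }
        return sum / history.size();
    }

    static void record(Deque<Double> history, double ms) {
        if (history.size() == WINDOW) {
            history.removeFirst();
        }
        history.addLast(ms);
    }

    /** Returns {young prediction, non-young prediction, non-young collections}. */
    static double[] simulate(int cycles, int stormStart, int stormEnd) {
        Deque<Double> young = new ArrayDeque<>();
        Deque<Double> nonYoung = new ArrayDeque<>();
        int nonYoungCollections = 0;
        for (int c = 0; c < cycles; c++) {
            boolean storm = c >= stormStart && c < stormEnd;
            double actualMs = storm ? 45.0 : 1.0; // assumed costs
            // Young regions are always collected, so their history keeps updating.
            record(young, actualMs);
            // Non-young regions are only collected if the prediction fits the goal;
            // when they are skipped, no new sample is recorded.
            if (mean(nonYoung) <= PAUSE_GOAL_MS) {
                record(nonYoung, actualMs);
                nonYoungCollections++;
            }
        }
        return new double[] {mean(young), mean(nonYoung), nonYoungCollections};
    }
}
```

Calling StarvationSketch.simulate(50, 10, 13) models a three-cycle
swap storm: the young prediction drops back to the normal cost once
the storm samples age out, while the non-young prediction stays above
the pause goal for the rest of the run because nothing ever replaces
the bad samples.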

Thoughts, anyone?

-- 
/ Peter Schuller