G1GC Full GCs

Tue Jul 13 17:15:43 PDT 2010

Ramki/Tony,

> Any chance of setting +PrintHeapAtGC and -XX:+PrintHeapAtGCExtended and
> sending us the log, or part of it (say between two Full GCs)? Be prepared:
> this will generate piles of output. But it will give us per-region
> information that might shed more light on the cause of the issue.... thanks,

So what I have in terms of data is (see footnotes for urls references in []):

(a) A patch[1] that prints some additional information about estimated
costs of region eviction, and disables the GC efficiency check that
normally terminates selection of regions. (Note: This is a throw-away
patch for debugging; it's not intended as a suggested change for
inclusion.)

(b) A log[2] showing the output of a test run I did just now, with
both your flags above and my patch enabled (but without disabling the
efficiency check). It shows fallback to full GC when the actual live
set size is 252 MB, and the maximum heap size is 2 GB (in other words,
~ 12% liveness). An easy way to find the point of full gc is to search
for the string 'full 1'.

(c) A file[3] with the effective VM options during the test.

(d) Instructions for how to run the test to reproduce it (I'll get to
that at the end; it's simplified relative to previously).

(e) Nature of the test.

Discussion:

WIth respect to region information: I originally tried it in response
to your recommendation earlier, but I found I did not see the
information I was after. Perhaps I was just misreading it, but I
mostly just saw either 0% or 100% fullness, and never the actual
liveness estimate as produced by the mark phase. In the log I am
referring to in this E-Mail, you can see that the last printout of
region information just before the live GC fits this pattern; I just
don't see anything that looks like legitimate liveness information
being printed. (I don't have time to dig back into it right now to
double-check what it's printing.)

If you scroll up from the point of the full gc until you find a bunch
of output starting with "predict_region_elapsed_time_ms" you see some
output resulting from the patch, with pretty extreme values such as:

predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
(46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
(61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
(21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
time
predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
time

So in the most extreme case in the excerpt above, that's > half a
second of estimate rset scanning time for a single region with 914147
cards to be scanned. While not all are that extreme, lots and lots of
regions are very expensive and almost only due to rset scanning costs.

If you scroll down a bit to the first (and ONLY) partial that happened
after the statistics accumulating from the marking phase, we see more
output resulting form the patch. At the end, we see:

    (picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
    (393380 KB left in heap.)
    (picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
    (393365 KB left in heap.)
    (picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
marked; 15kb maxlive; 1-1% liveness)
    (393349 KB left in heap.)
    (no more marked regions; next region too expensive (adaptive;
predicted 0.346036ms > remaining 0.279355ms))

So in other words, it picked a bunch of regions in order of "lowest
hanging fruit". The *least* low hanging fruit picked still had
liveness at 1%; in other words, there's plenty of further regions that
ideally should be collected because they contain almost no garbage
(ignoring the cost of collecting them).

In this case, it stopped picking regions because the next region to be
picked, though cheap, was the straw that broke the camel's neck and we
simply exceeded the alloted time for this particular GC.

However, after this partial completes, it reverts back to doing just
young gc:s. In other words, even though there's *plenty* of regions
with very low liveness, further partials aren't happening.

By applying this part of the patch:

  -           (adaptive_young_list_length() &&
 +           (adaptive_young_list_length() && false && // scodetodo

I artificially force g1 to not fall back to doing young gc:s for
efficiency reasons. When I run with that change, I don't experience
the slow perpetual growth until fallback to full GC. If I remember
correctly though, the rset scanning cost is in fact high, but I don't
have details saved and I'm afraid I don't have time to re-run those
tests right now and compare numbers.

Reproducing it:

I made some changes and the test case should now hopefully be easy to
run assuming you have maven installed. The github project is at:

   http://github.com/scode/httpgctest

There is a README, but the shortest possible instructions to
re-produce the test that I did:

  git clone git://github.com/scode/httpgctest.git
  cd httpgctest.git
  git checkout 20100714_1       # grab from appropriate tag, in case I
change master
  mvn package
  HTTPGCTEST_LOGGC=gc.log ./run.sh

That should start the http server; then run concurrently:

  while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1  ; done

And then just wait and observe.

Nature of the test:

So the test if run as above will essentially reach a steady state of
equilibrium with about 25000 pieces of data in a clojure immutable
map. The result is that a significant amount of new data is being
allocated, but very little writing to old regions is happening. The
garbage generated is very well spread out over the entire heap because
it goes through all objects and drops 10% (the ratio=0.10) for each
iteration, after which it adds 25000 new items.

In other words; not a lot of old gen writing, but lots of writes to
the young gen referencing objects in the old gen.

[1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
[2] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
[3] http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt

-- 
/ Peter Schuller