G1GC Full GCs

Todd Lipcon todd at cloudera.com
Fri Jan 21 11:38:52 PST 2011


Hey folks,

Took some time over the last day or two to follow up on this on the latest
checkout of JDK7. I added some more instrumentation and my findings so far
are:

1) CMS is definitely hitting a fragmentation problem. Our workload is pretty
much guaranteed to fragment, and I don't think there's anything CMS can do
about it - see the following graphs:
http://people.apache.org/~todd/hbase-fragmentation/

2) G1GC is hitting full GC pauses because the "other" pause time predictions
end up higher than the pause time goal. I'm seeing the following sequence:

- A single choose_cset operation for a non_young region takes a long time
(unclear yet why this is happening, see below)
- This inflates the predict_non_young_other_time_ms(1) result to a value
greater than my pause goal
- From then on, it doesn't collect any more non-young regions (ever!)
because any region will be considered expensive regardless of the estimated
rset or collection costs
- The heap slowly fills up with non-young regions until we reach a full GC
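
To make that feedback loop concrete, here's a toy model of the cutoff
(hypothetical names and made-up numbers, not the actual HotSpot code). Once
the "other time" average exceeds the pause goal, the loop below can never
pick a non-young region again:

  #include <cstdio>
  #include <vector>

  struct Region { double rs_scan_ms, copy_ms; };

  // decaying average, inflated by a single ~199ms choose_cset sample
  double other_time_avg_ms = 50.0;

  double predict_region_ms(const Region& r) {
    return r.rs_scan_ms + r.copy_ms + other_time_avg_ms;
  }

  int main() {
    const double pause_goal_ms = 20.0;
    std::vector<Region> marked(10, Region{5.0, 5.0});  // cheap regions
    double remaining_ms = pause_goal_ms;
    int picked = 0;
    for (const Region& r : marked) {
      double p = predict_region_ms(r);
      if (p > remaining_ms) break;  // "next region too expensive"
      remaining_ms -= p;
      ++picked;
    }
    // with other_time_avg_ms > pause_goal_ms this prints 0, forever
    std::printf("picked %d of %zu non-young regions\n", picked, marked.size());
    return 0;
  }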

3) So the question is why the choose_cset is taking a long time. I added
getrusage() calls to wrap the choose_cset operation. Here's some output with
extra logging:

--> About to choose cset at   725.458
Adding 1 young regions to the CSet
  Added 0x0000000000000001 Young Regions to CS.
    (3596288 KB left in heap.)
    (picked region; 9.948053ms predicted; 21.164738ms remaining; 2448kb
marked; 2448kb maxlive; 59-59% liveness)
    (3593839 KB left in heap.)
predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan (14528
cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
    (picked region; 10.493828ms predicted; 11.216685ms remaining; 2279kb
marked; 2279kb maxlive; 55-55% liveness)
predict_region_elapsed_time_ms: 10.493828ms total, 5.486072ms rs scan (14528
cnum), 4.828102 copy time (2333800 bytes), 0.179654 other time
    (3591560 KB left in heap.)
predict_region_elapsed_time_ms: 10.346346ms total, 5.119780ms rs scan (13558
cnum), 5.046912 copy time (2439568 bytes), 0.179654 other time
predict_region_elapsed_time_ms: 10.407672ms total, 5.333135ms rs scan (14123
cnum), 4.894882 copy time (2366080 bytes), 0.179654 other time
    (no more marked regions; next region too expensive (adaptive; predicted
10.407672ms > remaining 0.722857ms))
Resource usage of choose_cset:majflt: 0 nswap: 0  nvcsw: 6  nivcsw: 0
--> About to prepare RS scan at   725.657

The resource usage line with nvcsw=6 indicates there were 6 voluntary
context switches while choosing the cset. This choose_cset operation took
198.9ms in total, all of it in choosing non-young regions.

So, why are there voluntary context switches while choosing the cset? This
isn't swapping -- that should show up under majflt, right? My only theories
are:
- are any locks acquired in choose_cset?
- maybe the GC logging itself is blocking on IO to the log file? i.e. the
instrumentation itself is interfering with the algorithm?
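
(For reference, the wrapping is conceptually just a getrusage() diff around
the phase, roughly like the sketch below. RUSAGE_THREAD is assumed here,
which is Linux-specific; the actual patch differs in details:)

  #include <sys/resource.h>
  #include <cstdio>

  struct PhaseUsage {
    rusage start;
    void begin() { getrusage(RUSAGE_THREAD, &start); }
    void end(const char* phase) {
      rusage now;
      getrusage(RUSAGE_THREAD, &now);
      std::printf("Resource usage of %s: majflt: %ld nswap: %ld "
                  "nvcsw: %ld nivcsw: %ld\n", phase,
                  now.ru_majflt - start.ru_majflt,
                  now.ru_nswap  - start.ru_nswap,
                  now.ru_nvcsw  - start.ru_nvcsw,
                  now.ru_nivcsw - start.ru_nivcsw);
    }
  };

  int main() {
    PhaseUsage u;
    u.begin();
    // ... phase under measurement (e.g. choose_cset) runs here ...
    u.end("choose_cset");
    return 0;
  }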


Regardless, I think a single lengthy choose_non_young_cset operation
shouldn't be allowed to push the prediction above the time boundary and
trigger this issue. Perhaps a simple workaround: whenever a collection
chooses no non_young regions, it should contribute a value of 0 to the
average?
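
As a toy illustration (made-up decay weight; the real predictor in HotSpot
is a truncated/decaying sequence average that behaves differently), feeding
in zero-samples lets the prediction recover from a single 199ms outlier:

  #include <cstdio>

  struct DecayingAvg {
    double avg = 0.0;
    void add(double v) {
      const double alpha = 0.3;  // weight of the newest sample (made up)
      avg = alpha * v + (1.0 - alpha) * avg;
    }
  };

  int main() {
    DecayingAvg other_ms;
    other_ms.add(199.0);                 // the one slow choose_cset
    for (int pause = 1; pause <= 10; ++pause) {
      other_ms.add(0.0);                 // proposed: no non-young chosen -> 0
      std::printf("pause %d: predicted other time %.2f ms\n",
                  pause, other_ms.avg);  // drops below a 20ms goal in ~7 pauses
    }
    return 0;
  }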

I'll give this heuristic a try on my build and see if it solves the issue.

-Todd

On Tue, Jul 27, 2010 at 3:08 PM, Todd Lipcon <todd at cloudera.com> wrote:

> Hi all,
>
> Back from my vacation and took some time yesterday and today to build a
> fresh JDK 7 with some additional debug printouts from Peter's patch.
>
> What I found was a bit different - the rset scanning estimates are low, but
> I am consistently seeing "Other time" estimates in the >40ms range. Given my
> pause time goal of 20ms, I think these estimates are excluding most of the
> regions from collectability. I haven't been able to dig around yet to figure
> out where the long estimate for "other time" is coming from - in the
> collections logged it sometimes shows fairly high "Other" but the "Choose
> CSet" component is very short. I'll try to add some more debug info to the
> verbose logging and rerun some tests over the next couple of days.
>
> At the moment I'm giving the JRockit VM a try to see how its deterministic
> GC stacks up against G1 and CMS.
>
> Thanks
> -Todd
>
>
> On Tue, Jul 13, 2010 at 5:15 PM, Peter Schuller <
> peter.schuller at infidyne.com> wrote:
>
>> Ramki/Tony,
>>
>> > Any chance of setting -XX:+PrintHeapAtGC and -XX:+PrintHeapAtGCExtended
>> > and sending us the log, or part of it (say between two Full GCs)? Be
>> > prepared: this will generate piles of output. But it will give us
>> > per-region information that might shed more light on the cause of the
>> > issue.... thanks,
>>
>> So what I have in terms of data is (see footnotes for URL references in
>> []):
>>
>> (a) A patch[1] that prints some additional information about estimated
>> costs of region eviction, and disables the GC efficiency check that
>> normally terminates selection of regions. (Note: This is a throw-away
>> patch for debugging; it's not intended as a suggested change for
>> inclusion.)
>>
>> (b) A log[2] showing the output of a test run I did just now, with
>> both your flags above and my patch enabled (but without disabling the
>> efficiency check). It shows fallback to full GC when the actual live
>> set size is 252 MB, and the maximum heap size is 2 GB (in other words,
>> ~ 12% liveness). An easy way to find the point of full gc is to search
>> for the string 'full 1'.
>>
>> (c) A file[3] with the effective VM options during the test.
>>
>> (d) Instructions for how to run the test to reproduce it (I'll get to
>> that at the end; it's simplified relative to previously).
>>
>> (e) Nature of the test.
>>
>> Discussion:
>>
>> With respect to region information: I originally tried it in response
>> to your recommendation earlier, but I found I did not see the
>> information I was after. Perhaps I was just misreading it, but I
>> mostly just saw either 0% or 100% fullness, and never the actual
>> liveness estimate as produced by the mark phase. In the log I am
>> referring to in this email, you can see that the last printout of
>> region information just before the full GC fits this pattern; I just
>> don't see anything that looks like legitimate liveness information
>> being printed. (I don't have time to dig back into it right now to
>> double-check what it's printing.)
>>
>> If you scroll up from the point of the full gc until you find a bunch
>> of output starting with "predict_region_elapsed_time_ms" you see some
>> output resulting from the patch, with pretty extreme values such as:
>>
>> predict_region_elapsed_time_ms: 34.378642ms total, 34.021154ms rs scan
>> (46542 cnum), 0.040069 copy time (20704 bytes), 0.317419 other time
>> predict_region_elapsed_time_ms: 45.020866ms total, 44.653222ms rs scan
>> (61087 cnum), 0.050225 copy time (25952 bytes), 0.317419 other time
>> predict_region_elapsed_time_ms: 16.250033ms total, 15.887065ms rs scan
>> (21734 cnum), 0.045550 copy time (23536 bytes), 0.317419 other time
>> predict_region_elapsed_time_ms: 226.942877ms total, 226.559163ms rs
>> scan (309940 cnum), 0.066296 copy time (34256 bytes), 0.317419 other
>> time
>> predict_region_elapsed_time_ms: 542.344828ms total, 541.954750ms rs
>> scan (741411 cnum), 0.072659 copy time (37544 bytes), 0.317419 other
>> time
>> predict_region_elapsed_time_ms: 668.595597ms total, 668.220877ms rs
>> scan (914147 cnum), 0.057301 copy time (29608 bytes), 0.317419 other
>> time
>>
>> So in the most extreme case in the excerpt above, that's > half a
>> second of estimate rset scanning time for a single region with 914147
>> cards to be scanned. While not all are that extreme, lots and lots of
>> regions are very expensive and almost only due to rset scanning costs.
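>>
>> As a quick sanity check on those numbers (my arithmetic, not from the
>> log): every one of those rs scan estimates divides by its card count to
>> the same ~0.73us/card, i.e. the prediction is purely linear in cnum with
>> a single fitted per-card cost:
>>
>>   #include <cstdio>
>>
>>   int main() {
>>     const double ms[]    = {34.021154, 44.653222, 15.887065,
>>                             226.559163, 541.954750, 668.220877};
>>     const double cards[] = {46542, 61087, 21734,
>>                             309940, 741411, 914147};
>>     for (int i = 0; i < 6; ++i)   // all print ~0.000731 ms/card
>>       std::printf("%.0f cards -> %.6f ms/card\n",
>>                   cards[i], ms[i] / cards[i]);
>>     return 0;
>>   }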
>>
>> If you scroll down a bit to the first (and ONLY) partial that happened
>> after the statistics accumulated from the marking phase, we see more
>> output resulting from the patch. At the end, we see:
>>
>>    (picked region; 0.345890ms predicted; 1.317244ms remaining; 15kb
>> marked; 15kb maxlive; 1-1% liveness)
>>    (393380 KB left in heap.)
>>    (picked region; 0.345963ms predicted; 0.971354ms remaining; 15kb
>> marked; 15kb maxlive; 1-1% liveness)
>>    (393365 KB left in heap.)
>>    (picked region; 0.346036ms predicted; 0.625391ms remaining; 15kb
>> marked; 15kb maxlive; 1-1% liveness)
>>    (393349 KB left in heap.)
>>    (no more marked regions; next region too expensive (adaptive;
>> predicted 0.346036ms > remaining 0.279355ms))
>>
>> So in other words, it picked a bunch of regions in order of "lowest
>> hanging fruit". The *least* low-hanging fruit picked still had
>> liveness at only 1%; in other words, there are plenty of further
>> regions that ideally should be collected because they contain almost
>> nothing but garbage (ignoring the cost of collecting them).
>>
>> In this case, it stopped picking regions because the next region to be
>> picked, though cheap, was the straw that broke the camel's back and we
>> simply exceeded the allotted time for this particular GC.
>>
>> However, after this partial completes, it reverts back to doing just
>> young GCs. In other words, even though there are *plenty* of regions
>> with very low liveness, further partials aren't happening.
>>
>> By applying this part of the patch:
>>
>>  -           (adaptive_young_list_length() &&
>>  +           (adaptive_young_list_length() && false && // scodetodo
>>
>> I artificially force G1 not to fall back to doing only young GCs for
>> efficiency reasons. When I run with that change, I don't experience
>> the slow perpetual growth until fallback to full GC. If I remember
>> correctly though, the rset scanning cost is in fact high, but I don't
>> have details saved and I'm afraid I don't have time to re-run those
>> tests right now and compare numbers.
>>
>> Reproducing it:
>>
>> I made some changes and the test case should now hopefully be easy to
>> run assuming you have maven installed. The github project is at:
>>
>>   http://github.com/scode/httpgctest
>>
>> There is a README, but the shortest possible instructions to
>> re-produce the test that I did:
>>
>>  git clone git://github.com/scode/httpgctest.git
>>  cd httpgctest
>>  git checkout 20100714_1       # grab from appropriate tag, in case I
>> change master
>>  mvn package
>>  HTTPGCTEST_LOGGC=gc.log ./run.sh
>>
>> That should start the http server; then run concurrently:
>>
>>  while [ 1 ] ; do curl 'http://localhost:9191/dropdata?ratio=0.10' ;
>> curl 'http://localhost:9191/gendata?amount=25000' ; sleep 0.1  ; done
>>
>> And then just wait and observe.
>>
>> Nature of the test:
>>
>> So the test, if run as above, will essentially reach a steady-state
>> equilibrium with about 25000 pieces of data in a Clojure immutable
>> map. The result is that a significant amount of new data is being
>> allocated, but very little writing to old regions is happening. The
>> garbage generated is very well spread out over the entire heap because
>> each iteration goes through all objects, drops 10% of them (the
>> ratio=0.10), and then adds 25000 new items.
>>
>> In other words: not a lot of old gen writing, but lots of writes to
>> the young gen referencing objects in the old gen.
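>>
>> The per-iteration churn looks roughly like this C++ analogue (just to
>> show the shape of the workload; the real test is the Clojure immutable
>> map behind the HTTP handlers above, so object lifetimes differ):
>>
>>   #include <cstdlib>
>>   #include <map>
>>   #include <string>
>>
>>   int main() {
>>     std::map<long, std::string> data;
>>     long next_key = 0;
>>     for (;;) {                          // runs forever, like the curl loop
>>       // dropdata?ratio=0.10: drop ~10% of items, spread across the heap
>>       for (auto it = data.begin(); it != data.end(); )
>>         if (std::rand() % 10 == 0) it = data.erase(it);
>>         else ++it;
>>       // gendata?amount=25000: allocate a batch of new items
>>       for (int i = 0; i < 25000; ++i)
>>         data[next_key++] = std::string(100, 'x');
>>     }
>>   }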
>>
>> [1] http://distfiles.scode.org/mlref/g1/g1_region_live_stats_hack.patch
>> [2]
>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/gc-fullfallback.log
>> [3]
>> http://distfiles.scode.org/mlref/gctest/httpgctest-g1-fullgc-20100714/vmoptions.txt
>>
>> --
>> / Peter Schuller
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera