g1: dealing with high rates of inter-region pointer writes

Peter Schuller peter.schuller at infidyne.com
Thu Jan 6 14:55:49 UTC 2011


> It is http://github.com/scode/httpgctest - but before anyone spends
> time on this let me come up with a different test case that is better
> (an actual LRU, and for both immutable and non-immutable cases). I'll
> try to do that in time for next week.

I now have an LRU test case which is entirely realistic, and which can
be observed causing a fallback to full GC on a 2 GB heap with a live
set size of 350 MB (according to the results of a full GC). This
happens in spite of the mutator being severely rate limited and only
saturating 10-20% of a single core on this machine. The test was
performed with a recent OpenJDK 7 build and with G1 pause
time/interval goals of 50/75 ms.

The test shows very little partial collection activity, but lots of
concurrent marking, to the point of wasting up to 10x the CPU of the
mutator.

I originally wanted to do something more elaborate (comparing the
immutable LRU with a regular mutable Java-style LRU, producing some
graphs, etc.), but I realized I was not going to get to it in a
reasonable amount of time, so this will have to do for now. I did run
this test briefly with a VM that prints sparse table entry overflows,
and they are indeed happening a lot. I did not, for this particular
test, look specifically at liveness information and RS scan costs for
individual regions.

Assuming git and maven are available, the easiest way to run the test is:

  # Get plru, which is a dependency not available in public repositories
  git clone git://github.com/scode/plru.git
  (cd plru && mvn install)

  # Build/run the test itself
  git clone git://github.com/scode/lrugctest.git
  cd lrugctest
  mvn compile
  mvn clojure:run

However, for practical purposes there is a wrapper script 'run.sh'
which runs it with appropriate VM options depending on the chosen
collector. Here is the command line I used for G1; substitute
accordingly:

   GCTEST_COLLECTOR=g1 JAVA_HOME=/usr/local/openjdk7 ./run.sh

The test itself is just a rate-limited loop that throws data into an
LRU with pseudo-random keys, simulating a 50% cache hit rate (on a
full cache). While it is running, it also listens for HTTP requests on
port 9191, allowing the LRU size and the rate limit to be changed on
the running test without restarting (see the README for example curl
commands).
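
To make the shape of the test concrete, here is a rough Java
equivalent of that loop (the real test is written in Clojure on top of
the plru library; the class name, value size and rate numbers below
are made up for illustration, and a mutable LinkedHashMap stands in
for the persistent LRU). With the key space set to twice the cache
capacity, a full cache sees roughly a 50% hit rate:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.Random;

  // Rough stand-in for the real (Clojure) driver: a rate-limited loop
  // that inserts pseudo-random keys into a bounded LRU. Sizes and rates
  // are illustrative, not the values lrugctest actually uses.
  public class LruDriverSketch {
      public static void main(String[] args) throws InterruptedException {
          final int capacity = 1000000;        // max entries in the cache
          final int keySpace = capacity * 2;   // 2x capacity => ~50% hit rate when full
          final int opsPerTick = 10000;        // crude rate limit per 100 ms tick

          // Access-ordered LinkedHashMap standing in for the persistent LRU.
          final Map<Integer, byte[]> lru =
              new LinkedHashMap<Integer, byte[]>(capacity, 0.75f, true) {
                  @Override
                  protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                      return size() > capacity;
                  }
              };

          final Random rnd = new Random();
          while (true) {
              for (int i = 0; i < opsPerTick; i++) {
                  int key = rnd.nextInt(keySpace);
                  if (lru.get(key) == null) {          // miss: allocate and insert
                      lru.put(key, new byte[256]);
                  }                                    // hit: get() already touched the entry
              }
              Thread.sleep(100);                       // keep the mutator mostly idle
          }
      }
  }

The only thing that matters here is the access pattern: a large,
long-lived cache whose contents are continuously replaced while the
mutator itself stays far below a single core.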

Here is a subset of the output of my test run (the end of it); notice
the full GC at the end and the lack of significant partial collection
following the preceding concurrent mark:

382.372: [GC concurrent-mark-end, 2.9801090 sec]
382.372: [GC remark, 0.0007910 secs]
382.373: [GC concurrent-count-start]
1000000 entries, 100.0% full
382.854: [GC concurrent-count-end, 0.4811190]
382.854: [GC cleanup 1913M->1913M(2048M), 0.1419250 secs]
382.996: [GC concurrent-cleanup-start]
382.996: [GC concurrent-cleanup-end, 0.0001040]
383.302: [GC pause (young) 1945M->1855M(2048M), 0.0295900 secs]
383.360: [GC pause (partial) 1858M->1853M(2048M), 0.0255510 secs]
1000000 entries, 100.0% full
383.951: [GC pause (young) 1901M->1859M(2048M), 0.0134920 secs]
384.793: [GC pause (young) (initial-mark) 1953M->1868M(2048M)384.822:
[GC concurrent-mark-start]
, 0.0289590 secs]
1000000 entries, 100.0% full
385.222: [GC pause (young) 1913M->1873M(2048M), 0.0428580 secs]
1000000 entries, 100.0% full
386.087: [GC pause (young) 1960M->1881M(2048M), 0.0609920 secs]
1000000 entries, 100.0% full
386.908: [GC pause (young) 1964M->1888M(2048M), 0.0733250 secs]
387.102: [GC concurrent-mark-end, 2.2799430 sec]
387.102: [GC remark, 0.0007420 secs]
387.103: [GC concurrent-count-start]
387.584: [GC concurrent-count-end, 0.4815970]
387.584: [GC cleanup 1955M->1955M(2048M), 0.1430320 secs]
387.727: [GC concurrent-cleanup-start]
387.728: [GC concurrent-cleanup-end, 0.0000620]
387.840: [GC pause (young) 1967M->1895M(2048M), 0.0287990 secs]
1000000 entries, 100.0% full
387.916: [GC pause (partial) 1898M->1890M(2048M), 0.0295090 secs]
388.770: [GC pause (young) 1968M->1898M(2048M), 0.0215940 secs]
1000000 entries, 100.0% full
389.441: [GC pause (young) (initial-mark) 1972M->1905M(2048M)389.465:
[GC concurrent-mark-start]
, 0.0244800 secs]
1000000 entries, 100.0% full
1000000 entries, 100.0% full
391.638: [GC concurrent-mark-end, 2.1725060 sec]
391.684: [GC remark, 0.0151950 secs]
391.700: [GC concurrent-count-start]
1000000 entries, 100.0% full
392.206: [GC pause (young)-- 2046M->2047M(2048M), 0.4325950 secs]
392.642: [Full GC 2047M->355M(2048M), 3.0683440 secs]

I think the test is interesting because it's quite realistic:

* Pause time goal of 50 ms over 75 ms. The 50/75 ratio is not extreme
at all, and 50 ms in absolute terms is not extreme either,
particularly given how limited the mutator activity is.

* It's an LRU cache. This is a frequent use-case.

Again though, I realize the persistent data structure version of the
LRU is uncommon. I've observed and reported high RS scan times before
with a regular LinkedHashMap, but that was before I had dug into
things as deeply as I have now, so I only had PrintGCDetails output
and no information on liveness or sparse table entry overflows (though
I strongly suspect this was the cause there too).

I am still interested in simply hearing opinions on whether or not
this is unexpected. As far as I can tell from how G1 works, it *seems*
expected that lots of inter-region writes would be problematic, in
that they massively inflate the conservative view of the remembered
set. And given the cost of RS scanning and the way regions are picked
for collection, it seems expected that the result is an inability to
collect these regions.
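
To spell out where those inter-region writes come from in the mutable
LRU case, here is a stripped-down Java sketch of the pointer updates a
typical doubly-linked LRU performs on each insert and eviction
(generic bookkeeping for illustration, not plru's or LinkedHashMap's
actual code). Once the cache has been around for a while, the list
object and the existing nodes are old while the inserted node is
freshly allocated, so the annotated writes are reference updates in
long-lived objects pointing at objects in other regions:

  // Generic doubly-linked LRU bookkeeping, for illustration only. In a
  // cache of this size, 'head', 'tail' and the existing nodes are
  // long-lived, while 'fresh' was just allocated.
  class Node<K, V> {
      K key;
      V value;
      Node<K, V> prev;
      Node<K, V> next;
  }

  class LruList<K, V> {
      private Node<K, V> head;   // most recently used
      private Node<K, V> tail;   // least recently used

      // Insert a freshly allocated node at the head of the list.
      void addFirst(Node<K, V> fresh) {
          fresh.next = head;     // write into the new node itself
          fresh.prev = null;
          if (head != null) {
              head.prev = fresh; // old node now references a freshly allocated one
          }
          head = fresh;          // long-lived list object references the new node
          if (tail == null) {
              tail = fresh;
          }
      }

      // Drop the least recently used node when the cache is over capacity.
      Node<K, V> removeLast() {
          Node<K, V> last = tail;
          if (last == null) {
              return null;
          }
          tail = last.prev;      // another reference update in the long-lived list
          if (tail != null) {
              tail.next = null;
          } else {
              head = null;
          }
          return last;
      }
  }

Every cache operation performs a handful of such writes, and with a
million entries spread across many regions, most of them will cross
region boundaries and have to be reflected in remembered sets.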

-- 
/ Peter Schuller


