g1: remembered set scan costs very high w/ LRU cache type of behavior

Peter Schuller peter.schuller at infidyne.com
Sat Feb 27 10:48:10 PST 2010


Hello,

I have a very simple test case specifically designed to stress GCs
(while being reasonably representative of some worst-case but
plausible production use cases):

   http://github.com/scode/plrutest/blob/master/src/org/scode/lrutest/LruTest.java

It implements a very simple LRU cache using a LinkedHashMap. It
inserts entries into this hash map at a pre-defined target rate
(100 000 per second). Each insert is keyed on a random integer
between 0 and CACHE_SIZE * 2; thus, when the cache reaches steady
state, 50% of iterations are expected to generate old garbage.
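
For readers who don't want to follow the link, the core of the test
is roughly along these lines (a simplified sketch only; the class
name and constants here are made up, and the actual LruTest.java in
the repository, with its rate limiting and reporting, is what I ran):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Random;

    public class LruSketch {
        // Illustrative numbers only; see LruTest.java for the real ones.
        private static final int CACHE_SIZE = 10000000;
        private static final int TARGET_RATE = 100000; // inserts/second

        // Insertion-ordered LinkedHashMap; evict the eldest entry once
        // the map grows beyond CACHE_SIZE.
        private static final Map<Integer, String> cache =
            new LinkedHashMap<Integer, String>() {
                @Override
                protected boolean removeEldestEntry(
                        Map.Entry<Integer, String> eldest) {
                    return size() > CACHE_SIZE;
                }
            };

        public static void main(String[] args) throws InterruptedException {
            Random rnd = new Random();
            while (true) {
                long start = System.currentTimeMillis();
                for (int i = 0; i < TARGET_RATE; i++) {
                    // Key space is twice the cache size, so at steady
                    // state roughly half the inserts displace an
                    // existing (old) entry.
                    cache.put(rnd.nextInt(CACHE_SIZE * 2), "value" + i);
                }
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed < 1000) {
                    // Crude pacing toward the target insert rate.
                    Thread.sleep(1000 - elapsed);
                }
            }
        }
    }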

It is completely non-generational, and the intent was to see how well
the concurrent marking and partially young evacuations handle large
heaps.

I have run it with various parameters, but the ones relevant to this post are:

    -verbose:gc \
    -Xms20g \
    -Xmx20g \
    -XX:+PrintGCTimeStamps \
    -XX:+PrintGCDetails \
    -XX:+UnlockExperimentalVMOptions \
    -XX:+UseG1GC \
    -XX:MaxGCPauseMillis=10 \
    -XX:GCPauseIntervalMillis=20 \
    -XX:G1ConfidencePercent=100 \
    -XX:+G1ParallelRSetUpdatingEnabled \
    -XX:+G1ParallelRSetScanningEnabled \
    -XX:+DisableExplicitGC \

It's being run on a 16-core, 64-bit Linux machine with 36 GB of RAM.

(I know there is a bug that affects parallel rset
scanning/updating; for the purpose of this test I don't care about
that. The observed behavior of very high RS scan costs happens with
non-parallel RS scanning as well, only the numbers are higher since
it will be running on one core instead of 16.)

However, I run into trouble with young evacuations before even
reaching the point of generating actual garbage (i.e., before the
cache is full). Up to roughly 1 GB of heap used (as reported by GC
output), pauses are fairly reasonable, with about 10 ms being spent
in the RS scan phase (with parallel RS scanning enabled). At roughly
1 GB heap size, the RS scan cost as reported by the GC output
suddenly starts steadily increasing until it reaches at least
hundreds of milliseconds with no sign of stopping. Here is some
sample output at ~1.7 GB heap size:

103.128: [GC pause (young), 0.19716900 secs]
   [Parallel Time: 193.8 ms]
      [Update RS (Start) (ms):  103128.7  103128.7  103128.7  103128.7  103128.8  103128.8  103128.8  103128.9  103128.9  103128.9  103129.0  103129.0  103129.1]
      [Update RS (ms):  1.1  1.2  1.0  0.8  1.2  1.2  1.1  1.0  1.1  1.1  0.9  0.1  0.0
       Avg:   0.9, Min:   0.0, Max:   1.2]
         [Processed Buffers : 1 1 1 1 1 1 1 1 1 1 1 1 0
          Sum: 12, Avg: 0, Min: 0, Max: 1]
      [Ext Root Scanning (ms):  0.2  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.2]
      [Mark Stack Scanning (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
      [Scan-Only Scanning (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
         [Scan-Only Regions : 0 0 0 0 0 0 0 0 0 0 0 0 0
          Sum: 0, Avg: 0, Min: 0, Max: 0]
      [Scan RS (ms):  178.8  178.8  179.1  179.4  178.7  179.2  178.9  178.8  179.2  179.2  178.3  179.9  179.6
       Avg: 179.1, Min: 178.3, Max: 179.9]
      [Object Copy (ms):  10.6  10.5  10.3  10.2  10.6  10.0  10.3  10.4  9.9  9.9  10.9  10.2  10.5
       Avg:  10.3, Min:   9.9, Max:  10.9]
      [Termination (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
      [Other:   3.4 ms]
   [Clear CT:   1.7 ms]
   [Other:   1.7 ms]
   [ 1722M->1720M(20480M)]
 [Times: user=2.44 sys=0.06, real=0.19 secs]

Except for the RS scan bit, it looks reasonable. Object copy seems to be
landing very nicely at around my asked-for 10 ms. RS updating is
limited as would be expected given that most writes to the cache would
likely happen to old objects. Root/stack scanning is insignificant
since it is a small program with a single thread.

But RS scan is apparently very expensive. While I can see how RS
scanning would be a significant factor given that all objects created
in the young generation (barring any that are internal to the
LinkedHashMap) will be referenced by some pseudo-random object in
non-young regions (internal LinkedHashMap structure), I would expect
the cost of RS scanning per young evacuation to scale with the size
of the young generation. Beyond the edge case of a very, very small
hash map, I expect the distribution of references into the young
generation to be fairly widely spread across the internal structure
of the hash map.
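
To make "fairly widely spread" a bit more concrete, here is some
napkin math of my own (an estimate, not a measurement) for how many
distinct cards the inserts between two young pauses could be expected
to dirty, assuming roughly 1 GB of old-region map structure, the
512-byte card size HotSpot uses, and effectively random write
locations:

    public class CardSpreadEstimate {
        public static void main(String[] args) {
            // Assumed, not measured: ~1 GB of old-region map structure,
            // 512-byte cards, ~100 000 inserts between two young pauses.
            double oldBytes = 1024.0 * 1024 * 1024;
            double cards = oldBytes / 512;   // ~2 million cards
            double writes = 100000;

            // Balls-into-bins: expected number of distinct cards hit by
            // 'writes' uniformly random writes over 'cards' cards.
            double distinct = cards * (1 - Math.pow(1 - 1 / cards, writes));
            System.out.printf("~%.0f distinct dirty cards out of %.0f%n",
                              distinct, cards);
            // With these numbers almost every write lands on its own
            // card, so per-pause RS scan work should track the amount of
            // recent mutation rather than the total heap size.
        }
    }

(Obviously the uniformity assumption is the crux here; possibility
(1) below is exactly the case where it would not hold.)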

I can see some possibilities:

(1) Maybe the LinkedHashMap happens to be structured such that there
is significant locality to the object references put into it, such
that up to the 1 GB heap size the number of distinct dirty cards is
fairly low. The increase in RS scan time I am seeing may simply be
the cost converging on an eventual very high cost resulting from poor
locality of writes to old regions. In this case, it may be that the
remembered sets simply do not scale well enough, period, with many
well-spread writes to older generations pointing to young data.

(2) Maybe I am right in my thinking that writes to older regions
would already be well spread long before the 1 GB heap size, and
there is something wrong with the RS scanning itself that causes
costs to go up much more than they "should" given the workload.

Is there a better explanation?

In the case of (1), it seems to me that what I really want is to run
G1 in non-generational mode, since I would then expect it to evacuate
useful regions that do in fact free data (thanks to the statistics
gathered by the concurrent marking). In this case the cost of RS
scanning, even if it may be high, would at least be a trade-off
against actual progress in terms of freeing heap. Right now, all this
RS scan work is being spent on not making any progress.

(What are the reasons for non-generational mode no longer being
supported? Enabling it asserts later on, so it is clearly not by
accident. Is it because of the optimization of not doing remembered
set writes for writes in the young generation, so that it is not
just a switch to turn off? Are there other problems with running it
non-generationally?)

In the case of (2) perhaps this indicates a fixable problem with the GC.

As an aside, the GC behavior during young generation evacuation is,
as expected, a lot better when I run it against an immutable LRU
cache:

   http://github.com/scode/plrutest/blob/master/src/org/scode/plrutest/clojure/lrucache.clj

(That is just the cache, not a runnable test.)

-- 
/ Peter Schuller

