g1: remembered set scan costs very high w/ LRU cache type of behavior
Jon Masamitsu
Jon.Masamitsu at Sun.COM
Sat Feb 27 11:13:23 PST 2010
Peter,
Which version of the jdk are you using? A jdk 6.0 update? Or
jdk 7.0 build?
Thanks.
Jon
Peter Schuller wrote on 02/27/10 10:48:
>Hello,
>
>I have a very simple test case specifically designed to stress GCs
>(while being reasonably representative of some worst-case but
>plausible production use cases):
>
> http://github.com/scode/plrutest/blob/master/src/org/scode/lrutest/LruTest.java
>
>It implements a very simple LRU cache using a LinkedHashMap. It
>inserts entries into this hash map at a pre-defined target rate
>(100,000 per second). Each insert is keyed on a random integer
>between 0 and CACHE_SIZE * 2; thus, once the cache reaches steady
>state, 50% of iterations are expected to generate old garbage.
>
>It is completely non-generational, and the intent was to see how well
>the concurrent marking and partially young evacuations handle large
>heaps.
>
>I have run it with various parameters, but the ones relevant to this post are:
>
> -verbose:gc
> -Xms20g \
> -Xmx20g \
> -XX:+PrintGCTimeStamps \
> -XX:+PrintGCDetails \
> -XX:+UnlockExperimentalVMOptions \
> -XX:+UseG1GC \
> -XX:MaxGCPauseMillis=10 \
> -XX:GCPauseIntervalMillis=20 \
> -XX:G1ConfidencePercent=100 \
> -XX:+G1ParallelRSetUpdatingEnabled \
> -XX:+G1ParallelRSetScanningEnabled \
> -XX:+DisableExplicitGC \
>
>It's being run on a 16-core, 64-bit Linux machine with 36 GB of RAM.
>
>(I know about there being a bug that affects parallel rset
>scanning/updating; for the purpose of this test I don't care about
>that. The observed behavior of very high RS scan costs happens with
>non-parallel RS scanning as well, only the numbers are higher since it
>will be running on one core instead of 16.)
>
>However, I run into trouble with young evacuations before even
>reaching the point of generating actual garbage (i.e., before the
>cache is full). Up to roughly 1 GB of heap used (as reported by GC
>output), pauses are fairly reasonable, with about 10 ms being spent in
>the RS scan phase (with parallel RS scanning enabled). At roughly 1 GB
>of heap, the RS scan cost, as reported by the GC output, suddenly
>starts steadily increasing until it reaches at least hundreds of
>milliseconds with no sign of stopping. Here is some sample output at ~
>1.7 GB heap size:
>
>103.128: [GC pause (young), 0.19716900 secs]
>   [Parallel Time: 193.8 ms]
>      [Update RS (Start) (ms): 103128.7 103128.7 103128.7 103128.7 103128.8 103128.8 103128.8 103128.9 103128.9 103128.9 103129.0 103129.0 103129.1]
>      [Update RS (ms): 1.1 1.2 1.0 0.8 1.2 1.2 1.1 1.0 1.1 1.1 0.9 0.1 0.0
>       Avg: 0.9, Min: 0.0, Max: 1.2]
>         [Processed Buffers: 1 1 1 1 1 1 1 1 1 1 1 1 0
>          Sum: 12, Avg: 0, Min: 0, Max: 1]
>      [Ext Root Scanning (ms): 0.2 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.2]
>      [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.0]
>      [Scan-Only Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.0]
>         [Scan-Only Regions: 0 0 0 0 0 0 0 0 0 0 0 0 0
>          Sum: 0, Avg: 0, Min: 0, Max: 0]
>      [Scan RS (ms): 178.8 178.8 179.1 179.4 178.7 179.2 178.9 178.8 179.2 179.2 178.3 179.9 179.6
>       Avg: 179.1, Min: 178.3, Max: 179.9]
>      [Object Copy (ms): 10.6 10.5 10.3 10.2 10.6 10.0 10.3 10.4 9.9 9.9 10.9 10.2 10.5
>       Avg: 10.3, Min: 9.9, Max: 10.9]
>      [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.0]
>      [Other: 3.4 ms]
>   [Clear CT: 1.7 ms]
>   [Other: 1.7 ms]
>   [ 1722M->1720M(20480M)]
>   [Times: user=2.44 sys=0.06, real=0.19 secs]
>
>Except for the RS scan bit, it looks reasonable. Object copy seems to be
>landing very nicely at around my asked-for 10 ms. RS updating is
>limited as would be expected given that most writes to the cache would
>likely happen to old objects. Root/stack scanning is insignificant
>since it is a small program with a single thread.
>
>But RS scan is apparently very expensive. While I can see how RS
>scanning would be a significant factor given that all objects created
>in the young generation (barring any internal to the LinkedHashMap)
>will be referenced by some pseudo-random object in non-young regions
>(internal LinkedHashMap structure), I would expect the cost of RS
>scanning per young evacuation to scale with the size of the young
>generation. Beyond the edge case of a very very small hash map, I
>expect the distribution of references into the young generation to be
>fairly widely spread across the internal structure of the hash map.
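>
>(To spell out the reference pattern I have in mind, as a fragment
>rather than the complete test; CACHE_SIZE is the constant from the
>test above:)
>
>    // The map is long-lived and its internal Entry objects end up in
>    // old regions, while each put() allocates a fresh value in the
>    // young generation. The old Entry then holds an old -> young
>    // pointer that the young region's remembered set must cover.
>    void insertOnce(Map<Integer, byte[]> cache, Random rnd) {
>        int key = rnd.nextInt(CACHE_SIZE * 2);  // pseudo-random slot in old structure
>        cache.put(key, new byte[64]);           // value allocated young; the old
>                                                // Entry for this slot now points at it
>    }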
>
>I can see some possibilities:
>
>(1) Maybe the LinkedHashMap happens to be structured such that there
>is significant locality among the object references put into it, such
>that up to the 1 GB heap size the number of distinct dirty cards is
>fairly low. The increase in RS scan time I am seeing may simply be the
>cost converging on an eventual very high cost resulting from poor
>locality of writes to old regions. In this case, it may be that the
>remembered sets simply do not scale well enough, period, with many
>well-spread writes to older generations pointing to young data.
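>
>(For concreteness, my mental model of the card granularity here, as a
>sketch and assuming HotSpot's 512-byte cards:)
>
>    // HotSpot's card table maps each 512-byte stretch of heap to one
>    // card (a shift by 9 bits). While successive inserts dirty Entry
>    // objects that happen to share cards, few distinct cards go
>    // dirty; once the map is large enough that writes spread out,
>    // nearly every insert dirties its own card, and RS scan work
>    // grows with the map rather than with the young generation.
>    static final int CARD_SHIFT = 9;  // 512-byte cards in HotSpot
>
>    static long cardIndex(long address) {
>        return address >>> CARD_SHIFT;
>    }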
>
>(2) Maybe I am right in my thinking that writes to older regions would
>already be well spread long before the 1 GB heap size, and there is
>something wrong with the RS scanning itself that causes costs to go up
>much more than they "should" given the work load.
>
>Is there a better explanation?
>
>In the case of (1), it seems to me that what I really want is to run G1
>in non-generational mode, since I would then expect it to evacuate
>useful regions that do in fact free data (thanks to the statistics
>gathered by the concurrent marking). In this case the cost of RS
>scanning, even if it may be high, would at least be a trade-off for
>actual progress in terms of freeing heap. Right now, all this RS scan
>work is being spent on not making any progress.
>
>(What are the reasons for non-generational mode no longer being
>supported? Enabling it asserts later on, so it is clearly not by
>accident. Is it because of the optimization of not recording
>remembered set entries for writes in the young generation that it is
>not just a switch to turn off? Are there other problems with running
>it non-generationally?)
>
>In the case of (2) perhaps this indicates a fixable problem with the GC.
>
>As an aside, the GC behavior during young generation evacuation is, as
>expected, a lot better when I run it against an immutable LRU cache:
>
> http://github.com/scode/plrutest/blob/master/src/org/scode/plrutest/clojure/lrucache.clj
>
>(That is just the cache, not a runnable test.)
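>
>(Conceptually, and this is a sketch of the idea rather than the
>Clojure code above: in an immutable cache each update allocates new
>nodes that point at existing, older structure. References then run
>young -> old, which need not be tracked in remembered sets; only the
>single root field write is potentially old -> young.)
>
>    // Minimal illustration of the immutable idea in Java:
>    final class Node {
>        final int key;
>        final Object value;
>        final Node next;  // points at older, already-promoted structure
>
>        Node(int key, Object value, Node next) {
>            this.key = key; this.value = value; this.next = next;
>        }
>    }
>
>    class ImmutableCacheSketch {
>        // The single mutable slot: each put allocates a young Node
>        // referencing old structure, plus one write to this root.
>        static volatile Node root = null;
>
>        static void put(int key, Object value) {
>            root = new Node(key, value, root);
>        }
>    }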
>