g1: remembered set scan costs very high w/ LRU cache type of behavior
Jon Masamitsu
Jon.Masamitsu at Sun.COM
Sat Feb 27 11:13:23 PST 2010
Peter,
Which version of the jdk are you using? A jdk 6.0 update? Or
jdk 7.0 build?
Thanks.
Jon
Peter Schuller wrote on 02/27/10 10:48:
>Hello,
>
>I have a very simple test case specifically designed to stress GCs
>(while being reasonably representative of some worst-case but
>plausible production use cases):
>
> http://github.com/scode/plrutest/blob/master/src/org/scode/lrutest/LruTest.java
>
>It implements a very simple LRU cache using a LinkedHashMap. It
>inserts entries into this hash map at a pre-defined target rate
>(100,000 per second). Each insert is keyed on a random integer
>between 0 and CACHE_SIZE * 2; thus, once the cache reaches steady
>state, 50% of iterations are expected to generate old garbage.
>
>It is completely non-generational, and the intent was to see how well
>the concurrent marking and partially young evacuations handle large
>heaps.
>
>I have run it with various parameters, but the ones relevant to this post are:
>
> -verbose:gc
> -Xms20g \
> -Xmx20g \
> -XX:+PrintGCTimeStamps \
> -XX:+PrintGCDetails \
> -XX:+UnlockExperimentalVMOptions \
> -XX:+UseG1GC \
> -XX:MaxGCPauseMillis=10 \
> -XX:GCPauseIntervalMillis=20 \
> -XX:G1ConfidencePercent=100 \
> -XX:+G1ParallelRSetUpdatingEnabled \
> -XX:+G1ParallelRSetScanningEnabled \
> -XX:+DisableExplicitGC \
>
>It's being run on a 16-core, 64-bit Linux machine with 36 GB of RAM.
>
>(I know about there being a bug that affects parallel rset
>scanning/updating; for the purpose of this test I don't care about
>that. The observed behavior of very high RS scan costs happens with
>non-parallel RS scanning as well, only the numbers are higher since it
>will be running on one core instead of 16.)
>
>However, I run into trouble with young evacuations before even
>reaching the point of generating actual garbage (i.e., before the
>cache is full). Up to roughly 1 GB of heap used (as reported by GC
>output), pauses are fairly reasonable, with about 10 ms being spent in
>the RS scan phase (with parallel RS scanning enabled). At roughly 1 GB
>of heap, the RS scan cost, as reported by the GC output, suddenly
>starts steadily increasing until it reaches at least hundreds of
>milliseconds with no sign of stopping. Here is some sample output at ~
>1.7 GB heap size:
>
>103.128: [GC pause (young), 0.19716900 secs]
>   [Parallel Time: 193.8 ms]
>      [Update RS (Start) (ms): 103128.7 103128.7 103128.7 103128.7 103128.8 103128.8 103128.8 103128.9 103128.9 103128.9 103129.0 103129.0 103129.1]
>      [Update RS (ms): 1.1 1.2 1.0 0.8 1.2 1.2 1.1 1.0 1.1 1.1 0.9 0.1 0.0
>       Avg: 0.9, Min: 0.0, Max: 1.2]
>         [Processed Buffers: 1 1 1 1 1 1 1 1 1 1 1 1 0
>          Sum: 12, Avg: 0, Min: 0, Max: 1]
>      [Ext Root Scanning (ms): 0.2 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.2]
>      [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.0]
>      [Scan-Only Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.0]
>         [Scan-Only Regions: 0 0 0 0 0 0 0 0 0 0 0 0 0
>          Sum: 0, Avg: 0, Min: 0, Max: 0]
>      [Scan RS (ms): 178.8 178.8 179.1 179.4 178.7 179.2 178.9 178.8 179.2 179.2 178.3 179.9 179.6
>       Avg: 179.1, Min: 178.3, Max: 179.9]
>      [Object Copy (ms): 10.6 10.5 10.3 10.2 10.6 10.0 10.3 10.4 9.9 9.9 10.9 10.2 10.5
>       Avg: 10.3, Min: 9.9, Max: 10.9]
>      [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>       Avg: 0.0, Min: 0.0, Max: 0.0]
>      [Other: 3.4 ms]
>   [Clear CT: 1.7 ms]
>   [Other: 1.7 ms]
>   [ 1722M->1720M(20480M)]
>   [Times: user=2.44 sys=0.06, real=0.19 secs]
>
>Except for the RS scan bit, it looks reasonable. Object copy seems to be
>landing very nicely at around my asked-for 10 ms. RS updating is
>limited as would be expected given that most writes to the cache would
>likely happen to old objects. Root/stack scanning is insignificant
>since it is a small program with a single thread.
>
>But RS scan is apparently very expensive. While I can see how RS
>scanning would be a significant factor given that all objects created
>in the young generation (barring any internal to the LinkedHashMap)
>will be referenced by some pseudo-random object in non-young regions
>(internal LinkedHashMap structure), I would expect the cost of RS
>scanning per young evacuation to scale with the size of the young
>generation. Beyond the edge case of a very very small hash map, I
>expect the distribution of references into the young generation to be
>fairly widely spread across the internal structure of the hash map.
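>
>(To spell out the reference pattern I have in mind, as a fragment
>rather than the complete test; CACHE_SIZE is the constant from the
>test above:)
>
>    // The map is long-lived and its internal Entry objects end up in
>    // old regions, while each put() allocates a fresh value in the
>    // young generation. The old Entry then holds an old -> young
>    // pointer that the young region's remembered set must cover.
>    void insertOnce(Map<Integer, byte[]> cache, Random rnd) {
>        int key = rnd.nextInt(CACHE_SIZE * 2);  // pseudo-random slot in old structure
>        cache.put(key, new byte[64]);           // value allocated young; the old
>                                                // Entry for this slot now points at it
>    }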
>
>I can see some possibilities:
>
>(1) Maybe the LinkedHashMap happens to be structured such that there
>is significant locality among the object references put into it, such
>that up to the 1 GB heap size the number of distinct dirty cards is
>fairly low. The increase in RS scan time I am seeing may simply be the
>cost converging on an eventual very high cost resulting from poor
>locality of writes to old regions. In this case, it may be that the
>remembered sets simply do not scale well enough, period, with many
>well-spread writes to older generations pointing to young data.
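>
>(For concreteness, my mental model of the card granularity here, as a
>sketch and assuming HotSpot's 512-byte cards:)
>
>    // HotSpot's card table maps each 512-byte stretch of heap to one
>    // card (a shift by 9 bits). While successive inserts dirty Entry
>    // objects that happen to share cards, few distinct cards go
>    // dirty; once the map is large enough that writes spread out,
>    // nearly every insert dirties its own card, and RS scan work
>    // grows with the map rather than with the young generation.
>    static final int CARD_SHIFT = 9;  // 512-byte cards in HotSpot
>
>    static long cardIndex(long address) {
>        return address >>> CARD_SHIFT;
>    }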
>
>(2) Maybe I am right in my thinking that writes to older regions would
>already be well spread long before the 1 GB heap size, and there is
>something wrong with the RS scanning itself that causes costs to go up
>much more than they "should" given the work load.
>
>Is there a better explanation?
>
>In the case of (1), it seems to me that what I really want is to run G1
>in non-generational mode, since I would then expect it to evacuate
>useful regions that do in fact free data (thanks to the statistics
>gathered by the concurrent marking). In this case the cost of RS
>scanning, even if it may be high, would at least be a trade-off for
>actual progress in terms of freeing heap. Right now, all this RS scan
>work is being spent on not making any progress.
>
>(What are the reasons for non-generational mode no longer being
>supported? Enabling it asserts later on, so it is clearly not by
>accident. Is it because of the optimization of not recording
>remembered set entries for writes in the young generation that it is
>not just a switch to turn off? Are there other problems with running
>it non-generationally?)
>
>In the case of (2) perhaps this indicates a fixable problem with the GC.
>
>As an aside, the GC behavior during young generation evacuation is, as
>expected, a lot better when I run it against an immutable LRU cache:
>
> http://github.com/scode/plrutest/blob/master/src/org/scode/plrutest/clojure/lrucache.clj
>
>(That is just the cache, not a runnable test.)
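>
>(Conceptually, and this is a sketch of the idea rather than the
>Clojure code above: in an immutable cache each update allocates new
>nodes that point at existing, older structure. References then run
>young -> old, which need not be tracked in remembered sets; only the
>single root field write is potentially old -> young.)
>
>    // Minimal illustration of the immutable idea in Java:
>    final class Node {
>        final int key;
>        final Object value;
>        final Node next;  // points at older, already-promoted structure
>
>        Node(int key, Object value, Node next) {
>            this.key = key; this.value = value; this.next = next;
>        }
>    }
>
>    class ImmutableCacheSketch {
>        // The single mutable slot: each put allocates a young Node
>        // referencing old structure, plus one write to this root.
>        static volatile Node root = null;
>
>        static void put(int key, Object value) {
>            root = new Node(key, value, root);
>        }
>    }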
>