G1GC Full GCs

Todd Lipcon todd at cloudera.com
Tue Jul 6 17:41:14 PDT 2010


On Tue, Jul 6, 2010 at 4:57 PM, Y. S. Ramakrishna <y.s.ramakrishna at oracle.com> wrote:

> Two questions:
>
> (1) do you enable class unloading with CMS? (I do not see that
>    below in yr option list, but wondered.)
>
>
We don't enable class unloading.


> (2) does your application load classes, or intern a lot of strings?
>

We do a little bit of class loading at startup - we have some basic
informational web UIs that run as Jetty servlets, and I think Jetty uses its
own ClassLoader. It's nothing fancy, though, and nothing should get
unloaded/reloaded. Perm usually stabilizes around 30MB, if I recall correctly,
when I use CMS.


> If i am reading the logs right, G1 appears to reclaim less
> and less of the heap in each cycle until a full collection
> intervenes, and I have no real explanation for this behaviour
> except that perhaps there's something in the perm gen that
> keeps stuff in the remainder of the heap artificially live.
> G1 does not incrementally collect the perm gen, so this is
> plausible. But CMS does not either by default and I do not see
> that option in the CMS options list you gave below.


We don't enable permgen collection with CMS - the options I gave were from
production instances.
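
(If it would help the comparison, I believe enabling class unloading on these
builds would just be a matter of adding

-XX:+CMSClassUnloadingEnabled

to the CMS options. I'm going from memory on that flag rather than from a
config we actually run, so treat it as an assumption on my part.)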


> It would
> be instructive to see what the comparable CMS logs look like.
> May be then you could start with the same heap shapes for the
> two and see if you can get to the bottom of the full gc (which
> as i understand you get to more quickly w/G1 than you did
> w/CMS).
>

Yep, I'll kick off some CMS tests this evening and get back to you with
those logs.
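
Concretely, my plan is to keep the same -Xmx8000m heap as the G1 run, reuse
the NewSize settings from the configs in my earlier mail (quoted below), and
turn on full GC logging - roughly something like the following (the occupancy
fraction of 75 and the log file name are my own picks for this test, not taken
from an existing config):

-Xmx8000m -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m \
-XX:CMSInitiatingOccupancyFraction=75 \
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc-hbase-cms.log

That should give comparable heap shapes for the two collectors, per your
suggestion.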

Thanks for the help, all.
-Todd

>
> On 07/06/10 14:24, Todd Lipcon wrote:
>
>> On Tue, Jul 6, 2010 at 2:09 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>>
>>    Todd,
>>
>>    Could you send a segment of the GC logs from the beginning
>>    through the first dozen or so full GC's?
>>
>>
>> Sure, I just put it online at:
>>
>> http://cloudera-todd.s3.amazonaws.com/gc-log-g1gc.txt
>>
>>
>>    Exactly which version of the JVM are you using?
>>
>>        java -version
>>
>>    will tell us.
>>
>>
>> Latest as of last night:
>>
>> [todd at monster01 ~]$ ./jdk1.7.0/jre/bin/java -version
>> java version "1.7.0-ea"
>> Java(TM) SE Runtime Environment (build 1.7.0-ea-b99)
>> Java HotSpot(TM) 64-Bit Server VM (build 19.0-b03, mixed mode)
>>
>>    Do you have a test setup where you could do some experiments?
>>
>>
>> Sure, I have a five node cluster here where I do lots of testing, happy to
>> try different builds/options/etc (though I probably don't have time to apply
>> patches and rebuild the JDK myself)
>>
>>    Can you send the set of CMS flags you use?  It might tell
>>    us something about the GC behavior of your application.
>>    Might not tell us anything but it's worth a look.
>>
>>
>> Different customers have found different flags to work well for them. One
>> user uses the following:
>>
>>
>> -Xmx12000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC \
>> -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=40 \
>>
>>
>> Another uses:
>>
>>
>> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC
>> -XX:NewSize=64m -XX:MaxNewSize=64m
>> -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
>>
>> The particular tuning options probably depend on the actual cache workload
>> of the user. I tend to recommend CMSInitiatingOccupancyFraction around 75 or
>> so, since the software maintains about 60% heap usage. I also think a
>> slightly larger NewSize would improve things a bit, but once it goes above
>> 256m or so, the ParNew pauses start to be too long for a lot of use cases.
>>
>> Regarding CMS logs, I can probably restart this test later this afternoon
>> on CMS and run it for a couple hours, but it isn't likely to hit the
>> multi-minute compaction that quickly. It happens more in the wild.
>>
>> -Todd
>>
>>
>>
>>    On 07/06/10 13:27, Todd Lipcon wrote:
>>
>>>    Hi all,
>>>
>>>    I work on HBase, a distributed database written in Java. We
>>>    generally run on large heaps (8GB+), and our object lifetime
>>>    distribution has proven pretty problematic for garbage collection
>>>    (we manage a multi-GB LRU cache inside the process, so in CMS we
>>>    tenure a lot of byte arrays which later get collected).
>>>
>>>    In Java6, we generally run with the CMS collector, tuning down
>>>    CMSInitiatingOccupancyFraction and constraining MaxNewSize to
>>>    achieve fairly low pause GC, but after a week or two of uptime we
>>>    often run into full heap compaction which takes several minutes
>>>    and wreaks havoc on the system.
>>>
>>>    Needless to say, we've been watching the development of the G1 GC
>>>    with anticipation for the last year or two. Finally it seems in
>>>    the latest build of JDK7 it's stable enough for actual use
>>>    (previously it would segfault within an hour or so). However, in
>>>    my testing I'm seeing a fair amount of 8-10 second Full GC pauses.
>>>
>>>    The flags I'm passing are:
>>>
>>>    -Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
>>>    -XX:GCPauseIntervalMillis=80
>>>
>>>    Most of the pauses I see in the GC log are around 10-20ms as expected:
>>>
>>>    2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial),
>>>    0.01209000 secs]
>>>       [Parallel Time:  10.5 ms]
>>>          [Update RS (Start) (ms):  1680080.2  1680080.1  1680080.2
>>>     1680079.9  1680080.0  1680080.2  1680080.1  1680080.1  1680080.0
>>>     1680080.1  1680080.0  1680079.9  1680081.5]
>>>          [Update RS (ms):  1.4  2.0  2.2  1.8  1.7  1.4  2.5  2.2
>>>     1.9  2.5  1.7  1.7  0.1
>>>           Avg:   1.8, Min:   0.1, Max:   2.5]
>>>             [Processed Buffers : 8 1 3 1 1 7 3 2 6 2 7 8 3
>>>              Sum: 52, Avg: 4, Min: 1, Max: 8]
>>>          [Ext Root Scanning (ms):  0.7  0.5  0.5  0.3  0.5  0.7  0.4
>>>     0.5  0.4  0.5  0.3  0.3  0.0
>>>           Avg:   0.4, Min:   0.0, Max:   0.7]
>>>          [Mark Stack Scanning (ms):  0.0  0.0  0.0  0.0  0.0  0.0
>>>     0.0  0.0  0.0  0.0  0.0  0.0  0.0
>>>           Avg:   0.0, Min:   0.0, Max:   0.0]
>>>          [Scan RS (ms):  0.9  0.4  0.1  0.7  0.7  0.9  0.0  0.1  0.6
>>>     0.0  0.8  0.8  0.9
>>>           Avg:   0.5, Min:   0.0, Max:   0.9]
>>>          [Object Copy (ms):  7.2  7.2  7.3  7.3  7.1  7.1  7.0  7.2
>>>     7.1  6.9  7.1  7.1  7.0
>>>           Avg:   7.1, Min:   6.9, Max:   7.3]
>>>          [Termination (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
>>>     0.0  0.0  0.0  0.0  0.0
>>>           Avg:   0.0, Min:   0.0, Max:   0.0]
>>>          [Other:   0.6 ms]
>>>       [Clear CT:   0.5 ms]
>>>       [Other:   1.1 ms]
>>>          [Choose CSet:   0.0 ms]
>>>       [ 7677M->7636M(8000M)]
>>>     [Times: user=0.12 sys=0.00, real=0.01 secs]
>>>
>>>    But every 5-10 minutes I see a GC pause that lasts 10-15 seconds:
>>>    [todd at monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
>>>    2010-07-06T12:50:41.216-0700: 52521.446: [Full GC
>>>    7934M->4865M(8000M), 9.8907800 secs]
>>>    2010-07-06T12:55:39.802-0700: 52820.032: [Full GC
>>>    7930M->4964M(8000M), 9.9025520 secs]
>>>    2010-07-06T13:02:26.872-0700: 53227.102: [Full GC
>>>    7934M->4882M(8000M), 10.1232190 secs]
>>>    2010-07-06T13:09:41.049-0700: 53661.279: [Full GC
>>>    7938M->5002M(8000M), 10.4997760 secs]
>>>    2010-07-06T13:18:51.531-0700: 54211.761: [Full GC
>>>    7938M->4962M(8000M), 11.0497380 secs]
>>>
>>>    These pauses are pretty unacceptable for soft real time operation.
>>>
>>>    Am I missing some tuning that should be done for G1GC for
>>>    applications like this? Is 20ms out of 80ms too aggressive a
>>>    target for the garbage rates we're generating?
>>>
>>>    My actual live heap usage should be very stable around 5GB - the
>>>    application very carefully accounts its live object set at around
>>>    60% of the max heap (as you can see in the logs above).
>>>
>>>    At this point we are considering doing crazy things like ripping
>>>    out our main memory consumers into a custom slab allocator, and
>>>    manually reference counting the byte array slices. But if we can get
>>>    G1GC to work for us, it will save a lot of engineering on the
>>>    application side!
>>>
>>>    Thanks
>>>    -Todd
>>>
>>>    --     Todd Lipcon
>>>    Software Engineer, Cloudera
>>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera