G1GC Full GCs
Todd Lipcon
todd at cloudera.com
Tue Jul 6 13:27:36 PDT 2010
Hi all,
I work on HBase, a distributed database written in Java. We generally run on
large heaps (8GB+), and our object lifetime distribution has proven pretty
problematic for garbage collection (we manage a multi-GB LRU cache inside
the process, so in CMS we tenure a lot of byte arrays which later get
collected).
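For context, the tenuring pattern comes from a workload shaped roughly like the following - a minimal sketch (class and method names are made up for illustration, not HBase's actual code) of an LRU cache holding large byte arrays, where blocks live long enough to be tenured and evictions therefore produce old-generation garbage:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch (not HBase's real classes) of an LRU cache of
 * large byte[] blocks. Cached blocks survive many young collections,
 * so they are tenured; evicting them creates old-gen garbage that only
 * a concurrent cycle or full GC can reclaim.
 */
public class LruBlockCache {
    private final long capacityBytes;
    private long usedBytes = 0;

    // accessOrder=true gives least-recently-used iteration order, so
    // the "eldest" entry is the LRU block.
    private final LinkedHashMap<String, byte[]> map =
        new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict at most one block per insert (enough for a sketch).
                if (usedBytes > capacityBytes) {
                    usedBytes -= eldest.getValue().length;
                    return true;
                }
                return false;
            }
        };

    public LruBlockCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    public void cacheBlock(String key, byte[] block) {
        usedBytes += block.length;
        map.put(key, block);
    }

    public byte[] getBlock(String key) {
        return map.get(key);
    }

    public long usedBytes() {
        return usedBytes;
    }
}
```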
In Java 6, we generally run with the CMS collector, tuning down
CMSInitiatingOccupancyFraction and constraining MaxNewSize to achieve fairly
low-pause GC, but after a week or two of uptime we often run into a full-heap
compaction that takes several minutes and wreaks havoc on the system.
Needless to say, we've been watching the development of the G1 GC with
anticipation for the last year or two. Finally it seems in the latest build
of JDK7 it's stable enough for actual use (previously it would segfault
within an hour or so). However, in my testing I'm seeing frequent Full GC
pauses on the order of 10 seconds.
The flags I'm passing are:
-Xmx8000m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=80
Most of the pauses I see in the GC log are around 10-20ms as expected:
2010-07-05T22:43:19.849-0700: 1680.079: [GC pause (partial), 0.01209000 secs]
   [Parallel Time: 10.5 ms]
      [Update RS (Start) (ms): 1680080.2 1680080.1 1680080.2 1680079.9 1680080.0 1680080.2 1680080.1 1680080.1 1680080.0 1680080.1 1680080.0 1680079.9 1680081.5]
      [Update RS (ms): 1.4 2.0 2.2 1.8 1.7 1.4 2.5 2.2 1.9 2.5 1.7 1.7 0.1
       Avg: 1.8, Min: 0.1, Max: 2.5]
      [Processed Buffers: 8 1 3 1 1 7 3 2 6 2 7 8 3
       Sum: 52, Avg: 4, Min: 1, Max: 8]
      [Ext Root Scanning (ms): 0.7 0.5 0.5 0.3 0.5 0.7 0.4 0.5 0.4 0.5 0.3 0.3 0.0
       Avg: 0.4, Min: 0.0, Max: 0.7]
      [Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
       Avg: 0.0, Min: 0.0, Max: 0.0]
      [Scan RS (ms): 0.9 0.4 0.1 0.7 0.7 0.9 0.0 0.1 0.6 0.0 0.8 0.8 0.9
       Avg: 0.5, Min: 0.0, Max: 0.9]
      [Object Copy (ms): 7.2 7.2 7.3 7.3 7.1 7.1 7.0 7.2 7.1 6.9 7.1 7.1 7.0
       Avg: 7.1, Min: 6.9, Max: 7.3]
      [Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
       Avg: 0.0, Min: 0.0, Max: 0.0]
      [Other: 0.6 ms]
   [Clear CT: 0.5 ms]
   [Other: 1.1 ms]
      [Choose CSet: 0.0 ms]
   [ 7677M->7636M(8000M)]
 [Times: user=0.12 sys=0.00, real=0.01 secs]
But every 5-10 minutes I see a Full GC pause that lasts 10-11 seconds:
[todd at monster01 logs]$ grep 'Full GC' gc-hbase.log | tail
2010-07-06T12:50:41.216-0700: 52521.446: [Full GC 7934M->4865M(8000M), 9.8907800 secs]
2010-07-06T12:55:39.802-0700: 52820.032: [Full GC 7930M->4964M(8000M), 9.9025520 secs]
2010-07-06T13:02:26.872-0700: 53227.102: [Full GC 7934M->4882M(8000M), 10.1232190 secs]
2010-07-06T13:09:41.049-0700: 53661.279: [Full GC 7938M->5002M(8000M), 10.4997760 secs]
2010-07-06T13:18:51.531-0700: 54211.761: [Full GC 7938M->4962M(8000M), 11.0497380 secs]
These pauses are pretty unacceptable for soft real-time operation.
Am I missing some tuning that should be done for G1GC for applications like
this? Is 20ms out of 80ms too aggressive a target for the garbage rates
we're generating?
My actual live heap usage should be very stable around 5GB - the application
carefully accounts for its live object set, keeping it around 60% of the max
heap (as you can see in the post-Full-GC sizes in the logs above).
At this point we are considering doing crazy things like ripping our main
memory consumers out into a custom slab allocator and manually
reference-counting the byte array slices. But if we can get G1GC to work for
us, it will save a lot of engineering on the application side!
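For reference, the slab-allocator direction mentioned above might look something like this minimal sketch (all names hypothetical, not real HBase code): one large byte array is allocated once and tenured once, then carved into fixed-size slices whose lifetimes are managed with manual reference counts rather than by the GC:

```java
import java.util.ArrayDeque;

/**
 * Hypothetical sketch of a manually reference-counted slab allocator.
 * A single large byte[] is divided into fixed-size slices; retain() and
 * release() recycle slices for reuse, so no old-gen garbage is created.
 */
public class SlabAllocator {
    private final byte[] slab;
    private final int sliceSize;
    private final int[] refCounts; // one count per slice
    private final ArrayDeque<Integer> freeOffsets = new ArrayDeque<Integer>();

    public SlabAllocator(int numSlices, int sliceSize) {
        this.slab = new byte[numSlices * sliceSize];
        this.sliceSize = sliceSize;
        this.refCounts = new int[numSlices];
        for (int i = 0; i < numSlices; i++) {
            freeOffsets.add(i * sliceSize);
        }
    }

    /** Returns the offset of a slice with refcount 1, or -1 if the slab is full. */
    public int allocate() {
        Integer off = freeOffsets.poll();
        if (off == null) return -1;
        refCounts[off / sliceSize] = 1;
        return off;
    }

    /** Adds a reference to the slice at the given offset. */
    public void retain(int offset) {
        refCounts[offset / sliceSize]++;
    }

    /** Drops one reference; when the count hits zero the slice is recycled. */
    public void release(int offset) {
        int idx = offset / sliceSize;
        if (--refCounts[idx] == 0) {
            freeOffsets.add(offset);
        }
    }

    public int freeSlices() {
        return freeOffsets.size();
    }

    public byte[] backingArray() {
        return slab;
    }
}
```

The key design point is that allocation pressure on the old generation drops to a single up-front array, at the cost of taking on use-after-free style bugs that the reference counting must prevent.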
Thanks
-Todd
--
Todd Lipcon
Software Engineer, Cloudera