RE: Unexplainable events in GC logs
Denny Kettwig
denny.kettwig at werum.de
Thu Jan 23 00:38:10 PST 2014
Hey folks,
in one of our recent cluster systems we found two unexplainable events in the GC logs. I would like to describe these events to you and ask for your advice. We are running a clustered JBoss system with 4 nodes; every node has the same configuration. We use the following parameters:
-Xms10g
-Xmx10g
-Xmn3g
-Xss2048k
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:ParallelGCThreads=22
-XX:SurvivorRatio=8
-XX:TargetSurvivorRatio=90
-XX:PermSize=512M
-XX:MaxPermSize=512m
-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
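For the record, a minimal way to cross-check the log figures against what the JVM itself reports would be something like the following sketch (this is not code from our application, and the class name is made up); it simply samples the collector MXBeans from the standard java.lang.management API once a minute:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: dump collection counts and accumulated times once a minute.
// With -XX:+UseParNewGC -XX:+UseConcMarkSweepGC the beans are typically
// named "ParNew" and "ConcurrentMarkSweep".
public class GcCounterDump {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d, time=%d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(60000L);
        }
    }
}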
Event 1
The first event is a series of very frequent CMS collections on the first node lasting about 11 hours, with a peak of 258 CMS collections per hour. As far as I can tell, this event starts and ends without any apparent reason, since old gen usage stays below 10% during this time frame. I do not know what might have caused these CMS collections. We have experienced similar events on this cluster in the past, but at a much higher heap usage (above 90%) and under high load, most likely with heavy heap fragmentation; back then the frequent CMS collections ended with a single very long Full GC that defragmented the heap. In the current event none of this applies. I attached the relevant part of the GC log.
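For reference, per-hour figures of this kind can be derived by counting CMS cycle starts in the log. A rough sketch of such counting (assuming the date-stamp format visible in the excerpts below; "gc.log" is just a placeholder path) could look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch: every "CMS-initial-mark" line marks the start of one CMS cycle;
// the leading date stamp (e.g. 2013-12-20T04:57:21.711+0100) is truncated to the hour.
public class CmsCyclesPerHour {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> perHour = new TreeMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("gc.log"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("CMS-initial-mark") && line.length() >= 13) {
                String hour = line.substring(0, 13); // e.g. "2013-12-20T04"
                Integer count = perHour.get(hour);
                perHour.put(hour, count == null ? 1 : count + 1);
            }
        }
        in.close();
        for (Map.Entry<String, Integer> e : perHour.entrySet()) {
            System.out.println(e.getKey() + ":00 -> " + e.getValue() + " CMS cycles");
        }
    }
}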
Event 2
The second event is just as confusing to me as the first one. At a certain point in time a Full GC takes place on all 4 nodes, and I cannot find a reason for it. Here are the relevant parts:
N1
2013-12-20T05:43:28.041+0100: 768231.382: [GC 768231.383: [ParNew: 2618493K->92197K(2831168K), 0.3007160 secs] 3120456K->594160K(10171200K), 0.3017204 secs] [Times: user=0.91 sys=0.00, real=0.30 secs]
2013-12-20T05:45:31.864+0100: 768355.209: [Full GC 768355.210: [CMS: 501963K->496288K(7340032K), 4.2140085 secs] 1397267K->496288K(10171200K), [CMS Perm : 203781K->178528K(524288K)], 4.2148018 secs] [Times: user=4.21 sys=0.00, real=4.21 secs]
2013-12-20T05:52:40.591+0100: 768783.949: [GC 768783.949: [ParNew: 2516608K->48243K(2831168K), 0.2649174 secs] 3012896K->544532K(10171200K), 0.2659039 secs] [Times: user=0.47 sys=0.00, real=0.27 secs]
N2
2013-12-20T04:57:21.524+0100: 765208.310: [GC 765208.311: [ParNew: 924566K->111068K(2831168K), 0.1790573 secs] 1416514K->603015K(10171200K), 0.1797121 secs] [Times: user=1.08 sys=0.00, real=0.19 secs]
2013-12-20T04:57:21.711+0100: 765208.499: [GC [1 CMS-initial-mark: 491947K(7340032K)] 603015K(10171200K), 0.3289639 secs] [Times: user=0.33 sys=0.00, real=0.33 secs]
2013-12-20T04:57:22.039+0100: 765208.828: [CMS-concurrent-mark-start]
2013-12-20T04:57:22.616+0100: 765209.405: [CMS-concurrent-mark: 0.577/0.577 secs] [Times: user=3.53 sys=0.05, real=0.58 secs]
2013-12-20T04:57:22.616+0100: 765209.405: [CMS-concurrent-preclean-start]
2013-12-20T04:57:22.647+0100: 765209.430: [CMS-concurrent-preclean: 0.024/0.024 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2013-12-20T04:57:22.647+0100: 765209.430: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 2013-12-20T04:57:28.060+0100: 765214.848: [CMS-concurrent-abortable-preclean: 4.208/5.418 secs] [Times: user=4.10 sys=0.03, real=5.41 secs]
2013-12-20T04:57:28.076+0100: 765214.857: [GC[YG occupancy: 124872 K (2831168 K)]765214.857: [Rescan (parallel) , 0.0580906 secs]765214.916: [weak refs processing, 0.0000912 secs]765214.916: [class unloading, 0.0769742 secs]765214.993: [scrub symbol & string tables, 0.0612689 secs] [1 CMS-remark: 491947K(7340032K)] 616820K(10171200K), 0.2256506 secs] [Times: user=0.48 sys=0.00, real=0.22 secs]
2013-12-20T04:57:28.294+0100: 765215.083: [CMS-concurrent-sweep-start]
2013-12-20T04:57:29.043+0100: 765215.834: [CMS-concurrent-sweep: 0.750/0.750 secs] [Times: user=0.76 sys=0.00, real=0.75 secs]
2013-12-20T04:57:29.043+0100: 765215.834: [CMS-concurrent-reset-start]
2013-12-20T04:57:29.074+0100: 765215.856: [CMS-concurrent-reset: 0.022/0.022 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2013-12-20T05:47:54.607+0100: 768241.413: [Full GC 768241.414: [CMS: 491947K->464864K(7340032K), 4.5053183 secs] 1910412K->464864K(10171200K), [CMS Perm : 183511K->168997K(524288K)], 4.5059088 secs] [Times: user=4.49 sys=0.02, real=4.51 secs]
2013-12-20T06:47:59.098+0100: 771845.954: [GC 771845.954: [ParNew: 1507774K->58195K(2831168K), 0.2307012 secs] 1972638K->523059K(10171200K), 0.2313931 secs] [Times: user=0.37 sys=0.00, real=0.23 secs]
N3
2013-12-20T05:46:25.441+0100: 767981.526: [GC 767981.526: [ParNew: 2695212K->166641K(2831168K), 0.3278475 secs] 3268057K->739486K(10171200K), 0.3284853 secs] [Times: user=1.62 sys=0.00, real=0.33 secs]
2013-12-20T05:49:55.467+0100: 768191.578: [Full GC 768191.578: [CMS: 572844K->457790K(7340032K), 3.7687762 secs] 1216176K->457790K(10171200K), [CMS Perm : 181711K->169514K(524288K)], 3.7692999 secs] [Times: user=3.76 sys=0.00, real=3.79 secs]
2013-12-20T06:49:59.249+0100: 771795.415: [GC 771795.415: [ParNew: 1585077K->72146K(2831168K), 0.2632945 secs] 2042868K->529936K(10171200K), 0.2639889 secs] [Times: user=0.41 sys=0.00, real=0.27 secs]
N4
2013-12-20T05:48:21.551+0100: 767914.067: [GC 767914.068: [ParNew: 2656327K->119432K(2831168K), 0.2603676 secs] 3222693K->685799K(10171200K), 0.2609581 secs] [Times: user=1.14 sys=0.00, real=0.26 secs]
2013-12-20T05:49:03.939+0100: 767956.457: [Full GC 767956.457: [CMS: 566366K->579681K(7340032K), 6.1324841 secs] 3149011K->579681K(10171200K), [CMS Perm : 190240K->174791K(524288K)], 6.1331389 secs] [Times: user=6.13 sys=0.00, real=6.13 secs]
2013-12-20T05:50:10.262+0100: 768022.762: [GC 768022.763: [ParNew: 2516608K->83922K(2831168K), 0.2157015 secs] 3096289K->663603K(10171200K), 0.2162262 secs] [Times: user=0.41 sys=0.00, real=0.22 secs]
Between these Full GCs there are only a few minutes, or, as between N3 and N4, only a few seconds. I did some research on possible reasons for a Full GC, and this is the list I have gathered so far:
1. Running out of old gen
2. Running out of perm gen
3. Calling System.gc() (indicated by "System" in the output; see the sketch after this list)
4. Not having enough free space in Survivor Space to copy objects from Eden (promotion failed)
5. Running out of old gen before a concurrent collection can free it (Concurrent Mode Failure)
6. Having high fragmentation and not enough space for a larger object in old gen
However, none of these six conditions is fulfilled by any of the Full GCs shown above. So once again I am lost and do not have an explanation for this.
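Regarding case 3: since we run with -XX:+ExplicitGCInvokesConcurrent, an explicit System.gc() should be turned into a concurrent CMS cycle rather than a stop-the-world "Full GC (System)" entry, and if I understand the sun.rmi.dgc.*.gcInterval settings correctly, the hourly collection requests from the RMI runtime would go through the same path. A trivial sketch (purely an illustration, not code from our application) to verify this behaviour against the GC log:

// Illustration only: request an explicit collection and watch the GC log.
// With -XX:+ExplicitGCInvokesConcurrent this should appear as a concurrent
// CMS cycle, not as a stop-the-world Full GC.
public class ExplicitGcProbe {
    public static void main(String[] args) {
        System.out.println("Requesting explicit GC...");
        System.gc();
        System.out.println("Done, check the GC log.");
    }
}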
If you need the full logs for further analysis please let me know.
Kind Regards
Denny
Werum Software & Systems AG
Wulf-Werum-Strasse 3 | 21337 Lueneburg | Germany
Tel. +49(0)4131/8900-983 | Fax +49(0)4131/8900-20
denny.kettwig at werum.de | http://www.werum.de
VAT No. DE 116 083 850 | RG Lueneburg HRB 2262
Chairman of Supervisory Board: Johannes Zimmermann
Executive Board: Hartmut Krome, Ruediger Schlierenkaemper, Hans-Peter Subel