Long "stop-the-world" pauses in CMS GC mode

David Tavoularis david.tavoularis at mycom-int.com
Mon Aug 29 02:25:38 PDT 2011


Hi,

I am trying to understand the cause of "stop-the-world" pauses in my application using CMS GC and a large heap (48GB).
The production server SF6900 (24 x dual-core UltraSparc-IV 1.35GHz, 48 working threads, 140GB RAM) is running on Solaris 9 and Java6u25.

I know that there are several possible causes :
1) OldGen fragmentation : to avoid it, I implemented an automatic FullGC in crontab at 2:30am
30 2 * * * /usr/jdk/instances/jdk1.6.0/bin/jmap -d64 -histo:live `/usr/bin/pgrep -f "XXXXXXX"` 2>&1 >/dev/null

2) Weak refs processing : a workaround (not tried yet) is to use -XX:+ParallelRefProcEnabled, as described in the following articles :
http://blogs.oracle.com/jonthecollector/entry/top_10_gc_reasons
http://stackoverflow.com/questions/4101540/how-can-i-lower-the-weak-ref-processing-time-during-gc
I have found out that it could be triggered by the daily unreferencing of a big object containing millions of small objects (using weak references).


The application has been running for almost a week and I can see some "stop-the-world" pauses longer than 10 seconds :
$ egrep "Total time for which application threads were stopped: [0-9][0-9]\." gc_201108232207.log
Total time for which application threads were stopped: 10.8630158 seconds <- due to weak refs
Total time for which application threads were stopped: 18.5259611 seconds
Total time for which application threads were stopped: 10.0777809 seconds <- due to weak refs
Total time for which application threads were stopped: 61.5576519 seconds
Total time for which application threads were stopped: 19.0205127 seconds
Total time for which application threads were stopped: 20.6893643 seconds
Total time for which application threads were stopped: 16.0048075 seconds
Total time for which application threads were stopped: 12.3665083 seconds <- due to weak refs
Total time for which application threads were stopped: 11.5213443 seconds <- due to weak refs
Total time for which application threads were stopped: 37.1018520 seconds <- due to weak refs
Total time for which application threads were stopped: 16.3988783 seconds <- due to weak refs
Total time for which application threads were stopped: 12.4057546 seconds

6 of them have unknown explanation for me.

For your information, here are the 6 "weak refs" log messages :
$ egrep "weak refs processing, [1-9][0-9]?" gc_201108232207.log | more
2011-08-24T10:13:49.641+0100: 43564.409: [GC[YG occupancy: 342791 K (943744 K)]43564.410: [Rescan (non-parallel) 43564.410: [grey object rescan, 0.7358794 secs]43565.146: [root rescan, 1.9033345 secs], 2.6398211 secs]43567.049: [weak refs processing, 8.2148555 secs] [1 CMS-remark: 26914465K(49283072K)] 27257257K(50226816K), 10.8566498 secs] [Times: user=10.85 sys=0.00, real=10.86 secs]
2011-08-25T12:33:22.658+0100: 138336.194: [GC[YG occupancy: 179985 K (943744 K)]138336.195: [Rescan (non-parallel) 138336.195: [grey object rescan, 0.5969886 secs]138336.792: [root rescan, 0.5114118 secs], 1.1089811 secs]138337.304: [weak refs processing, 8.8414246 secs] [1 CMS-remark: 20122279K(49283072K)] 20302264K(5226816K), 9.9514563 secs] [Times: user=9.94 sys=0.01, real=9.95 secs]
2011-08-26T07:22:55.233+0100: 206107.887: [GC[YG occupancy: 177014 K (943744 K)]206107.888: [Rescan (non-parallel) 206107.888: [grey object rescan, 0.4472730 secs]206108.335: [root rescan, 1.5575365 secs], 2.0053337 secs]206109.893: [weak refs processing, 10.3436973 secs] [1 CMS-remark: 19861286K(49283072K)] 20038301K(50226816K), 12.3572481 secs] [Times: user=12.22 sys=0.00, real=12.36 secs]
2011-08-26T07:51:55.531+0100: 207848.163: [GC[YG occupancy: 423184 K (943744 K)]207848.163: [Rescan (non-parallel) 207848.163: [grey object rescan, 0.4466552 secs]207848.610: [root rescan, 3.4207362 secs], 3.8680060 secs]207852.031: [weak refs processing, 7.6403893 secs] [1 CMS-remark: 19714349K(49283072K)] 20137533K(50226816K), 11.5130922 secs] [Times: user=11.51 sys=0.00, real=11.51 secs]
2011-08-27T15:18:48.928+0100: 321060.091: [GC[YG occupancy: 711567 K (943744 K)]321060.092: [Rescan (non-parallel) 321060.092: [grey object rescan, 0.4628955 secs]321060.555: [root rescan, 3.2087381 secs], 3.6721710 secs]321063.764: [weak refs processing, 33.3995481 secs] [1 CMS-remark: 19918243K(49283072K)] 20629810K(50226816K), 37.0910804 secs] [Times: user=37.04 sys=0.00, real=37.09 secs]
2011-08-28T11:17:12.144+0100: 392962.378: [GC[YG occupancy: 811576 K (943744 K)]392962.378: [Rescan (non-parallel) 392962.378: [grey object rescan, 0.4140054 secs]392962.793: [root rescan, 4.4323136 secs], 4.8469694 secs]392967.225: [weak refs processing, 11.5384812 secs] [1 CMS-remark: 19819290K(49283072K)] 20630867K(50226816K), 16.3885374 secs] [Times: user=16.35 sys=0.01, real=16.39 secs]






1. Here is the first pattern : a 61-second pause, but I don't see any suspicious message in GC logs:
2011-08-24T10:24:25.748+0100: 44200.509: [GC 44200.511: [ParNew
Desired survivor size 53673984 bytes, new threshold 1 (max 4)
- age 1: 101879520 bytes, 101879520 total
: 933589K->104832K(943744K), 0.3947382 secs] 21369469K->20703994K(50226816K), 0.3966779 secs] [Times: user=6.43 sys=0.04, real=0.40 secs]
Heap after GC invocations=1187 (full 12):
par new generation total 943744K, used 104832K [0xfffffff353c00000, 0xfffffff393c00000, 0xfffffff393c00000)
eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 0xfffffff386f40000)
from space 104832K, 100% used [0xfffffff386f40000, 0xfffffff38d5a0000, 0xfffffff38d5a0000)
to space 104832K, 0% used [0xfffffff38d5a0000, 0xfffffff38d5a0000, 0xfffffff393c00000)
concurrent mark-sweep generation total 49283072K, used 20599162K [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
concurrent-mark-sweep perm gen total 524288K, used 42905K [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
}
Total time for which application threads were stopped: 0.4110458 seconds
Application time: 39.5906692 seconds
{Heap before GC invocations=1187 (full 12):
par new generation total 943744K, used 943744K [0xfffffff353c00000, 0xfffffff393c00000, 0xfffffff393c00000)
eden space 838912K, 100% used [0xfffffff353c00000, 0xfffffff386f40000, 0xfffffff386f40000)
from space 104832K, 100% used [0xfffffff386f40000, 0xfffffff38d5a0000, 0xfffffff38d5a0000)
to space 104832K, 0% used [0xfffffff38d5a0000, 0xfffffff38d5a0000, 0xfffffff393c00000)
concurrent mark-sweep generation total 49283072K, used 20599162K [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
concurrent-mark-sweep perm gen total 524288K, used 42905K [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
2011-08-24T10:25:07.776+0100: 44242.537: [GC 44301.853: [ParNew
Desired survivor size 53673984 bytes, new threshold 1 (max 4)
- age 1: 99505080 bytes, 99505080 total
: 943744K->104832K(943744K), 0.2010508 secs] 21542906K->20852742K(50226816K), 0.2022636 secs] [Times: user=5.67 sys=0.02, real=59.52 secs]
Heap after GC invocations=1188 (full 12):
par new generation total 943744K, used 104832K [0xfffffff353c00000, 0xfffffff393c00000, 0xfffffff393c00000)
eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 0xfffffff386f40000)
from space 104832K, 100% used [0xfffffff38d5a0000, 0xfffffff393c00000, 0xfffffff393c00000)
to space 104832K, 0% used [0xfffffff386f40000, 0xfffffff386f40000, 0xfffffff38d5a0000)
concurrent mark-sweep generation total 49283072K, used 20747910K [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
concurrent-mark-sweep perm gen total 524288K, used 42905K [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
}
Total time for which application threads were stopped: 61.5576519 seconds
Application time: 0.0245838 seconds
Total time for which application threads were stopped: 9.8331189 seconds
Application time: 0.0012626 seconds
Total time for which application threads were stopped: 0.0090404 seconds
Application time: 0.0008943 seconds
Total time for which application threads were stopped: 0.0020415 seconds
Application time: 0.0008181 seconds
Total time for which application threads were stopped: 0.2338605 seconds
Application time: 0.0018822 seconds

The only suspicious thing is "[Times: user=5.67 sys=0.02, real=59.52 secs]", which means that the "real" duration is a lot higher than "user" CPU time.
Because "sys" duration is low, it also means that the server is not swapping.
What could explain this 61 seconds pause ?



2. Here is the second pattern : a 20-second pause, in the middle of nowhere in GC logs :
{Heap before GC invocations=11132 (full 166):
par new generation total 943744K, used 882686K [0xfffffff353c00000, 0xfffffff393c00000, 0xfffffff393c00000)
eden space 838912K, 100% used [0xfffffff353c00000, 0xfffffff386f40000, 0xfffffff386f40000)
from space 104832K, 41% used [0xfffffff386f40000, 0xfffffff3899ffa48, 0xfffffff38d5a0000)
to space 104832K, 0% used [0xfffffff38d5a0000, 0xfffffff38d5a0000, 0xfffffff393c00000)
concurrent mark-sweep generation total 49283072K, used 19148140K [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
concurrent-mark-sweep perm gen total 524288K, used 44308K [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
2011-08-25T20:07:07.235+0100: 165560.417: [GC 165560.417: [ParNew
Desired survivor size 53673984 bytes, new threshold 4 (max 4)
- age 1: 26189384 bytes, 26189384 total
- age 2: 1713728 bytes, 27903112 total
: 882686K->34449K(943744K), 0.1280202 secs] 20030826K->19182589K(50226816K), 0.1285927 secs] [Times: user=3.94 sys=0.01, real=0.13 secs]
Heap after GC invocations=11133 (full 166):
par new generation total 943744K, used 34449K [0xfffffff353c00000, 0xfffffff393c00000, 0xfffffff393c00000)
eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 0xfffffff386f40000)
from space 104832K, 32% used [0xfffffff38d5a0000, 0xfffffff38f744468, 0xfffffff393c00000)
to space 104832K, 0% used [0xfffffff386f40000, 0xfffffff386f40000, 0xfffffff38d5a0000)
concurrent mark-sweep generation total 49283072K, used 19148140K [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
concurrent-mark-sweep perm gen total 524288K, used 44308K [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
}
Total time for which application threads were stopped: 0.1370098 seconds
Application time: 53.6273550 seconds
Total time for which application threads were stopped: 0.0429426 seconds
Application time: 0.0002318 seconds
Total time for which application threads were stopped: 0.0044294 seconds
Application time: 0.0002250 seconds
Total time for which application threads were stopped: 0.0016478 seconds
Application time: 59.0926108 seconds
Total time for which application threads were stopped: 0.0431387 seconds
Application time: 0.0002193 seconds
Total time for which application threads were stopped: 0.0020966 seconds
Application time: 0.0000956 seconds
Total time for which application threads were stopped: 0.0016358 seconds
Application time: 60.1048190 seconds
Total time for which application threads were stopped: 0.0481582 seconds
Application time: 0.0002207 seconds
Total time for which application threads were stopped: 0.0067752 seconds
Application time: 0.0001073 seconds
Total time for which application threads were stopped: 0.0016387 seconds
Application time: 60.7453974 seconds
Total time for which application threads were stopped: 0.0425995 seconds
Application time: 0.0002457 seconds
Total time for which application threads were stopped: 0.0019724 seconds
Application time: 0.0001005 seconds
Total time for which application threads were stopped: 0.0016210 seconds
Application time: 59.0845530 seconds
Total time for which application threads were stopped: 0.0424095 seconds
Application time: 0.0002314 seconds
Total time for which application threads were stopped: 0.0020107 seconds
Application time: 0.0000959 seconds
Total time for which application threads were stopped: 0.0015940 seconds
Application time: 60.7994458 seconds
Total time for which application threads were stopped: 0.0428210 seconds
Application time: 0.0002210 seconds
Total time for which application threads were stopped: 0.0020541 seconds
Application time: 0.0000974 seconds
Total time for which application threads were stopped: 0.0016126 seconds
Application time: 59.0963098 seconds
Total time for which application threads were stopped: 0.0592795 seconds
Application time: 0.0002622 seconds
Total time for which application threads were stopped: 0.0023229 seconds
Application time: 0.0000926 seconds
Total time for which application threads were stopped: 0.0016296 seconds
Application time: 60.1021141 seconds
Total time for which application threads were stopped: 0.0443986 seconds
Application time: 0.0002462 seconds
Total time for which application threads were stopped: 0.0021135 seconds
Application time: 0.0001076 seconds
Total time for which application threads were stopped: 0.0016165 seconds
Application time: 60.0324234 seconds
Total time for which application threads were stopped: 0.0437486 seconds
Application time: 0.0002286 seconds
Total time for which application threads were stopped: 0.0021017 seconds
Application time: 0.0001073 seconds
Total time for which application threads were stopped: 0.0016570 seconds
Application time: 60.4613330 seconds
Total time for which application threads were stopped: 0.0490276 seconds
Application time: 0.0002947 seconds
Total time for which application threads were stopped: 0.0024618 seconds
Application time: 0.0001238 seconds
Total time for which application threads were stopped: 0.0019863 seconds
Application time: 59.8201422 seconds
Total time for which application threads were stopped: 0.0455540 seconds
Application time: 0.0003668 seconds
Total time for which application threads were stopped: 0.0020906 seconds
Application time: 0.0001126 seconds
Total time for which application threads were stopped: 0.0016693 seconds
Application time: 60.0721521 seconds
Total time for which application threads were stopped: 0.0438111 seconds
Application time: 0.0002660 seconds
Total time for which application threads were stopped: 0.0019814 seconds
Application time: 0.0001018 seconds
Total time for which application threads were stopped: 0.0017817 seconds
Application time: 60.0825886 seconds
Total time for which application threads were stopped: 0.0440386 seconds
Application time: 0.0002197 seconds
Total time for which application threads were stopped: 0.0020655 seconds
Application time: 0.0001093 seconds
Total time for which application threads were stopped: 0.0016122 seconds
Application time: 59.6628580 seconds
Total time for which application threads were stopped: 0.0425082 seconds
Application time: 0.0002121 seconds
Total time for which application threads were stopped: 0.0020967 seconds
Application time: 0.0000935 seconds
Total time for which application threads were stopped: 0.0015909 seconds
Application time: 60.1951548 seconds
Total time for which application threads were stopped: 0.0432125 seconds
Application time: 0.0002274 seconds
Total time for which application threads were stopped: 0.0020316 seconds
Application time: 0.0001062 seconds
Total time for which application threads were stopped: 0.0016534 seconds
Application time: 59.5329171 seconds
Total time for which application threads were stopped: 20.6893643 seconds
Application time: 0.0002839 seconds
Total time for which application threads were stopped: 0.0076240 seconds
Application time: 0.0002137 seconds
Total time for which application threads were stopped: 0.0019918 seconds
Application time: 39.4376656 seconds
Total time for which application threads were stopped: 0.0612671 seconds
Application time: 0.0002478 seconds

Any idea ?


Thanks in advance for your help
--
David Tavoularis






[Annex]
Complete GC log file gc_201108232207.log.gz available here: http://dl.free.fr/gxrxlLsVS

JVM command line extract :
/usr/jdk/instances/jdk1.6.0/jre/bin/sparcv9/java -Dsun.rmi.dgc.checkInterval=2000 -server -Xms49152m -Xmx49152m -XX:PermSize=512m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=40 -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCDateStamps -Xloggc:/logs/gc_201108232207.log -XX:+UseCompressedOops -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/heapdump

$ /usr/jdk/instances/jdk1.6.0/jre/bin/sparcv9/java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

$ /usr/sbin/prtdiag | head -3
System Configuration: Sun Microsystems sun4u Sun Fire E6900
System clock frequency: 150 MHz
Memory size: 143360 Megabytes

$ mpstat | wc -l
49

$ uname -a
SunOS XXX 5.9 Generic_122300-05 sun4u sparc SUNW,Sun-Fire

For your information, Full GC automatically triggered at 2:30am :
$ grep Full gc_201108232207.log
2011-08-24T02:30:02.475+0100: 15737.603: [Full GC 15737.604: [CMS: 11972490K->5028118K(49283072K), 137.9859661 secs] 12141664K->5028118K(50226816K), [CMS Perm : 39558K->39491K(524288K)], 137.9867010 secs] [Times: user=133.02 sys=4.89, real=137.99 secs]
2011-08-25T02:30:05.142+0100: 102139.150: [Full GC 102139.150: [CMS: 18724122K->11970549K(49283072K), 433.4189517 secs] 18976948K->11970549K(50226816K), [CMS Perm : 44256K->42995K(524288K)], 433.4350620 secs] [Times: user=429.00 sys=3.89, real=433.44 secs]
2011-08-26T02:30:05.125+0100: 188538.009: [Full GC 188538.009: [CMS: 15865994K->12528867K(49283072K), 477.0168566 secs] 16343213K->12528867K(50226816K), [CMS Perm : 44324K->43408K(524288K)], 477.0175358 secs] [Times: user=476.76 sys=0.05, real=477.02 secs]
2011-08-27T02:30:03.084+0100: 274934.847: [Full GC 274934.849: [CMS: 14857264K->8811922K(49283072K), 312.4786042 secs] 15546860K->8811922K(50226816K), [CMS Perm : 44557K->43762K(524288K)], 312.4796506 secs] [Times: user=312.38 sys=0.11, real=312.48 secs]
2011-08-28T02:30:04.129+0100: 361334.770: [Full GC 361334.777: [CMS: 16479144K->5767617K(49283072K), 161.5857103 secs] 17318705K->5767617K(50226816K), [CMS Perm : 44127K->43481K(524288K)], 161.5863909 secs] [Times: user=161.21 sys=0.02, real=161.59 secs]
2011-08-29T02:30:03.316+0100: 447732.838: [Full GC 447732.838: [CMS: 13471208K->6989798K(49283072K), 173.7255263 secs] 13700543K->6989798K(50226816K), [CMS Perm : 43709K->43433K(524288K)], 173.7260186 secs] [Times: user=173.48 sys=0.01, real=173.73 secs]


________________________________

This electronic message contains information from Mycom which may be privileged or confidential. The information is intended to be for the use of the individual(s) or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution or any other use of the contents of this information is prohibited. If you have received this electronic message in error, please notify us by post or telephone (to the numbers or correspondence address above) or by email (at the email address above) immediately.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20110829/1b93fd8b/attachment-0001.html 


More information about the hotspot-gc-use mailing list