Long "stop-the-world" pauses in CMS GC mode

Ramki Ramakrishna y.s.ramakrishna at oracle.com
Tue Aug 30 06:53:05 UTC 2011


Hi David -- you have a 48 core machine but you have handicapped yr JVM 
by forcing -XX:-CMSParallelRemarkEnabled,
which forces single--threaded remarks on such a huge heap. Please remove 
that flag so all 48 cores can be
used for the remark pause (and it should disappear from your pause-time 
radar entirely i think). If "ref proc"
turns out to be still a problem, you can then enable 
+ParallelRefProcEnabled to parallelize that sub-phase as well.

As to the 20 s pause in the middle of nowhere, I am clueless, but  
switch on -XX:+PrintSfaepointStatistics
to see what that long pause corresponds to. Perhaps some kind of bulk 
bias revocation perhaps, I am
not sure...

-- ramki


On 8/29/2011 2:25 AM, David Tavoularis wrote:
> Hi,
>
> I am trying to understand the cause of "stop-the-world" pauses in my 
> application using CMS GC and a large heap (48GB).
> The production server SF6900 (24 x dual-core UltraSparc-IV 1.35GHz, 48 
> working threads, 140GB RAM) is running on Solaris 9 and Java6u25.
>
> I know that there are several possible causes :
> 1) OldGen fragmentation : to avoid it, I implemented an automatic 
> FullGC in crontab at 2:30am
> 30 2 * * * /usr/jdk/instances/jdk1.6.0/bin/jmap -d64 -histo:live 
> `/usr/bin/pgrep -f "XXXXXXX"` 2>&1 >/dev/null
>
> 2) Weak refs processing : a workaround (not tried yet) is to use 
> -XX:+ParallelRefProcEnabled, as described in the following articles :
> http://blogs.oracle.com/jonthecollector/entry/top_10_gc_reasons
> http://stackoverflow.com/questions/4101540/how-can-i-lower-the-weak-ref-processing-time-during-gc
> I have found out that it could be triggered by the daily unreferencing 
> of a big object containing millions of small objects (using weak 
> references).
>
>
> The application has been running for almost a week and I can see some 
> "stop-the-world" pauses longer than 10 seconds :
> *$ egrep "Total time for which application threads were stopped: 
> [0-9][0-9]\." gc_201108232207.log*
> Total time for which application threads were stopped: *10*.8630158 
> seconds *<- due to weak refs*
> Total time for which application threads were stopped: *18*.5259611 
> seconds
> Total time for which application threads were stopped: *10*.0777809 
> seconds *<- due to weak refs*
> Total time for which application threads were stopped: *61*.5576519 
> seconds
> Total time for which application threads were stopped: *19*.0205127 
> seconds
> Total time for which application threads were stopped: *20*.6893643 
> seconds
> Total time for which application threads were stopped: *16*.0048075 
> seconds
> Total time for which application threads were stopped: *12*.3665083 
> seconds *<- due to weak refs*
> Total time for which application threads were stopped: *11*.5213443 
> seconds *<- due to weak refs*
> Total time for which application threads were stopped: *37*.1018520 
> seconds *<- due to weak refs*
> Total time for which application threads were stopped: *16*.3988783 
> seconds *<- due to weak refs*
> Total time for which application threads were stopped: *12*.4057546 
> seconds
>
> 6 of them have unknown explanation for me.
>
> For your information, here are the 6 "weak refs" log messages :
> $ egrep "weak refs processing, [1-9][0-9]?" gc_201108232207.log | more
> 2011-08-24T10:13:49.641+0100: 43564.409: [GC[YG occupancy: 342791 K 
> (943744 K)]43564.410: [Rescan (non-parallel) 43564.410: [grey object 
> rescan, 0.7358794 secs]43565.146: [root rescan, 1.9033345 secs], 
> 2.6398211 secs]43567.049: [weak refs processing, 8.2148555 secs] [1 
> CMS-remark: 26914465K(49283072K)] 27257257K(50226816K), 10.8566498 
> secs] *[Times: user=10.85 sys=0.00, real=10.86 secs]*
> 2011-08-25T12:33:22.658+0100: 138336.194: [GC[YG occupancy: 179985 K 
> (943744 K)]138336.195: [Rescan (non-parallel) 138336.195: [grey object 
> rescan, 0.5969886 secs]138336.792: [root rescan, 0.5114118 secs], 
> 1.1089811 secs]138337.304: [weak refs processing, 8.8414246 secs] [1 
> CMS-remark: 20122279K(49283072K)] 20302264K(5226816K), 9.9514563 secs] 
> *[Times: user=9.94 sys=0.01, real=9.95 secs]*
> 2011-08-26T07:22:55.233+0100: 206107.887: [GC[YG occupancy: 177014 K 
> (943744 K)]206107.888: [Rescan (non-parallel) 206107.888: [grey object 
> rescan, 0.4472730 secs]206108.335: [root rescan, 1.5575365 secs], 
> 2.0053337 secs]206109.893: [weak refs processing, 10.3436973 secs] [1 
> CMS-remark: 19861286K(49283072K)] 20038301K(50226816K), 12.3572481 
> secs] *[Times: user=12.22 sys=0.00, real=12.36 secs]*
> 2011-08-26T07:51:55.531+0100: 207848.163: [GC[YG occupancy: 423184 K 
> (943744 K)]207848.163: [Rescan (non-parallel) 207848.163: [grey object 
> rescan, 0.4466552 secs]207848.610: [root rescan, 3.4207362 secs], 
> 3.8680060 secs]207852.031: [weak refs processing, 7.6403893 secs] [1 
> CMS-remark: 19714349K(49283072K)] 20137533K(50226816K), 11.5130922 
> secs] *[Times: user=11.51 sys=0.00, real=11.51 secs]*
> 2011-08-27T15:18:48.928+0100: 321060.091: [GC[YG occupancy: 711567 K 
> (943744 K)]321060.092: [Rescan (non-parallel) 321060.092: [grey object 
> rescan, 0.4628955 secs]321060.555: [root rescan, 3.2087381 secs], 
> 3.6721710 secs]321063.764: [weak refs processing, 33.3995481 secs] [1 
> CMS-remark: 19918243K(49283072K)] 20629810K(50226816K), 37.0910804 
> secs] *[Times: user=37.04 sys=0.00, real=37.09 secs]*
> 2011-08-28T11:17:12.144+0100: 392962.378: [GC[YG occupancy: 811576 K 
> (943744 K)]392962.378: [Rescan (non-parallel) 392962.378: [grey object 
> rescan, 0.4140054 secs]392962.793: [root rescan, 4.4323136 secs], 
> 4.8469694 secs]392967.225: [weak refs processing, 11.5384812 secs] [1 
> CMS-remark: 19819290K(49283072K)] 20630867K(50226816K), 16.3885374 
> secs] *[Times: user=16.35 sys=0.01, real=16.39 secs]*
>
>
>
>
>
>
> *1. Here is the first pattern : a _61-second pause_, but I don't see 
> any suspicious message in GC logs:*
> 2011-08-24T10:24:25.748+0100: 44200.509: [GC 44200.511: [ParNew
> Desired survivor size 53673984 bytes, new threshold 1 (max 4)
> - age 1: 101879520 bytes, 101879520 total
> : 933589K->104832K(943744K), 0.3947382 secs] 
> 21369469K->20703994K(50226816K), 0.3966779 secs] [Times: user=6.43 
> sys=0.04, real=0.40 secs]
> Heap after GC invocations=1187 (full 12):
> par new generation total 943744K, used 104832K [0xfffffff353c00000, 
> 0xfffffff393c00000, 0xfffffff393c00000)
> eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 
> 0xfffffff386f40000)
> from space 104832K, 100% used [0xfffffff386f40000, 0xfffffff38d5a0000, 
> 0xfffffff38d5a0000)
> to space 104832K, 0% used [0xfffffff38d5a0000, 0xfffffff38d5a0000, 
> 0xfffffff393c00000)
> concurrent mark-sweep generation total 49283072K, used 20599162K 
> [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
> concurrent-mark-sweep perm gen total 524288K, used 42905K 
> [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
> }
> Total time for which application threads were stopped: 0.4110458 seconds
> Application time: 39.5906692 seconds
> {Heap before GC invocations=1187 (full 12):
> par new generation total 943744K, used 943744K [0xfffffff353c00000, 
> 0xfffffff393c00000, 0xfffffff393c00000)
> eden space 838912K, 100% used [0xfffffff353c00000, 0xfffffff386f40000, 
> 0xfffffff386f40000)
> from space 104832K, 100% used [0xfffffff386f40000, 0xfffffff38d5a0000, 
> 0xfffffff38d5a0000)
> to space 104832K, 0% used [0xfffffff38d5a0000, 0xfffffff38d5a0000, 
> 0xfffffff393c00000)
> concurrent mark-sweep generation total 49283072K, used 20599162K 
> [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
> concurrent-mark-sweep perm gen total 524288K, used 42905K 
> [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
> 2011-08-24T10:25:07.776+0100: 44242.537: [GC 44301.853: [ParNew
> Desired survivor size 53673984 bytes, new threshold 1 (max 4)
> - age 1: 99505080 bytes, 99505080 total
> : 943744K->104832K(943744K), 0.2010508 secs] 
> 21542906K->20852742K(50226816K), 0.2022636 secs] *[Times: user=5.67 
> sys=0.02, real=59.52 secs]*
> Heap after GC invocations=1188 (full 12):
> par new generation total 943744K, used 104832K [0xfffffff353c00000, 
> 0xfffffff393c00000, 0xfffffff393c00000)
> eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 
> 0xfffffff386f40000)
> from space 104832K, 100% used [0xfffffff38d5a0000, 0xfffffff393c00000, 
> 0xfffffff393c00000)
> to space 104832K, 0% used [0xfffffff386f40000, 0xfffffff386f40000, 
> 0xfffffff38d5a0000)
> concurrent mark-sweep generation total 49283072K, used 20747910K 
> [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
> concurrent-mark-sweep perm gen total 524288K, used 42905K 
> [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
> }
> *Total time for which application threads were stopped: 61.5576519 
> seconds*
> Application time: 0.0245838 seconds
> Total time for which application threads were stopped: 9.8331189 seconds
> Application time: 0.0012626 seconds
> Total time for which application threads were stopped: 0.0090404 seconds
> Application time: 0.0008943 seconds
> Total time for which application threads were stopped: 0.0020415 seconds
> Application time: 0.0008181 seconds
> Total time for which application threads were stopped: 0.2338605 seconds
> Application time: 0.0018822 seconds
>
> The only suspicious thing is "[Times: user=5.67 sys=0.02, real=59.52 
> secs]", which means that the "real" duration is a lot higher than 
> "user" CPU time.
> Because "sys" duration is low, it also means that the server is not 
> swapping.
> What could explain this 61 seconds pause ?
>
>
>
> *2. Here is the second pattern : a 20-second pause, in the middle of 
> nowhere in GC logs :*
> {Heap before GC invocations=11132 (full 166):
> par new generation total 943744K, used 882686K [0xfffffff353c00000, 
> 0xfffffff393c00000, 0xfffffff393c00000)
> eden space 838912K, 100% used [0xfffffff353c00000, 0xfffffff386f40000, 
> 0xfffffff386f40000)
> from space 104832K, 41% used [0xfffffff386f40000, 0xfffffff3899ffa48, 
> 0xfffffff38d5a0000)
> to space 104832K, 0% used [0xfffffff38d5a0000, 0xfffffff38d5a0000, 
> 0xfffffff393c00000)
> concurrent mark-sweep generation total 49283072K, used 19148140K 
> [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
> concurrent-mark-sweep perm gen total 524288K, used 44308K 
> [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
> 2011-08-25T20:07:07.235+0100: 165560.417: [GC 165560.417: [ParNew
> Desired survivor size 53673984 bytes, new threshold 4 (max 4)
> - age 1: 26189384 bytes, 26189384 total
> - age 2: 1713728 bytes, 27903112 total
> : 882686K->34449K(943744K), 0.1280202 secs] 
> 20030826K->19182589K(50226816K), 0.1285927 secs] [Times: user=3.94 
> sys=0.01, real=0.13 secs]
> Heap after GC invocations=11133 (full 166):
> par new generation total 943744K, used 34449K [0xfffffff353c00000, 
> 0xfffffff393c00000, 0xfffffff393c00000)
> eden space 838912K, 0% used [0xfffffff353c00000, 0xfffffff353c00000, 
> 0xfffffff386f40000)
> from space 104832K, 32% used [0xfffffff38d5a0000, 0xfffffff38f744468, 
> 0xfffffff393c00000)
> to space 104832K, 0% used [0xfffffff386f40000, 0xfffffff386f40000, 
> 0xfffffff38d5a0000)
> concurrent mark-sweep generation total 49283072K, used 19148140K 
> [0xfffffff393c00000, 0xffffffff53c00000, 0xffffffff53c00000)
> concurrent-mark-sweep perm gen total 524288K, used 44308K 
> [0xffffffff53c00000, 0xffffffff73c00000, 0xffffffff73c00000)
> }
> Total time for which application threads were stopped: 0.1370098 seconds
> Application time: 53.6273550 seconds
> Total time for which application threads were stopped: 0.0429426 seconds
> Application time: 0.0002318 seconds
> Total time for which application threads were stopped: 0.0044294 seconds
> Application time: 0.0002250 seconds
> Total time for which application threads were stopped: 0.0016478 seconds
> Application time: 59.0926108 seconds
> Total time for which application threads were stopped: 0.0431387 seconds
> Application time: 0.0002193 seconds
> Total time for which application threads were stopped: 0.0020966 seconds
> Application time: 0.0000956 seconds
> Total time for which application threads were stopped: 0.0016358 seconds
> Application time: 60.1048190 seconds
> Total time for which application threads were stopped: 0.0481582 seconds
> Application time: 0.0002207 seconds
> Total time for which application threads were stopped: 0.0067752 seconds
> Application time: 0.0001073 seconds
> Total time for which application threads were stopped: 0.0016387 seconds
> Application time: 60.7453974 seconds
> Total time for which application threads were stopped: 0.0425995 seconds
> Application time: 0.0002457 seconds
> Total time for which application threads were stopped: 0.0019724 seconds
> Application time: 0.0001005 seconds
> Total time for which application threads were stopped: 0.0016210 seconds
> Application time: 59.0845530 seconds
> Total time for which application threads were stopped: 0.0424095 seconds
> Application time: 0.0002314 seconds
> Total time for which application threads were stopped: 0.0020107 seconds
> Application time: 0.0000959 seconds
> Total time for which application threads were stopped: 0.0015940 seconds
> Application time: 60.7994458 seconds
> Total time for which application threads were stopped: 0.0428210 seconds
> Application time: 0.0002210 seconds
> Total time for which application threads were stopped: 0.0020541 seconds
> Application time: 0.0000974 seconds
> Total time for which application threads were stopped: 0.0016126 seconds
> Application time: 59.0963098 seconds
> Total time for which application threads were stopped: 0.0592795 seconds
> Application time: 0.0002622 seconds
> Total time for which application threads were stopped: 0.0023229 seconds
> Application time: 0.0000926 seconds
> Total time for which application threads were stopped: 0.0016296 seconds
> Application time: 60.1021141 seconds
> Total time for which application threads were stopped: 0.0443986 seconds
> Application time: 0.0002462 seconds
> Total time for which application threads were stopped: 0.0021135 seconds
> Application time: 0.0001076 seconds
> Total time for which application threads were stopped: 0.0016165 seconds
> Application time: 60.0324234 seconds
> Total time for which application threads were stopped: 0.0437486 seconds
> Application time: 0.0002286 seconds
> Total time for which application threads were stopped: 0.0021017 seconds
> Application time: 0.0001073 seconds
> Total time for which application threads were stopped: 0.0016570 seconds
> Application time: 60.4613330 seconds
> Total time for which application threads were stopped: 0.0490276 seconds
> Application time: 0.0002947 seconds
> Total time for which application threads were stopped: 0.0024618 seconds
> Application time: 0.0001238 seconds
> Total time for which application threads were stopped: 0.0019863 seconds
> Application time: 59.8201422 seconds
> Total time for which application threads were stopped: 0.0455540 seconds
> Application time: 0.0003668 seconds
> Total time for which application threads were stopped: 0.0020906 seconds
> Application time: 0.0001126 seconds
> Total time for which application threads were stopped: 0.0016693 seconds
> Application time: 60.0721521 seconds
> Total time for which application threads were stopped: 0.0438111 seconds
> Application time: 0.0002660 seconds
> Total time for which application threads were stopped: 0.0019814 seconds
> Application time: 0.0001018 seconds
> Total time for which application threads were stopped: 0.0017817 seconds
> Application time: 60.0825886 seconds
> Total time for which application threads were stopped: 0.0440386 seconds
> Application time: 0.0002197 seconds
> Total time for which application threads were stopped: 0.0020655 seconds
> Application time: 0.0001093 seconds
> Total time for which application threads were stopped: 0.0016122 seconds
> Application time: 59.6628580 seconds
> Total time for which application threads were stopped: 0.0425082 seconds
> Application time: 0.0002121 seconds
> Total time for which application threads were stopped: 0.0020967 seconds
> Application time: 0.0000935 seconds
> Total time for which application threads were stopped: 0.0015909 seconds
> Application time: 60.1951548 seconds
> Total time for which application threads were stopped: 0.0432125 seconds
> Application time: 0.0002274 seconds
> Total time for which application threads were stopped: 0.0020316 seconds
> Application time: 0.0001062 seconds
> Total time for which application threads were stopped: 0.0016534 seconds
> Application time: 59.5329171 seconds
> *Total time for which application threads were stopped: 20.6893643 
> seconds*
> Application time: 0.0002839 seconds
> Total time for which application threads were stopped: 0.0076240 seconds
> Application time: 0.0002137 seconds
> Total time for which application threads were stopped: 0.0019918 seconds
> Application time: 39.4376656 seconds
> Total time for which application threads were stopped: 0.0612671 seconds
> Application time: 0.0002478 seconds
>
> Any idea ?
>
>
> Thanks in advance for your help
> -- 
> David Tavoularis
>
>
>
>
>
>
> [Annex]
> Complete GC log file gc_201108232207.log.gz available here: 
> http://dl.free.fr/gxrxlLsVS
>
> JVM command line extract :
> /usr/jdk/instances/jdk1.6.0/jre/bin/sparcv9/java 
> -Dsun.rmi.dgc.checkInterval=2000 -server -Xms49152m -Xmx49152m 
> -XX:PermSize=512m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
> -XX:+DisableExplicitGC -XX:-CMSParallelRemarkEnabled 
> -XX:CMSInitiatingOccupancyFraction=40 -XX:NewSize=1024m 
> -XX:MaxNewSize=1024m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution 
> -XX:+PrintGCApplicationStoppedTime 
> -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCDateStamps 
> -Xloggc:/logs/gc_201108232207.log -XX:+UseCompressedOops 
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/heapdump
>
> *$ /usr/jdk/instances/jdk1.6.0/jre/bin/sparcv9/java -version*
> java version "1.6.0_25"
> Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
>
> *$ /usr/sbin/prtdiag | head -3*
> System Configuration: Sun Microsystems sun4u Sun Fire E6900
> System clock frequency: 150 MHz
> Memory size: 143360 Megabytes
>
> *$ mpstat | wc -l*
> 49
>
> *$ uname -a*
> SunOS XXX 5.9 Generic_122300-05 sun4u sparc SUNW,Sun-Fire
>
> For your information, Full GC automatically triggered at 2:30am :
> *$ grep Full gc_201108232207.log*
> 2011-08-24T02:30:02.475+0100: 15737.603: [Full GC 15737.604: [CMS: 
> 11972490K->5028118K(49283072K), 137.9859661 secs] 
> 12141664K->5028118K(50226816K), [CMS Perm : 39558K->39491K(524288K)], 
> 137.9867010 secs] [Times: user=133.02 sys=4.89, real=137.99 secs]
> 2011-08-25T02:30:05.142+0100: 102139.150: [Full GC 102139.150: [CMS: 
> 18724122K->11970549K(49283072K), 433.4189517 secs] 
> 18976948K->11970549K(50226816K), [CMS Perm : 44256K->42995K(524288K)], 
> 433.4350620 secs] [Times: user=429.00 sys=3.89, real=433.44 secs]
> 2011-08-26T02:30:05.125+0100: 188538.009: [Full GC 188538.009: [CMS: 
> 15865994K->12528867K(49283072K), 477.0168566 secs] 
> 16343213K->12528867K(50226816K), [CMS Perm : 44324K->43408K(524288K)], 
> 477.0175358 secs] [Times: user=476.76 sys=0.05, real=477.02 secs]
> 2011-08-27T02:30:03.084+0100: 274934.847: [Full GC 274934.849: [CMS: 
> 14857264K->8811922K(49283072K), 312.4786042 secs] 
> 15546860K->8811922K(50226816K), [CMS Perm : 44557K->43762K(524288K)], 
> 312.4796506 secs] [Times: user=312.38 sys=0.11, real=312.48 secs]
> 2011-08-28T02:30:04.129+0100: 361334.770: [Full GC 361334.777: [CMS: 
> 16479144K->5767617K(49283072K), 161.5857103 secs] 
> 17318705K->5767617K(50226816K), [CMS Perm : 44127K->43481K(524288K)], 
> 161.5863909 secs] [Times: user=161.21 sys=0.02, real=161.59 secs]
> 2011-08-29T02:30:03.316+0100: 447732.838: [Full GC 447732.838: [CMS: 
> 13471208K->6989798K(49283072K), 173.7255263 secs] 
> 13700543K->6989798K(50226816K), [CMS Perm : 43709K->43433K(524288K)], 
> 173.7260186 secs] [Times: user=173.48 sys=0.01, real=173.73 secs]
>
>
> ------------------------------------------------------------------------
>
> This electronic message contains information from Mycom which may be 
> privileged or confidential. The information is intended to be for the 
> use of the individual(s) or entity named above. If you are not the 
> intended recipient, be aware that any disclosure, copying, 
> distribution or any other use of the contents of this information is 
> prohibited. If you have received this electronic message in error, 
> please notify us by post or telephone (to the numbers or 
> correspondence address above) or by email (at the email address above) 
> immediately.
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20110829/88ca702d/attachment.htm>
-------------- next part --------------
_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use


More information about the hotspot-gc-dev mailing list