long unaccounted pause during concurrent mark phase of ParNew cycle

Fri Feb 5 19:14:19 UTC 2010

atulksh wrote:
> Hi Ramki,
> I will give as try with that option as you have asked. If I disable the
> Concurrent Multi Threading in CMS with this flag is it going to not
> multi-thread any phase in general?

No, that flag affects just the MT-ness of the concurrent phases.

> 
> We have been using -XX:-CMSParallelRemarkEnabled  in our production
> environment for a long time. After we had upgraded to Update 2 we had seen
> crashes which were attributed to the remark phase. Using this flag had
> stopped those crashes. This issue was eventually fixed in Update 4 but after
> we had tried Update 4 we got more frequent crashes and since we had never
> tried to update our VM.

I believe you had run into 6722113 and cousins, which were fixed in 6u14.
Certainly in 6u18.

> 
> Now that we have been seeing these pauses, we have been trying to upgrade to
> Update 17. What we see is our pause behavior did not change a bit from
> Update 2 to Update 17, so it looks like that is something inherent and has
> not changed.

I see. Yes, it does look like behaviour that has not been reported before
and which we were not independently aware of.

> 
> Did you checkout my last response where I have given more details about our
> pause? It looks like our pause is coinciding with the perm space scanning
> occurring at the end of the concurrent mark phase. Any insights as to what
> normally goes on in the GC at that time (is that STW phase) and if yes what
> can we do about it?

See my private response to you. No this is not a STW pause.
Something is preventing a foreground collection from happening
perhaps because of the CMS activity locking out a foreground
collection. Sans space in the young gen to allocate, the application
appears to stall.

> 
> I am happy to provide you with any further information you may want. Your
> help and insights greatly appreciated.

Yes, thanks. As I noted earlier: (1) a gc log was not attached with yr
original report (you said one was, but you may have perhaps forgotten to
attach it or the openjdk mailing list may have stripped it.
Send it to me and/or Jon directly. Also the other info
I requested in that email.

Thanks.
-- ramki

> 
> Thank you very much.
> 
> Atul
> 
> y.s.ramakrishna wrote:
>> Hello Atul --
>>
>> Could you add -XX:-CMSConcurrentMTEnabled to the mix
>> and see if it affects the behaviour in any manner?
>>
>> PS: I notice that you are using -XX:-CMSParallelRemarkEnabled.
>> What was the reason for using that flag?
>>
>> thanks.
>> -- ramki
>>
>> atulksh wrote:
>>> Hi Jon,
>>> I have some more (probably more precise) information for you. I have been
>>> able to reproduce this issue in our test environment (up till now we were
>>> only seeing it in our production setup). Along with the parameters that I
>>> have specified above I added the following.
>>>
>>>
>>>
>>> After loading up the app server (it is Tomcat 6.0.14) I got a pause
>>> during
>>> the CMS cycle of about 32 seconds. The way I noticed the pause is like
>>> this.
>>> I kept on running a script that does some basic page refresh (small data
>>> page) in the application continously. This is like a ping to the server.
>>> Then I parallely do operations that load up the memory in the application
>>> server. These are the real hits that users do in whom memory requirement
>>> is
>>> usually high and that drives the app server to the CMS threshold.
>>>
>>> In my case once the CMS kicked in, I saw that the period where the
>>> applilcation was in the "pause" no requests of my script were processed.
>>> They were just queued up to Tomcat - meaning Apache got those requests
>>> and
>>> logged that they were sent to Tomcat. But Tomcat did not register of
>>> receiving them. Then after the app went out of the "pause" then
>>> everything
>>> came back and all these pending requests got processed - like the app
>>> just
>>> had been unfrozen.
>>>
>>> During this freeze, what I see in the GC log is that the CMS cycle is
>>> almost
>>> at the end of the concurrent mark phase. It is doing the GC is doing the
>>> "perm space scanning". The pause exactly coincides with the time reported
>>> for the "perm space scanning". I am attaching the full gc log snippt
>>> right
>>> from the start of the CMS cycle. Below is just the place where the pause
>>> appears.
>>>
>>> 28676.784: [GC 28676.784: [ParNew
>>> Desired survivor size 4358144 bytes, new threshold 2 (max 4)
>>> - age   1:    3742608 bytes,    3742608 total
>>> - age   2:    3451152 bytes,    7193760 total
>>> : 76672K->8512K(76672K), 0.2045430 secs] 20262376K->20197311K(27254464K),
>>> 0.2048030 secs] [Times: user=0.77 sys=0.01, real=0.21 secs]
>>> Total time for which application threads were stopped: 0.2055080 seconds
>>> Finished perm space scanning in 2th thread: 0.367 sec
>>> Finished perm space scanning in 0th thread: 0.362 sec
>>> Application time: 0.1481860 seconds
>>> 28708.626: [GC 28708.627: [ParNew
>>> Desired survivor size 4358144 bytes, new threshold 2 (max 4)
>>> - age   1:    2079760 bytes,    2079760 total
>>> - age   2:    3657248 bytes,    5737008 total
>>> : 76672K->6795K(76672K), 0.1937080 secs] 20265471K->20198965K(27254464K),
>>> 0.1939480 secs] [Times: user=0.74 sys=0.00, real=0.20 secs]
>>> Total time for which application threads were stopped: 0.1948820 seconds
>>> Finished perm space scanning in 3th thread: 32.153 sec
>>> Finished perm space scanning in 1th thread: 32.097 sec
>>> Finished work stealing in 1th thread: 0.000 sec
>>> Finished work stealing in 3th thread: 0.002 sec
>>> Finished work stealing in 2th thread: 31.791 sec
>>> Finished work stealing in 0th thread: 31.791 sec
>>> 28708.825: [CMS-concurrent-mark: 88.906/158.952 secs]
>>> (CMS-concurrent-mark
>>> yielded 454 times)
>>>  [Times: user=544.33 sys=4.59, real=158.95 secs]
>>>
>>> What is "perm space scanning" and is this where the gc would freeze the
>>> application (STW type). Is this a bug or something that is expected?
>>>
>>> Please let me know. This has become quite critical for our application.
>>> Please see the full gc log file attached. Your help very much
>>> appreciated.
>>>
>>> Thank you,
>>>
>>> Atul
>>>
>>>
>>> Jon Masamitsu wrote:
>>>> Atul,
>>>>
>>>> Do you mean the 36 secs here?
>>>>
>>>>  > Thu Feb  4 10:13:35 2010: [CMS-concurrent-mark: 36.272/36.715 secs] 
>>>> [Times:
>>>>  > user=56.34 sys=0.34, real=36.71 secs]
>>>>
>>>> That is a concurrent phase (i.e., the application threads
>>>> are still running although they have to share the hardware
>>>> threads with the concurrent GC threads).
>>>>
>>>> Jon
>>>>
>>>> On 02/04/10 12:37, atulksh wrote:
>>>>> Hi,
>>>>> We have an application that is freezing (pausing) for a very long (tens
>>>>> of
>>>>> seconds) time during the concurrent mark phase of the ParNew cycle.
>>>>> This
>>>>> pause is consistently seen when there is good amount of activity on the
>>>>> server. We tried to upgrade the JVM (JRE 6) from Update 02 to Update 17
>>>>> keeping all the parameters same, but the pauses did not go away (they
>>>>> became
>>>>> more consistent and slightly longer). We even tried to enable
>>>>> CompressedOops
>>>>> option, thinking that that would reduce memory requirements, but that
>>>>> has
>>>>> not helped.
>>>>>
>>>>> We are running on 64bit SUSE 10.1 platform with 32Gb of RAM and 8 core
>>>>> machine. 26g is allocated to the java process. The full set of VM
>>>>> parameters
>>>>> are here
>>>>>
>>>>>
>>>>> -Xmx26g
>>>>> -Xms26g
>>>>> -server 
>>>>> -XX:+UseConcMarkSweepGC
>>>>> -XX:PermSize=256m
>>>>> -XX:MaxPermSize=256m
>>>>> -XX:+PrintTenuringDistribution
>>>>> -XX:+PrintGCDetails
>>>>> -XX:+PrintGCTimeStamps
>>>>> -XX:CMSInitiatingOccupancyFraction=70
>>>>> -XX:-CMSParallelRemarkEnabled
>>>>>
>>>>> The nature of the pause is like this. All the application threads that
>>>>> are
>>>>> being processed seem to stop responding. New requests that come to our
>>>>> app
>>>>> server are not processed and get queued. When the pause ends then
>>>>> everything
>>>>> frees up and seems to go on normally henceforth. In the GC you see a
>>>>> gap
>>>>> between the ParNews. The following are two examples of the pauses.
>>>>> Notice
>>>>> the pause from 10:13:06 to 10:13:35 (29 seconds) in the first one and
>>>>> 11:29:04 to 11:29:17 (13 seconds) in the second one. These pauses only
>>>>> occur
>>>>> in the concurrent mark phase. They do not seem to occur at any other
>>>>> time
>>>>> or
>>>>> anywhere else in the CMS cycle.
>>>>>
>>>>> ....
>>>>> Thu Feb  4 10:12:59 2010: [GC [1 CMS-initial-mark:
>>>>> 18976170K(27092608K)]
>>>>> 18989786K(27245952K), 0.0059140 secs] [Times: user=0.01 sys=0.00,
>>>>> real=0.01
>>>>> secs] 
>>>>> Thu Feb  4 10:12:59 2010: [CMS-concurrent-mark-start]
>>>>> Thu Feb  4 10:12:59 2010 [18,980,009 K]: [GC 42035.293: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:   
>>>>> 1833064
>>>>> bytes,    1833064 total - age   2:      18472 bytes,    1851536 total :
>>>>> 148315K->3839K(153344K), 0.0400230 secs]
>>>>> 19124486K->18980009K(27245952K),
>>>>> 0.0401580 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] 
>>>>> Thu Feb  4 10:12:59 2010 [18,979,036 K]: [GC 42035.415: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:   
>>>>> 2068376
>>>>> bytes,    2068376 total - age   2:      23792 bytes,    2092168 total -
>>>>> age  
>>>>> 3:      18472 bytes,    2110640 total : 140159K->2866K(153344K),
>>>>> 0.0352200
>>>>> secs] 19116329K->18979036K(27245952K), 0.0353520 secs] [Times:
>>>>> user=0.29
>>>>> sys=0.00, real=0.03 secs] 
>>>>> Thu Feb  4 10:12:59 2010 [18,979,103 K]: [GC 42035.535: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:   
>>>>> 1782608
>>>>> bytes,    1782608 total - age   2:      22032 bytes,    1804640 total -
>>>>> age  
>>>>> 3:      23792 bytes,    1828432 total - age   4:      18472 bytes,   
>>>>> 1846904 total : 139186K->2933K(153344K), 0.0498480 secs]
>>>>> 19115356K->18979103K(27245952K), 0.0499850 secs] [Times: user=0.24
>>>>> sys=0.00,
>>>>> real=0.05 secs] 
>>>>> Thu Feb  4 10:12:59 2010 [18,979,181 K]: [GC 42035.673: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:   
>>>>> 2161000
>>>>> bytes,    2161000 total - age   2:      22032 bytes,    2183032 total -
>>>>> age  
>>>>> 3:      22032 bytes,    2205064 total - age   4:      23792 bytes,   
>>>>> 2228856 total : 139253K->2992K(153344K), 0.0400420 secs]
>>>>> 19115423K->18979181K(27245952K), 0.0401730 secs] [Times: user=0.26
>>>>> sys=0.00,
>>>>> real=0.04 secs] 
>>>>> Thu Feb  4 10:13:06 2010 [18,979,876 K]: [GC 42042.243: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:   
>>>>> 2113272
>>>>> bytes,    2113272 total - age   2:      23792 bytes,    2137064 total -
>>>>> age  
>>>>> 3:      22032 bytes,    2159096 total - age   4:      22032 bytes,   
>>>>> 2181128 total : 138855K->3662K(153344K), 0.0465500 secs]
>>>>> 19115044K->18979876K(27245952K), 0.0466840 secs] [Times: user=0.27
>>>>> sys=0.00,
>>>>> real=0.04 secs] 
>>>>> Thu Feb  4 10:13:35 2010 [18,979,051 K]: [GC 42071.809: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:    
>>>>> 219056
>>>>> bytes,     219056 total - age   2:    1867632 bytes,    2086688 total -
>>>>> age  
>>>>> 3:      23792 bytes,    2110480 total - age   4:      22032 bytes,   
>>>>> 2132512 total : 139982K->2814K(153344K), 0.0426920 secs]
>>>>> 19116196K->18979051K(27245952K), 0.0428200 secs] [Times: user=0.28
>>>>> sys=0.00,
>>>>> real=0.05 secs] 
>>>>> Thu Feb  4 10:13:35 2010: [CMS-concurrent-mark: 36.272/36.715 secs]
>>>>> [Times:
>>>>> user=56.34 sys=0.34, real=36.71 secs] 
>>>>> ....
>>>>>
>>>>>
>>>>> ....
>>>>> Thu Feb  4 11:28:59 2010: [GC [1 CMS-initial-mark:
>>>>> 19012252K(27092608K)]
>>>>> 19032719K(27245952K), 0.0279440 secs] [Times: user=0.02 sys=0.00,
>>>>> real=0.03
>>>>> secs] 
>>>>> Thu Feb  4 11:28:59 2010: [CMS-concurrent-mark-start]
>>>>> Thu Feb  4 11:28:59 2010 [19,154,661 K]: [GC 46595.614: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 1 (max 4) - age   1:  
>>>>> 17431976
>>>>> bytes,   17431976 total : 153344K->17024K(153344K), 0.9259390 secs]
>>>>> 19165596K->19154661K(27245952K), 0.9260890 secs] [Times: user=1.23
>>>>> sys=0.05,
>>>>> real=0.93 secs] 
>>>>> Thu Feb  4 11:29:00 2010 [19,199,696 K]: [GC 46596.973: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 1 (max 4) - age   1:  
>>>>> 17448432
>>>>> bytes,   17448432 total : 153344K->17024K(153344K), 0.3054460 secs]
>>>>> 19290981K->19199696K(27245952K), 0.3055930 secs] [Times: user=0.79
>>>>> sys=0.07,
>>>>> real=0.31 secs] 
>>>>> Thu Feb  4 11:29:01 2010 [19,229,358 K]: [GC 46597.722: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 1 (max 4) - age   1:  
>>>>> 10331408
>>>>> bytes,   10331408 total : 153344K->17024K(153344K), 0.1449040 secs]
>>>>> 19336016K->19229358K(27245952K), 0.1450630 secs] [Times: user=0.59
>>>>> sys=0.02,
>>>>> real=0.15 secs] 
>>>>> Thu Feb  4 11:29:02 2010 [19,240,015 K]: [GC 46598.038: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 1 (max 4) - age   1:  
>>>>> 17354120
>>>>> bytes,   17354120 total : 153344K->17024K(153344K), 0.0935440 secs]
>>>>> 19365678K->19240015K(27245952K), 0.0936870 secs] [Times: user=0.59
>>>>> sys=0.00,
>>>>> real=0.10 secs] 
>>>>> Thu Feb  4 11:29:02 2010 [19,302,509 K]: [GC 46598.347: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 1 (max 4) - age   1:  
>>>>> 17432360
>>>>> bytes,   17432360 total : 151519K->17024K(153344K), 0.4619430 secs]
>>>>> 19374510K->19302509K(27245952K), 0.4620890 secs] [Times: user=1.19
>>>>> sys=0.02,
>>>>> real=0.46 secs] 
>>>>> Thu Feb  4 11:29:03 2010 [19,329,594 K]: [GC 46599.030: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 1 (max 4) - age   1:  
>>>>> 11933296
>>>>> bytes,   11933296 total : 153344K->17024K(153344K), 0.1492670 secs]
>>>>> 19438829K->19329594K(27245952K), 0.1494200 secs] [Times: user=0.77
>>>>> sys=0.03,
>>>>> real=0.15 secs] 
>>>>> Thu Feb  4 11:29:03 2010 [19,337,176 K]: [GC 46599.423: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:   
>>>>> 5719032
>>>>> bytes,    5719032 total : 153344K->12852K(153344K), 0.0675180 secs]
>>>>> 19465914K->19337176K(27245952K), 0.0676670 secs] [Times: user=0.46
>>>>> sys=0.02,
>>>>> real=0.07 secs] 
>>>>> Thu Feb  4 11:29:03 2010 [19,338,256 K]: [GC 46599.754: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 2 (max 4) - age   1:   
>>>>> 6783616
>>>>> bytes,    6783616 total - age   2:    5570080 bytes,   12353696 total :
>>>>> 149172K->13931K(153344K), 0.0650520 secs]
>>>>> 19473496K->19338256K(27245952K),
>>>>> 0.0652060 secs] [Times: user=0.42 sys=0.01, real=0.07 secs] 
>>>>> Thu Feb  4 11:29:04 2010 [19,340,264 K]: [GC 46600.144: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 2 (max 4) - age   1:   
>>>>> 2190136
>>>>> bytes,    2190136 total - age   2:    6762648 bytes,    8952784 total :
>>>>> 150251K->10375K(153344K), 0.0709470 secs]
>>>>> 19474576K->19340264K(27245952K),
>>>>> 0.0710920 secs] [Times: user=0.48 sys=0.01, real=0.07 secs] 
>>>>> Thu Feb  4 11:29:17 2010 [19,345,928 K]: [GC 46613.628: [ParNew Desired
>>>>> survivor size 8716288 bytes, new threshold 4 (max 4) - age   1:    
>>>>> 760488
>>>>> bytes,     760488 total - age   2:    2173472 bytes,    2933960 total :
>>>>> 146695K->9279K(153344K), 0.0689390 secs]
>>>>> 19476584K->19345928K(27245952K),
>>>>> 0.0691070 secs] [Times: user=0.45 sys=0.00, real=0.07 secs] 
>>>>> Thu Feb  4 11:29:17 2010: [CMS-concurrent-mark: 15.853/18.260 secs]
>>>>> [Times:
>>>>> user=29.26 sys=0.55, real=18.26 secs] 
>>>>> Thu Feb  4 11:29:17 2010: [CMS-concurrent-preclean-start]
>>>>> ....
>>>>>
>>>>> Any idea what may causing these pauses and what possible JVM tuning
>>>>> would
>>>>> be
>>>>> neccessary to avoid these pauses? Your help greatly appreciated.
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Atul
>>> http://old.nabble.com/file/p27463304/long.pause.in.cms.gc.log
>>> long.pause.in.cms.gc.log 
>>
>>
>