Intermittent issue with long concurrent marking phase

Thu Sep 29 08:24:33 UTC 2011

Hi Srini -- (inline below)

On 9/28/2011 4:50 AM, Srini Padman wrote:
> Hello,
>
> We are seeing intermittent pauses in our application that correlate 
> very well with the concurrent mark phase of the CMS GC. At the outset, 
> this seems to me to be a case of Bug # 6692906 - but given that our 
> issue is now occuring with alarming regularity on production systems, 
> this is not something I want to assume.
>
> Details on the environment:
> ---------------------------
>
> The servers are running Tomcat 6.0.29 + JRE 1.6.0_20 on Windows 2003 
> R2 64 Bit service pack 2 servers with 8 or 16 cores (i.e., the issue 
> has been seen on both kinds of servers). As a matter of policy, we try 
> not to use custom JVM options unless proven absolutely necessary. The 
> current custom options are as follows:
>
> -server -Xms1400m -Xmx1400m -Xmn350m -Xss330k -verbose:gc 
> -XX:+DisableExplicitGC -XX:+ForceTimeHighResolution 
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:../../logs/gc.log 
> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>
> Details on the issue:
> -------------------------
>
> There are really long concurrent mark phases now and then, and this 
> causes many application threads to be paused.
>
> GC log snippet showing one case where this phase took ~ 135 seconds:
>
> 848463.696: [GC [1 CMS-initial-mark: 989633K(1075200K)] 
> 1003551K(1397760K), 6.3838946 secs] [Times: user=0.06 sys=0.01, 
> real=6.39 secs]
> 848470.085: [CMS-concurrent-mark-start]
> 848605.512: [CMS-concurrent-mark: 135.377/135.426 secs] [Times: 
> user=7.83 sys=9.92, real=135.44 secs]
> 848605.512: [CMS-concurrent-preclean-start]
> 848605.726: [CMS-concurrent-preclean: 0.207/0.214 secs] [Times: 
> user=0.06 sys=0.00, real=0.22 secs]
> 848605.726: [CMS-concurrent-abortable-preclean-start]
>  CMS: abort preclean due to time 848610.846: 
> [CMS-concurrent-abortable-preclean: 1.014/5.120 secs] [Times: 
> user=1.05 sys=0.08, real=5.13 secs]
> 848610.871: [GC[YG occupancy: 160577 K (322560 K)]848610.871: [Rescan 
> (parallel) , 0.1280728 secs]848610.999: [weak refs processing, 
> 1.3120556 secs] [1 CMS-remark: 989633K(1075200K)] 1150210K(1397760K), 
> 1.4413516 secs]
>
> In our application logs, we are able to trace an initial STW pause due 
> to the initial mark - although already long at 6+ seconds, this *at 
> least* makes sense (and it is generally not this long). So although 
> not ideal, we can live with it until the bigger issue is resolved.
>
> [09-08 21:14:57] https-xxxxx (INFO) 16ms spid:295 .....
> [There is a six second gap in the logs, constituting a clear pause - 
> matching the duration of the STW initial mark phase]
> [09-08 21:15:03] cluster-xxxxx (INFO) ....
>
> The bigger issue is that, subsequent to this, some mission-critical 
> threads are paused as we get close to the end of the CM phase - 
> snippets from the application logs:
>
> 09-08 21:15:03 ==> Time stamp corresponding to when concurrent marking 
> phase begins
> [09-08 21:16:19] health-check (INFO) .....
> [There is a 60 second gap here in this thread - some other threads are 
> executing but there is generally a great reduction in the number of 
> active threads at this point]
> 09-08 21:17:18 ==> Time stamp corresponding to when concurrent marking 
> phase ends
> [09-08 21:17:19] health-check (INFO) .....
>
> This timing seems to suggest a clear correlation between the long 
> duration CM phase and the pausing of the application threads. [The 
> nature of our application is that a pausing of this thread for such a 
> long time necessitates a controlled restart of the app, which explains 
> you see nothing in the GC logs shortly after this point.]
>
> Questions:
>
> 1\ is it clear based on the description above that the issue is 
> identical to 6692906 (http://bugs.sun.com/view_bug.do?bug_id=6692906)?
>

Very likely the same bug.

> 2\ will we benefit by upgrading to a more recent JRE [1.6.0_26 being 
> the one under consideration]?
>

Definitely worth trying.

> 3\ I have seen recommendations to use "-XX:-CMSConcurrentMTEnabled" on 
> some web forums - but I have concerns about this; if we don't allow 
> for concurrent marking to use multiple threads, then isnt there a 
> danger of marking proceeding so slowly that we might end up running 
> out of memory i.e., garbage created much faster than it is collected]?
>

Your concerns are very legitimate (especially given the length of the 
concurrent mark phase) and the number of cores you have.

>
> Any help is greatly appreciated. Please let me know if any additional 
> information is needed at all. I haven't attached the full GC log (it 
> caused problems with posting) but will gladly send it directly to 
> anybody who would like.
>

The long initial mark pause is definitely concerning -- Does it show up 
regularly
in the GC logs or is the snippet above an anomaly? Curisously, as the 
process time
shows, the user and system time are both low but the elapsed time is 
very large.
That looks like a total stall of the process, and I have no conjectures
based on available data.

I suggest talking with your Java support folk if you reproduce this 
after upgrading to
6u28 (or whatever).

best regards.
-- ramki

>
> Regards,
> Srini.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20110929/b0e160b5/attachment.htm>