Intermittent issue with long concurrent marking phase

Wed Sep 28 11:50:22 UTC 2011

Hello,

We are seeing intermittent pauses in our application that correlate very well with the concurrent mark phase of the CMS GC. At the outset, this seems to me to be a case of Bug # 6692906 - but given that our issue is now occuring with alarming regularity on production systems, this is not something I want to assume. 

Details on the environment:
---------------------------

The servers are running Tomcat 6.0.29 + JRE 1.6.0_20 on Windows 2003 R2 64 Bit service pack 2 servers with 8 or 16 cores (i.e., the issue has been seen on both kinds of servers). As a matter of policy, we try not to use custom JVM options unless proven absolutely necessary. The current custom options are as follows:

-server -Xms1400m -Xmx1400m -Xmn350m -Xss330k -verbose:gc -XX:+DisableExplicitGC -XX:+ForceTimeHighResolution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:../../logs/gc.log -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 

Details on the issue:
-------------------------

There are really long concurrent mark phases now and then, and this causes many application threads to be paused.

GC log snippet showing one case where this phase took ~ 135 seconds:

848463.696: [GC [1 CMS-initial-mark: 989633K(1075200K)] 1003551K(1397760K), 6.3838946 secs] [Times: user=0.06 sys=0.01, real=6.39 secs] 
848470.085: [CMS-concurrent-mark-start]
848605.512: [CMS-concurrent-mark: 135.377/135.426 secs] [Times: user=7.83 sys=9.92, real=135.44 secs] 
848605.512: [CMS-concurrent-preclean-start]
848605.726: [CMS-concurrent-preclean: 0.207/0.214 secs] [Times: user=0.06 sys=0.00, real=0.22 secs] 
848605.726: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 848610.846: [CMS-concurrent-abortable-preclean: 1.014/5.120 secs] [Times: user=1.05 sys=0.08, real=5.13 secs] 
848610.871: [GC[YG occupancy: 160577 K (322560 K)]848610.871: [Rescan (parallel) , 0.1280728 secs]848610.999: [weak refs processing, 1.3120556 secs] [1 CMS-remark: 989633K(1075200K)] 1150210K(1397760K), 1.4413516 secs]

In our application logs, we are able to trace an initial STW pause due to the initial mark - although already long at 6+ seconds, this *at least* makes sense (and it is generally not this long). So although not ideal, we can live with it until the bigger issue is resolved.

[09-08 21:14:57] https-xxxxx (INFO) 16ms spid:295 .....
[There is a six second gap in the logs, constituting a clear pause - matching the duration of the STW initial mark phase]
[09-08 21:15:03] cluster-xxxxx (INFO) ....

The bigger issue is that, subsequent to this, some mission-critical threads are paused as we get close to the end of the CM phase - snippets from the application logs:

09-08 21:15:03 ==> Time stamp corresponding to when concurrent marking phase begins
[09-08 21:16:19] health-check (INFO) .....
[There is a 60 second gap here in this thread - some other threads are executing but there is generally a great reduction in the number of active threads at this point]
09-08 21:17:18 ==> Time stamp corresponding to when concurrent marking phase ends
[09-08 21:17:19] health-check (INFO) .....

This timing seems to suggest a clear correlation between the long duration CM phase and the pausing of the application threads. [The nature of our application is that a pausing of this thread for such a long time necessitates a controlled restart of the app, which explains you see nothing in the GC logs shortly after this point.]

Questions:

1\ is it clear based on the description above that the issue is identical to 6692906 (http://bugs.sun.com/view_bug.do?bug_id=6692906)?
2\ will we benefit by upgrading to a more recent JRE [1.6.0_26 being the one under consideration]?
3\ I have seen recommendations to use "-XX:-CMSConcurrentMTEnabled" on some web forums - but I have concerns about this; if we don't allow for concurrent marking to use multiple threads, then isnt there a danger of marking proceeding so slowly that we might end up running out of memory i.e., garbage created much faster than it is collected]?

Any help is greatly appreciated. Please let me know if any additional information is needed at all. I haven't attached the full GC log (it caused problems with posting) but will gladly send it directly to anybody who would like.

Regards,
Srini.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20110928/9957cb01/attachment.htm>