<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi Srini -- (inline below)<br>
<br>
On 9/28/2011 4:50 AM, Srini Padman wrote:
<blockquote
cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">Hello,<br>
<br>
We are seeing intermittent pauses in our application that
correlate very well with the concurrent mark phase of the
CMS GC. At the outset, this seems to me to be a case of
Bug # 6692906 - but given that our issue is now occuring
with alarming regularity on production systems, this is
not something I want to assume. <br>
<br>
Details on the environment:<br>
---------------------------<br>
<br>
The servers are running Tomcat 6.0.29 + JRE 1.6.0_20 on
Windows 2003 R2 64 Bit service pack 2 servers with 8 or 16
cores (i.e., the issue has been seen on both kinds of
servers). As a matter of policy, we try not to use custom
JVM options unless proven absolutely necessary. The
current custom options are as follows:<br>
<br>
-server -Xms1400m -Xmx1400m -Xmn350m -Xss330k -verbose:gc
-XX:+DisableExplicitGC -XX:+ForceTimeHighResolution
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xloggc:../../logs/gc.log -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC <br>
<br>
Details on the issue:<br>
-------------------------<br>
<br>
There are really long concurrent mark phases now and then,
and this causes many application threads to be paused.<br>
<br>
GC log snippet showing one case where this phase took ~
135 seconds:<br>
<br>
848463.696: [GC [1 CMS-initial-mark: 989633K(1075200K)]
1003551K(1397760K), 6.3838946 secs] [Times: user=0.06
sys=0.01, real=6.39 secs] <br>
848470.085: [CMS-concurrent-mark-start]<br>
848605.512: [CMS-concurrent-mark: 135.377/135.426 secs]
[Times: user=7.83 sys=9.92, real=135.44 secs] <br>
848605.512: [CMS-concurrent-preclean-start]<br>
848605.726: [CMS-concurrent-preclean: 0.207/0.214 secs]
[Times: user=0.06 sys=0.00, real=0.22 secs] <br>
848605.726: [CMS-concurrent-abortable-preclean-start]<br>
CMS: abort preclean due to time 848610.846:
[CMS-concurrent-abortable-preclean: 1.014/5.120 secs]
[Times: user=1.05 sys=0.08, real=5.13 secs] <br>
848610.871: [GC[YG occupancy: 160577 K (322560
K)]848610.871: [Rescan (parallel) , 0.1280728
secs]848610.999: [weak refs processing, 1.3120556 secs] [1
CMS-remark: 989633K(1075200K)] 1150210K(1397760K),
1.4413516 secs]<br>
<br>
In our application logs, we are able to trace an initial
STW pause due to the initial mark - although already long
at 6+ seconds, this *at least* makes sense (and it is
generally not this long). So although not ideal, we can
live with it until the bigger issue is resolved.<br>
<br>
[09-08 21:14:57] https-xxxxx (INFO) 16ms spid:295 .....<br>
[There is a six second gap in the logs, constituting a
clear pause - matching the duration of the STW initial
mark phase]<br>
[09-08 21:15:03] cluster-xxxxx (INFO) ....<br>
<br>
The bigger issue is that, subsequent to this, some
mission-critical threads are paused as we get close to the
end of the CM phase - snippets from the application logs:<br>
<br>
09-08 21:15:03 ==> Time stamp corresponding to when
concurrent marking phase begins<br>
[09-08 21:16:19] health-check (INFO) .....<br>
[There is a 60 second gap here in this thread - some other
threads are executing but there is generally a great
reduction in the number of active threads at this point]<br>
09-08 21:17:18 ==> Time stamp corresponding to when
concurrent marking phase ends<br>
[09-08 21:17:19] health-check (INFO) .....<br>
<br>
This timing seems to suggest a clear correlation between
the long duration CM phase and the pausing of the
application threads. [The nature of our application is
that a pausing of this thread for such a long time
necessitates a controlled restart of the app, which
explains you see nothing in the GC logs shortly after this
point.]<br>
<br>
Questions:<br>
<br>
1\ is it clear based on the description above that the
issue is identical to 6692906 (<a moz-do-not-send="true"
href="http://bugs.sun.com/view_bug.do?bug_id=6692906">http://bugs.sun.com/view_bug.do?bug_id=6692906</a>)?<br>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
Very likely the same bug.<br>
<br>
<blockquote
cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">2\ will we benefit
by upgrading to a more recent JRE [1.6.0_26 being the one
under consideration]?<br>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
Definitely worth trying.<br>
<br>
<blockquote
cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">3\ I have seen
recommendations to use "-XX:-CMSConcurrentMTEnabled" on
some web forums - but I have concerns about this; if we
don't allow for concurrent marking to use multiple
threads, then isnt there a danger of marking proceeding so
slowly that we might end up running out of memory i.e.,
garbage created much faster than it is collected]?<br>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
Your concerns are very legitimate (especially given the length of
the concurrent mark phase) and the number of cores you have.<br>
<br>
<blockquote
cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top"> <br>
Any help is greatly appreciated. Please let me know if any
additional information is needed at all. I haven't
attached the full GC log (it caused problems with posting)
but will gladly send it directly to anybody who would
like.<br>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
The long initial mark pause is definitely concerning -- Does it show
up regularly<br>
in the GC logs or is the snippet above an anomaly? Curisously, as
the process time<br>
shows, the user and system time are both low but the elapsed time is
very large.<br>
That looks like a total stall of the process, and I have no
conjectures<br>
based on available data. <br>
<br>
I suggest talking with your Java support folk if you reproduce this
after upgrading to<br>
6u28 (or whatever).<br>
<br>
best regards.<br>
-- ramki<br>
<br>
<blockquote
cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top"><br>
Regards,<br>
Srini.</td>
</tr>
</tbody>
</table>
</blockquote>
</body>
</html>