<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    Hi Srini -- (inline below)<br>

    <br>

    On 9/28/2011 4:50 AM, Srini Padman wrote:

    <blockquote

cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"

      type="cite">

      <table border="0" cellpadding="0" cellspacing="0">

        <tbody>

          <tr>

            <td style="font: inherit;" valign="top">Hello,<br>

               <br>

              We are seeing intermittent pauses in our application that

              correlate very well with the concurrent mark phase of the

              CMS GC. At the outset, this seems to me to be a case of

              Bug # 6692906 - but given that our issue is now occuring

              with alarming regularity on production systems, this is

              not something I want to assume. <br>

               <br>

              Details on the environment:<br>

              ---------------------------<br>

               <br>

              The servers are running Tomcat 6.0.29 + JRE 1.6.0_20 on

              Windows 2003 R2 64 Bit service pack 2 servers with 8 or 16

              cores (i.e., the issue has been seen on both kinds of

              servers). As a matter of policy, we try not to use custom

              JVM options unless proven absolutely necessary. The

              current custom options are as follows:<br>

               <br>

              -server -Xms1400m -Xmx1400m -Xmn350m -Xss330k -verbose:gc

              -XX:+DisableExplicitGC -XX:+ForceTimeHighResolution

              -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

              -Xloggc:../../logs/gc.log -XX:+UseParNewGC

              -XX:+UseConcMarkSweepGC <br>

               <br>

              Details on the issue:<br>

              -------------------------<br>

               <br>

              There are really long concurrent mark phases now and then,

              and this causes many application threads to be paused.<br>

               <br>

              GC log snippet showing one case where this phase took ~

              135 seconds:<br>

               <br>

              848463.696: [GC [1 CMS-initial-mark: 989633K(1075200K)]

              1003551K(1397760K), 6.3838946 secs] [Times: user=0.06

              sys=0.01, real=6.39 secs] <br>

              848470.085: [CMS-concurrent-mark-start]<br>

              848605.512: [CMS-concurrent-mark: 135.377/135.426 secs]

              [Times: user=7.83 sys=9.92, real=135.44 secs] <br>

              848605.512: [CMS-concurrent-preclean-start]<br>

              848605.726: [CMS-concurrent-preclean: 0.207/0.214 secs]

              [Times: user=0.06 sys=0.00, real=0.22 secs] <br>

              848605.726: [CMS-concurrent-abortable-preclean-start]<br>

               CMS: abort preclean due to time 848610.846:

              [CMS-concurrent-abortable-preclean: 1.014/5.120 secs]

              [Times: user=1.05 sys=0.08, real=5.13 secs] <br>

              848610.871: [GC[YG occupancy: 160577 K (322560

              K)]848610.871: [Rescan (parallel) , 0.1280728

              secs]848610.999: [weak refs processing, 1.3120556 secs] [1

              CMS-remark: 989633K(1075200K)] 1150210K(1397760K),

              1.4413516 secs]<br>

               <br>

              In our application logs, we are able to trace an initial

              STW pause due to the initial mark - although already long

              at 6+ seconds, this *at least* makes sense (and it is

              generally not this long). So although not ideal, we can

              live with it until the bigger issue is resolved.<br>

               <br>

              [09-08 21:14:57] https-xxxxx (INFO) 16ms spid:295 .....<br>

              [There is a six second gap in the logs, constituting a

              clear pause - matching the duration of the STW initial

              mark phase]<br>

              [09-08 21:15:03] cluster-xxxxx (INFO) ....<br>

               <br>

              The bigger issue is that, subsequent to this, some

              mission-critical threads are paused as we get close to the

              end of the CM phase - snippets from the application logs:<br>

               <br>

              09-08 21:15:03 ==> Time stamp corresponding to when

              concurrent marking phase begins<br>

              [09-08 21:16:19] health-check (INFO) .....<br>

              [There is a 60 second gap here in this thread - some other

              threads are executing but there is generally a great

              reduction in the number of active threads at this point]<br>

              09-08 21:17:18 ==> Time stamp corresponding to when

              concurrent marking phase ends<br>

              [09-08 21:17:19] health-check (INFO) .....<br>

               <br>

              This timing seems to suggest a clear correlation between

              the long duration CM phase and the pausing of the

              application threads. [The nature of our application is

              that a pausing of this thread for such a long time

              necessitates a controlled restart of the app, which

              explains you see nothing in the GC logs shortly after this

              point.]<br>

               <br>

              Questions:<br>

               <br>

              1\ is it clear based on the description above that the

              issue is identical to 6692906 (<a moz-do-not-send="true"

                href="http://bugs.sun.com/view_bug.do?bug_id=6692906">http://bugs.sun.com/view_bug.do?bug_id=6692906</a>)?<br>

            </td>

          </tr>

        </tbody>

      </table>

    </blockquote>

    <br>

    Very likely the same bug.<br>

    <br>

    <blockquote

cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"

      type="cite">

      <table border="0" cellpadding="0" cellspacing="0">

        <tbody>

          <tr>

            <td style="font: inherit;" valign="top">2\ will we benefit

              by upgrading to a more recent JRE [1.6.0_26 being the one

              under consideration]?<br>

            </td>

          </tr>

        </tbody>

      </table>

    </blockquote>

    <br>

    Definitely worth trying.<br>

    <br>

    <blockquote

cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"

      type="cite">

      <table border="0" cellpadding="0" cellspacing="0">

        <tbody>

          <tr>

            <td style="font: inherit;" valign="top">3\ I have seen

              recommendations to use "-XX:-CMSConcurrentMTEnabled" on

              some web forums - but I have concerns about this; if we

              don't allow for concurrent marking to use multiple

              threads, then isnt there a danger of marking proceeding so

              slowly that we might end up running out of memory i.e.,

              garbage created much faster than it is collected]?<br>

            </td>

          </tr>

        </tbody>

      </table>

    </blockquote>

    <br>

    Your concerns are very legitimate (especially given the length of

    the concurrent mark phase) and the number of cores you have.<br>

    <br>

    <blockquote

cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"

      type="cite">

      <table border="0" cellpadding="0" cellspacing="0">

        <tbody>

          <tr>

            <td style="font: inherit;" valign="top"> <br>

              Any help is greatly appreciated. Please let me know if any

              additional information is needed at all. I haven't

              attached the full GC log (it caused problems with posting)

              but will gladly send it directly to anybody who would

              like.<br>

            </td>

          </tr>

        </tbody>

      </table>

    </blockquote>

    <br>

    The long initial mark pause is definitely concerning -- Does it show

    up regularly<br>

    in the GC logs or is the snippet above an anomaly? Curisously, as

    the process time<br>

    shows, the user and system time are both low but the elapsed time is

    very large.<br>

    That looks like a total stall of the process, and I have no

    conjectures<br>

    based on available data. <br>

    <br>

    I suggest talking with your Java support folk if you reproduce this

    after upgrading to<br>

    6u28 (or whatever).<br>

    <br>

    best regards.<br>

    -- ramki<br>

    <br>

    <blockquote

cite="mid:1317210622.27249.YahooMailClassic@web162020.mail.bf1.yahoo.com"

      type="cite">

      <table border="0" cellpadding="0" cellspacing="0">

        <tbody>

          <tr>

            <td style="font: inherit;" valign="top"><br>

              Regards,<br>

              Srini.</td>

          </tr>

        </tbody>

      </table>

    </blockquote>

  </body>

</html>