From darius.ski at gmail.com  Fri Sep  2 09:18:46 2011
From: darius.ski at gmail.com (Darius D.)
Date: Fri, 2 Sep 2011 19:18:46 +0300
Subject: G1C strange Full GC behavior
Message-ID: <CAKt3ReLLTvzF8Bjn-nEJ2yvM1fC4aKya+7tz86pkzV+ZFe1UBg@mail.gmail.com>

Hi,

I am puzzled by some strange G1 behavior. Everything else runs
perfectly, and G1 does a great job of keeping heap usage stable.

But once in a while we get the following:

...
556882.040: [Full GC 1650M->1552M(3072M), 5.7839410 secs]
 [Times: user=9.77 sys=0.00, real=5.78 secs]
556887.825: [GC concurrent-mark-abort]
...

And of course the JVM grinds to a halt for 5+ seconds.

Why is a Full GC happening when there is no apparent pressure on the
heap? And why does it reclaim just 98M (sometimes even less)?
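
For reference, the 98M figure can be read straight off the Full GC
record quoted above. What follows is a minimal Python sketch, assuming
the JDK 6 log format of that single sample line; the regular expression
and the name summarize_full_gc are illustrative and not part of any JVM
tooling.

```python
import re

# Pattern inferred from the one "Full GC" line in this message (an assumption,
# not a documented log grammar): timestamp, before->after(total), pause.
FULL_GC = re.compile(
    r"(?P<ts>\d+\.\d+): \[Full GC (?P<before>\d+)M->(?P<after>\d+)M"
    r"\((?P<total>\d+)M\), (?P<secs>\d+\.\d+) secs\]"
)

def summarize_full_gc(line):
    """Return reclaimed MB, heap occupancy before the GC, and pause length."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    before, after, total = (int(m.group(k)) for k in ("before", "after", "total"))
    return {
        "reclaimed_mb": before - after,
        "occupancy_before_pct": round(100.0 * before / total, 1),
        "pause_secs": float(m.group("secs")),
    }

sample = "556882.040: [Full GC 1650M->1552M(3072M), 5.7839410 secs]"
summary = summarize_full_gc(sample)
print(summary)
```

The occupancy figure makes the puzzle concrete: the heap was only a
little over half full when the Full GC started.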

At first I thought that the app was trying to allocate a large amount
of memory, but the young GC immediately after the Full GC looks like this:

556887.928: [GC pause (young), 0.07898900 secs]
   [Parallel Time:  74.1 ms]
      [GC Worker Start Time (ms):  556887929.4  556887929.4  556887929.4  556887929.5  556887929.5  556887929.6  556887929.6  556887929.7  556887929.7  556887929.7  556887929.8  556887929.8  556887929.9  556887930.0  556887930.0  556887930.1  556887930.2  556887930.2]
      [Update RS (ms):  0.0  34.8  36.9  36.9  8.1  8.0  37.5  30.2  35.0  35.9  0.0  34.2  35.8  35.2  39.8  34.9  32.7  32.7
       Avg:  28.3, Min:   0.0, Max:  39.8]
         [Processed Buffers : 0 52 50 56 25 13 56 57 49 60 0 59 64 62 57 49 47 47
          Sum: 803, Avg: 44, Min: 0, Max: 64]
      [Ext Root Scanning (ms):  73.3  25.1  23.3  23.2  52.4  52.2  23.2  30.1  21.6  24.5  73.0  25.4  23.6  24.1  19.8  24.5  26.4  26.5
       Avg:  32.9, Min:  19.8, Max:  73.3]
      [Mark Stack Scanning (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
      [Scan RS (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.1  0.1  0.0  0.0  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.1]
      [Object Copy (ms):  0.1  3.0  2.5  2.7  2.3  2.4  2.0  2.3  5.9  2.2  0.0  2.8  2.9  2.9  2.6  2.7  3.0  2.9
       Avg:   2.5, Min:   0.0, Max:   5.9]
      [Termination (ms):  0.0  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  0.0  10.5  10.5  10.5  10.5  10.5  10.5  10.5
       Avg:   9.3, Min:   0.0, Max:  10.5]
         [Termination Attempts : 1 3 2 2 1 2 2 2 2 2 1 2 2 2 2 3 3 2
          Sum: 36, Avg: 2, Min: 1, Max: 3]
      [GC Worker End Time (ms):  556888002.8  556888003.0  556888003.2  556888002.9  556888002.8  556888002.8  556888003.2  556888003.1  556888003.1  556888003.3  556888003.3  556888002.8  556888002.8  556888003.0  556888002.8  556888002.8  556888003.2  556888002.8]
      [Other:   1.1 ms]
   [Clear CT:   1.3 ms]
   [Other:   3.6 ms]
      [Choose CSet:   0.0 ms]
   [ 1655M->1569M(3072M)]
 [Times: user=1.06 sys=0.00, real=0.08 secs]
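
When reading a record like the one above, it can help to recompute the
per-phase summary from the per-worker values, both to confirm that a
wrapped log was reassembled correctly and to see how unevenly the work
was spread. A small Python sketch, with the 18 "Update RS" values copied
by hand from the record; the helper name summarize is made up for this
example.

```python
# Per-worker "Update RS (ms)" values copied from the young GC record above.
update_rs_ms = [0.0, 34.8, 36.9, 36.9, 8.1, 8.0, 37.5, 30.2, 35.0,
                35.9, 0.0, 34.2, 35.8, 35.2, 39.8, 34.9, 32.7, 32.7]

def summarize(values):
    """Return (avg, min, max), with avg rounded to one decimal like the log."""
    return round(sum(values) / len(values), 1), min(values), max(values)

avg, lo, hi = summarize(update_rs_ms)
print(f"workers={len(update_rs_ms)} avg={avg} min={lo} max={hi}")
```

The result should agree with the "Avg: 28.3, Min: 0.0, Max: 39.8" line
that the JVM printed for this phase.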


Running Sun JDK 1.6 update 27 with the following options:

-Xloggc:/opt/gclog_tomcat.txt -XX:+UseG1GC -XX:+PrintGCDetails
-Xmx3072m -Xms3072m -XX:NewSize=768m -XX:MaxNewSize=768m
-XX:SurvivorRatio=5 -XX:MaxPermSize=768m
-XX:+HeapDumpOnOutOfMemoryError -XX:+UnlockExperimentalVMOptions
-XX:+AggressiveOpts -XX:+UseCompressedOops -XX:+UseFastAccessorMethods
-XX:+DoEscapeAnalysis


Best Regards,

Darius.

From nhann at chalmers.se  Tue Sep 13 06:52:30 2011
From: nhann at chalmers.se (Dang Nhan Nguyen)
Date: Tue, 13 Sep 2011 15:52:30 +0200
Subject: PromotionInfo initialization and reset() are not consistent,
In-Reply-To: <AANLkTin8Mqp+aXPCODH0o14JFyH3ANk4aL_DckRjAZD7@mail.gmail.com>
References: <AANLkTiniKm1ghON3AIl5JPNzOVqyasTxxQaRsGCPBnba@mail.gmail.com>
	<AANLkTinjYAdubXNSV2tdic1Xn-jlTFs_iswOBpU3xNGQ@mail.gmail.com>
	<4C3CB3E9.4040305@oracle.com>
	<AANLkTim63XWKdILkuMcALP5rOPL_ga_rbCfYN3AAYalT@mail.gmail.com>
	<AANLkTimKNhq7vvx0-S1yiJj6GA_O1ySin2Oys-LeYA8H@mail.gmail.com>
	<AANLkTi=x3OGMOoUoHFNmqqX9L+et02T=fS2XQ9RRvoDr@mail.gmail.com>
	<AANLkTikjyHQOVqUh2_1JVCj+ix37aEqEkgxQ5FJfiLLg@mail.gmail.com>
	<AANLkTimenxnZo0Pf64g0OsyEU8Bh=AY8hjA=-ewAuh_k@mail.gmail.com>
	<AANLkTinO-EdBNTtKoGbCTqa78sdOEcJtRp87BM07D4O9@mail.gmail.com>
	<AANLkTin8Mqp+aXPCODH0o14JFyH3ANk4aL_DckRjAZD7@mail.gmail.com>
Message-ID: <416B678CFDED2C43ADFFAC692465DAC13C7CF6291C@MAPI01.ita.chalmers.se>

Hi all,

Looking at the PromotionInfo class, I noticed a small difference between the constructor and reset() in the values they assign to _firstIndex and _nextIndex:

PromotionInfo() :
    _tracking(0), _space(NULL),
    _promoHead(NULL), _promoTail(NULL),
    _spoolHead(NULL), _spoolTail(NULL),
    _spareSpool(NULL), _firstIndex(1),
    _nextIndex(1) {}

And reset():
void reset() {
    _promoHead = NULL;
    _promoTail = NULL;
    _spoolHead = NULL;
    _spoolTail = NULL;
    _spareSpool = NULL;
    _firstIndex = 0;
    _nextIndex = 0;
}

My question: once promotionInfo.reset() has been called, many consistency checks in PromotionInfo, such as the one below, will fail:
PromotionInfo.cpp[Line 323] guarantee(_spoolTail != NULL || _nextIndex == 1,
            "Inconsistency between _spoolTail and _nextIndex");

What is the purpose of resetting the _firstIndex and _nextIndex to 0? Or is it a mistake?
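
To make the concern concrete, here is a toy Python model of just the
fields and the one check quoted above. It is not HotSpot code: the class
name PromotionInfoModel and the method spool_check_holds are invented for
this sketch, and it says nothing about whether the real guarantee() is
ever reached between a reset() and the next update of these fields.

```python
class PromotionInfoModel:
    """Toy model of the quoted fields; names are hypothetical, not HotSpot's."""

    def __init__(self):
        self.spool_tail = None   # mirrors _spoolTail(NULL)
        self.first_index = 1     # mirrors _firstIndex(1)
        self.next_index = 1      # mirrors _nextIndex(1)

    def reset(self):
        # Mirrors the quoted reset(): indices go to 0, not back to 1.
        self.spool_tail = None
        self.first_index = 0
        self.next_index = 0

    def spool_check_holds(self):
        # Mirrors: guarantee(_spoolTail != NULL || _nextIndex == 1, ...)
        return self.spool_tail is not None or self.next_index == 1

info = PromotionInfoModel()
fresh = info.spool_check_holds()        # state straight after construction
info.reset()
after_reset = info.spool_check_holds()  # state straight after reset()
print(fresh, after_reset)
```

In this model the condition holds for a freshly constructed object and
does not hold immediately after reset(), which is the inconsistency
being asked about.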

Thanks,
--Nhan Nguyen

> -----Original Message-----
> From: hotspot-gc-dev-bounces at openjdk.java.net [mailto:hotspot-gc-dev-
> bounces at openjdk.java.net] On Behalf Of Peter Schuller
> Sent: Friday, July 30, 2010 11:59 PM
> To: Todd Lipcon
> Cc: hotspot-gc-use at openjdk.java.net
> Subject: Re: G1GC Full GCs
> 
> > Yep, I've seen JRRT also "abort compaction" on most compactions. I couldn't
> > quite figure out how to tell it that it was fine to pause more often for
> > compaction, so long as each pause was short.
> 
> FWIW, I got the impression at the time (but I don't remember why; I
> think I was half-guessing based on assumptions about what it does and
> several iterations through the documentation) that it was
> fundamentally only *able* to do compaction during the stop-the-world
> pause after a concurrent mark phase. I.e., I don't think you can make
> it spread the work out (but I can most definitely be wrong).
> 
> --
> / Peter Schuller

From junnifan at gmail.com  Sun Sep 25 09:07:42 2011
From: junnifan at gmail.com (Junni Fan)
Date: Sun, 25 Sep 2011 09:07:42 -0700
Subject: G1 garbage collector help
Message-ID: <CAHdESEJcWaW9HGs-Vsuc7i-mQS4befCqMmbV2S6FsaM9TWnDbw@mail.gmail.com>

Hey list, we used to use CMS for our Java server. Our memory activity
is very high, so we use 4 CMS threads and set the initiating occupancy
fraction to 1 to collect old-gen memory aggressively. The hardware has
72G of memory and 8 cores. The settings we use are:

java -Xms64G -Xmx64G -Xmn256M -verbose:gc -XX:+PrintGCDateStamps
-XX:+PrintGCDetails  -XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:+AggressiveOpts
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=1 -XX:+CMSParallelRemarkEnabled
-XX:ParallelCMSThreads=4

But because CMS suffers from fragmentation, the server still
occasionally falls back to a full GC.

So we are considering testing out the G1 collector. But it appears that
G1 falls back to a full GC even sooner. I set the following G1 parameters:

 -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
-XX:MaxGCPauseMillis=200 -XX:GCPauseIntervalMillis=300
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=8

It also appears that there are two marker threads and 8 refinement
threads running, which does not seem aggressive enough ...

Is there any way to make G1 do more work, like the ParallelCMSThreads
option we had for CMS? Also, are there good reads on G1, such as a
comprehensive list of the options that can be tuned for G1 and what
they do?

Thanks a million!

From junnifan at gmail.com  Sun Sep 25 09:08:29 2011
From: junnifan at gmail.com (Junni Fan)
Date: Sun, 25 Sep 2011 09:08:29 -0700
Subject: G1 garbage collector help
In-Reply-To: <CAHdESEJcWaW9HGs-Vsuc7i-mQS4befCqMmbV2S6FsaM9TWnDbw@mail.gmail.com>
References: <CAHdESEJcWaW9HGs-Vsuc7i-mQS4befCqMmbV2S6FsaM9TWnDbw@mail.gmail.com>
Message-ID: <CAHdESELRw1E3tgK3Bs+vB9OqHt1mW0CMKANa=hNxtUxL4sv9kw@mail.gmail.com>

BTW, we are using jdk 7 118


From y.s.ramakrishna at oracle.com  Thu Sep 29 17:54:35 2011
From: y.s.ramakrishna at oracle.com (Y. S. Ramakrishna)
Date: Thu, 29 Sep 2011 17:54:35 -0700
Subject: Intermittent issue with long concurrent marking phase
In-Reply-To: <1317307491.86326.YahooMailClassic@web162016.mail.bf1.yahoo.com>
References: <1317307491.86326.YahooMailClassic@web162016.mail.bf1.yahoo.com>
Message-ID: <4E85134B.2010206@oracle.com>

Hi Srini -- As I indicated, if you cannot upgrade easily to test
if the issue is fixed, you should probably engage JVM support to
get to a proper diagnosis of the issue affecting your production systems.

more inline below ...

On 09/29/11 07:44, Srini Padman wrote:
> Hi Ramki,
>  
> Thank you very much for your reply.
>  
> It is not *always* that the concurrent marking phase takes this long, 
> although it happens often enough. For example, in the full GC log 
> corresponding to the snippet I pasted in my posting (attached, zipped) 
> there is only that one instance.

I see that there is one instance of the long _initial-mark_ pause
and, as I stated, the whole process seems stalled at that time.
It's definitely not the stall/livelock issue of CR 6692906.

[In other words, you may be dealing with several different issues
here and you will need to disentangle them.]
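
The stalled pause discussed in this thread was recognised from its
timing line: user and sys time both low while the real (elapsed) time
was very large. A rough Python sketch of that check follows. The
function name looks_stalled and both thresholds are assumptions picked
for illustration, and the "stalled" sample line is made up, since the
actual log from this thread is not reproduced here.

```python
import re

# Matches the "[Times: ...]" suffix that -XX:+PrintGCDetails adds to a record.
TIMES = re.compile(r"\[Times: user=(\d+\.\d+) sys=(\d+\.\d+), real=(\d+\.\d+) secs\]")

def looks_stalled(line, min_real=1.0, max_cpu_ratio=0.5):
    """Flag pauses whose CPU time is small relative to wall-clock time.

    min_real and max_cpu_ratio are arbitrary illustrative thresholds.
    Returns None when the line carries no timing information.
    """
    m = TIMES.search(line)
    if m is None:
        return None
    user, sys_, real = map(float, m.groups())
    if real < min_real:
        return False  # too short to be interesting
    return (user + sys_) / real < max_cpu_ratio

# Timing line from the G1 Full GC posted earlier on this list: busy, not stalled.
busy = looks_stalled(" [Times: user=9.77 sys=0.00, real=5.78 secs]")
# Hypothetical line (invented for this sketch) with the stall signature.
stalled = looks_stalled(" [Times: user=0.05 sys=0.01, real=12.00 secs]")
print(busy, stalled)
```

A pause flagged this way only says that the GC threads were not on CPU
for most of the pause; it does not say why.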

>  
> I think I know why you are asking - based on my understanding of Bug # 
> 6692906 (more accurately, based on discussions around it on this list), 
> I was under the impression that such long CM phases will happen all the 
> time (if they happen at all). Does the fact that it is intermittent 
> raise the possibility that this is a different issue? I realize that you 
> might not be able to answer this based on the bits of information you 
> have, but perhaps the full GC log will tell you something that you don't 
> already know.

Again, my question about once/many was only about the long initial mark pause
of which there is exactly one in your log.

The stall of mutators during concurrent marking, which you are
conjecturing is 6692906, is a different issue: for that you should
either upgrade and test, or seek JVM support help. It is definitely
possible that the symptoms of 6692906 will happen infrequently or
intermittently. A definite symptom of 6692906 can be diagnosed if
the JVM completely stalls into a livelock (a few threads in the
JVM intermittently active, but your application will forever
stop forward progress from that point on). It doesn't look like
you have observed that latter symptom however?

Sorry I can't really help more at this time, but perhaps someone
from the community may be able to. But I really suggest either
an upgrade or seeking JVM support help. (This is not a professional
support alias.)

In general, the GC-use alias is better suited to questions such as this.
GC-dev should be used for GC development questions involving
the main development trunk. Issues arising from field use of
older versions should be addressed to hotspot-gc-use at o.j.n,
so I've taken the liberty of sending this to hotspot-gc-use at o.j.n.

All the best!
-- ramki

>  
> Regards,
> Srini.
> 
> --- On *Thu, 9/29/11, Ramki Ramakrishna /<y.s.ramakrishna at oracle.com>/* 
> wrote:
> 
> 
>     From: Ramki Ramakrishna <y.s.ramakrishna at oracle.com>
>     Subject: Re: Intermittent issue with long concurrent marking phase
>     To: "Srini Padman" <srini_was at yahoo.com>
>     Cc: hotspot-gc-dev at openjdk.java.net
>     Date: Thursday, September 29, 2011, 4:24 AM
> 
>     Hi Srini -- (inline below)
> 
>     On 9/28/2011 4:50 AM, Srini Padman wrote:
>>
>>     Questions:
>>      
>>     1\ is it clear based on the description above that the issue is
>>     identical to 6692906 (http://bugs.sun.com/view_bug.do?bug_id=6692906)?
>>
> 
>     Very likely the same bug.
> 
>>     2\ will we benefit by upgrading to a more recent JRE [1.6.0_26
>>     being the one under consideration]?
>>
> 
>     Definitely worth trying.
> 
>>     3\ I have seen recommendations to use
>>     "-XX:-CMSConcurrentMTEnabled" on some web forums - but I have
>>     concerns about this; if we don't allow for concurrent marking to
>>     use multiple threads, then isn't there a danger of marking
>>     proceeding so slowly that we might end up running out of memory
>>     [i.e., garbage created much faster than it is collected]?
>>
> 
>     Your concerns are very legitimate (especially given the length of
>     the concurrent mark phase and the number of cores you have).
> 
>>      
>>     Any help is greatly appreciated. Please let me know if any
>>     additional information is needed at all. I haven't attached the
>>     full GC log (it caused problems with posting) but will gladly send
>>     it directly to anybody who would like.
>>
> 
>     The long initial mark pause is definitely concerning. Does it show up
>     regularly in the GC logs, or is the snippet above an anomaly? Curiously,
>     as the process time shows, the user and system times are both low but
>     the elapsed time is very large. That looks like a total stall of the
>     process, and I have no conjectures based on the available data.
> 
>     I suggest talking with your Java support folk if you reproduce this
>     after upgrading to 6u28 (or whatever).
> 
>     best regards.
>     -- ramki
> 
>>
>>     Regards,
>>     Srini.
>>

From y.s.ramakrishna at oracle.com  Fri Sep 30 08:50:02 2011
From: y.s.ramakrishna at oracle.com (Ramki Ramakrishna)
Date: Fri, 30 Sep 2011 08:50:02 -0700
Subject: Intermittent issue with long concurrent marking phase
In-Reply-To: <1317387913.58856.YahooMailClassic@web162018.mail.bf1.yahoo.com>
References: <1317387913.58856.YahooMailClassic@web162018.mail.bf1.yahoo.com>
Message-ID: <4E85E52A.4050105@oracle.com>

Hi Srini --

It is true that the livelock can in fact resolve itself in some cases,
provided there are external actions by mutators that result in that
resolution. So longish stalls that resolve themselves are indeed possible.
However, sometimes that action will never be taken by mutators because all
of them are already stalled, waiting for GC so that they can complete an
allocation.

Thus, the lack of a complete stall does not rule out 6692906, nor does the
presence of intermittent stalls confirm it.

If your "stall" during the concurrent marking phase is easily detectable
by an external agent and you can take a pstack dump of the process in
that stalled phase, we may be able to identify with high confidence whether
this is 6692906 from a stack retrace of the process.

As regards experimenting with the workaround of turning off
CMSConcurrentMTEnabled: it's possible that the overly long
marking cycle was itself a consequence of the bug, so turning
off MT-marking via -XX:-CMSConcurrentMTEnabled might well
be a workable solution for you without having to upgrade to get
the fix for 6692906.

It is of course easiest if you can reproduce this problem in a test
set-up, so that you can run these experiments easily and make the right
determination.

-- ramki

On 9/30/2011 6:05 AM, Srini Padman wrote:
> Hi Ramki,
>   
> Apologies, both for mis-reading your original response (re: long initial-mark phases) and for choosing the wrong list. Thank you very much for redirecting it to gc-use.
>
> I just want to clarify a couple of points from your last response, for the record.
>
> To answer the question about the long stop-the-world initial marking phase: this is the longest I know of, but we have seen other instances where it lasted 3-4 seconds. In those cases as well, the "user/sys" times were much smaller than the "real" time so things clearly seem to be completely stalled. Also as a matter of background - the reason we moved to using the CMS collectors was that, prior to this, we were occasionally seeing extremely long (sometimes lasting more than a minute) full GCs. It is quite possible that the same factors that caused such long full GCs in the past are causing somewhat shorter (but still not _short_) initial mark with the CMS collector. In any case, I didn't view this as being related to 6692906 to begin with, and am glad to get confirmation that you don't think it is either.
>
> Regarding your point "A definite symptom of 6692906 can be diagnosed if the JVM completely stalls into a livelock (a few threads in the JVM intermittently active, but your application will forever stop  forward progress from that point on). It doesn't look like you have observed that latter symptom however?" - my understanding of the symptom (based on the summary at http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2010-February/001549.html, for example) was that the wait/stall resolves itself after tens of seconds for reasons unknown. Our symptom is closer to this, in that the application does not stop _forever_ from that point onwards. [It is entirely possible that there were further findings beyond what is in the posting that I am not aware of.]
>
> We will initiate efforts to reach out to JVM support - of course, in the meanwhile, any feedback or help on this forum is very welcome!
>
> Regards,
> Srini.

From srini_was at yahoo.com  Fri Sep 30 08:57:46 2011
From: srini_was at yahoo.com (Srini Padman)
Date: Fri, 30 Sep 2011 08:57:46 -0700 (PDT)
Subject: Intermittent issue with long concurrent marking phase
Message-ID: <1317398266.52316.YahooMailClassic@web162020.mail.bf1.yahoo.com>

A quick follow up on this:

Testing with JRE 1.6.0_26 (which we were hoping to upgrade to), our QE team has reported a JVM crash (zipped crash dump file attached). This seems almost exactly the same as the one reported here (supposedly introduced in update 25 and fixed in update 27, which would put us right in the crosshairs):

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7042582
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7013538

Before I assume that: the suggested workaround is -XX:-DoEscapeAnalysis - does this mean that the default is +DoEscapeAnalysis? In other words, is that bug something that affects everybody or only those who explicitly use -XX:+DoEscapeAnalysis? We don't use +DoEscapeAnalysis explicitly, so there is no reason for it to affect us if this is not an option enabled by default.

Of course, we are confirming that this is not merely a matter of memory constraints (seems unlikely, since the tests are the same as were run on previous JREs).

Regards,
Srini.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: app02_hs_err_pid29448.zip
Type: application/x-zip-compressed
Size: 7741 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20110930/76d081b1/attachment-0001.bin 

From y.s.ramakrishna at oracle.com  Fri Sep 30 09:51:56 2011
From: y.s.ramakrishna at oracle.com (Ramki Ramakrishna)
Date: Fri, 30 Sep 2011 09:51:56 -0700
Subject: Intermittent issue with long concurrent marking phase
In-Reply-To: <1317398266.52316.YahooMailClassic@web162020.mail.bf1.yahoo.com>
References: <1317398266.52316.YahooMailClassic@web162020.mail.bf1.yahoo.com>
Message-ID: <4E85F3AC.4010208@oracle.com>

I am sorry, you should probably talk with Java support -- there are too many
entangled vectors here, and it wouldn't be appropriate for us to use this
list for your support problems. I hope you understand.

Having said that, to find out whether your JVM has a certain option on
or off by default, you can run it with -XX:+PrintFlagsFinal and grep for
the flag of your choice, to find the answer to your question:

% $java -XX:+PrintFlagsFinal -version | grep -i escape
      bool DoEscapeAnalysis                          = true            {C2 product}
      bool EstimateArgEscape                         = true            {product}
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)
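
The same check can be scripted rather than eyeballed. The following is a
minimal Python sketch, not part of the original exchange, that parses
-XX:+PrintFlagsFinal output into a dictionary. The function name parse_flags
and the sample text are illustrative assumptions; the regular expression
assumes the "type name = value" layout shown above, and also accepts ":=",
which HotSpot typically prints for flags changed from their default.

```python
import re

# Matches lines such as:
#   "     bool DoEscapeAnalysis      = true      {C2 product}"
# The trailing "{...}" category is ignored; it may also wrap onto its own line.
FLAG_LINE = re.compile(r"^\s*(\S+)\s+(\S+)\s+:?=\s*([^\s{]*)")

def parse_flags(text):
    """Return {flag_name: (type, value)} parsed from PrintFlagsFinal output."""
    flags = {}
    for line in text.splitlines():
        m = FLAG_LINE.match(line)
        if m:
            ftype, name, value = m.groups()
            flags[name] = (ftype, value)
    return flags

# Illustrative sample, copied from the grep output above.
SAMPLE = """\
      bool DoEscapeAnalysis                          = true
{C2 product}
      bool EstimateArgEscape                         = true
{product}
java version "1.6.0_26"
"""

flags = parse_flags(SAMPLE)
print(flags["DoEscapeAnalysis"])  # ('bool', 'true')
```

In practice the text would come from running the JVM of interest with
-XX:+PrintFlagsFinal -version and capturing its standard output.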

-- ramki

On 9/30/2011 8:57 AM, Srini Padman wrote:
> A quick follow up on this:
>
> Testing with JRE 1.6.0_26 (which we were hoping to upgrade to), our QE team has reported a JVM crash (zipped crash dump file attached). This seems almost exactly the same as the one reported here (supposedly introduced in update 25 and fixed in update 27, which would put us right in the crosshairs):
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7042582
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7013538
>
> Before I assume that: the suggested workaround is -XX:-DoEscapeAnalysis - does this mean that the default is +DoEscapeAnalysis? In other words, is that bug something that affects everybody or only those who explicitly use -XX:+DoEscapeAnalysis? We don't use +DoEscapeAnalysis explicitly, so there is no reason for it to affect us if this is not an option enabled by default.
>
> Of course, we are confirming that this is not merely a matter of memory constraints (seems unlikely, since the tests are the same as were run on previous JREs).
>
> Regards,
> Srini.
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From srini_was at yahoo.com  Fri Sep 30 06:05:13 2011
From: srini_was at yahoo.com (Srini Padman)
Date: Fri, 30 Sep 2011 06:05:13 -0700 (PDT)
Subject: Intermittent issue with long concurrent marking phase
Message-ID: <1317387913.58856.YahooMailClassic@web162018.mail.bf1.yahoo.com>

Hi Ramki,

Apologies, both for mis-reading your original response (re: long initial-mark phases) and for choosing the wrong list. Thank you very much for redirecting it to gc-use.

I just want to clarify a couple of points from your last response, for the record.

To answer the question about the long stop-the-world initial marking phase: this is the longest I know of, but we have seen other instances where it lasted 3-4 seconds. In those cases as well, the "user/sys" times were much smaller than the "real" time so things clearly seem to be completely stalled. Also as a matter of background - the reason we moved to using the CMS collectors was that, prior to this, we were occasionally seeing extremely long (sometimes lasting more than a minute) full GCs. It is quite possible that the same factors that caused such long full GCs in the past are causing somewhat shorter (but still not _short_) initial-mark pauses with the CMS collector. In any case, I didn't view this as being related to 6692906 to begin with, and am glad to get confirmation that you don't think it is either.
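
The "real much larger than user plus sys" pattern described above can be searched for mechanically. Below is a minimal Python sketch, assuming the "[Times: user=... sys=..., real=... secs]" line format that the JVM's GC logging prints. The function name find_stalls, both thresholds, and the second sample line are hypothetical and chosen only for illustration; they would need tuning against a real log.

```python
import re

# "[Times: user=9.77 sys=0.00, real=5.78 secs]" as printed in HotSpot GC logs.
TIMES = re.compile(r"\[Times: user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs\]")

def find_stalls(log_text, min_real=1.0, max_cpu_ratio=0.5):
    """Return (user, sys, real) tuples for pauses that look like process stalls.

    A pause is flagged when it lasted at least min_real seconds of wall-clock
    time but consumed less than max_cpu_ratio * real seconds of CPU time.
    Both thresholds are arbitrary assumptions, not JVM-defined values.
    """
    stalls = []
    for m in TIMES.finditer(log_text):
        user, sys_t, real = (float(g) for g in m.groups())
        if real >= min_real and (user + sys_t) < max_cpu_ratio * real:
            stalls.append((user, sys_t, real))
    return stalls

# First line: a busy multi-threaded pause (CPU time exceeds wall time).
# Second line: a made-up example of a stall (almost no CPU time used).
SAMPLE_LOG = """\
 [Times: user=9.77 sys=0.00, real=5.78 secs]
 [Times: user=0.05 sys=0.01, real=3.50 secs]
"""

for user, sys_t, real in find_stalls(SAMPLE_LOG):
    print("possible stall: user=%.2f sys=%.2f real=%.2f" % (user, sys_t, real))
```

A pause flagged this way says only that the collector threads were not running for most of the pause; it does not say why (scheduling, paging, virtualization and I/O are all candidates).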

Regarding your point "A definite symptom of 6692906 can be diagnosed if the JVM completely stalls into a livelock (a few threads in the JVM intermittently active, but your application will forever stop forward progress from that point on). It doesn't look like you have observed that latter symptom however?" - my understanding of the symptom (based on the summary at http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2010-February/001549.html, for example) was that the wait/stall resolves itself after tens of seconds for reasons unknown. Our symptom is closer to this, in that the application does not stop _forever_ from that point onwards. [It is entirely possible that there were further findings beyond what is in the posting that I am not aware of.]

We will initiate efforts to reach out to JVM support - of course, in the meanwhile, any feedback or help on this forum is very welcome!

Regards,
Srini.

--- On Thu, 9/29/11, Y. S. Ramakrishna <y.s.ramakrishna at oracle.com> wrote:


From: Y. S. Ramakrishna <y.s.ramakrishna at oracle.com>
Subject: Re: Intermittent issue with long concurrent marking phase
To: "Srini Padman" <srini_was at yahoo.com>
Cc: hotspot-gc-use at openjdk.java.net
Date: Thursday, September 29, 2011, 8:54 PM


Hi Srini -- As I indicated, if you cannot upgrade easily to test
if the issue is fixed, you should probably engage JVM support to
get to a proper diagnosis of the issue affecting your production systems.

more inline below ...

On 09/29/11 07:44, Srini Padman wrote:
> Hi Ramki,
>  Thank you very much for your reply.
>  It is not *always* that the concurrent marking phase takes this long, although it happens often enough. For example, in the full GC log corresponding to the snippet I pasted in my posting (attached, zipped) there is only that one instance.

I see that there is one instance of the long _initial-mark_ pause
and, as I stated, the whole process seems stalled at that time.
It's definitely not the stall/livelock issue of CR 6692906.

[In other words, you may be dealing with several different issues
here and you will need to disentangle them.]

>  I think I know why you are asking - based on my understanding of Bug # 6692906 (more accurately, based on discussions around it on this list), I was under the impression that such long CM phases will happen all the time (if they happen at all). Does the fact that it is intermittent raise the possibility that this is a different issue? I realize that you might not be able to answer this based on the bits of information you have, but perhaps the full GC log will tell you something that you don't already know.

Again, my question about once/many was only about the long initial mark pause
of which there is exactly one in your log.

The stall of mutators during concurrent marking, which you are
conjecturing is 6692906, is a different issue: for that you should
either upgrade and test, or seek JVM support help. It is definitely
possible that the symptoms of 6692906 will happen infrequently or
intermittently. A definite symptom of 6692906 can be diagnosed if
the JVM completely stalls into a livelock (a few threads in the
JVM intermittently active, but your application will forever
stop forward progress from that point on). It doesn't look like
you have observed that latter symptom however?

Sorry I can't really help more at this time, but perhaps someone
from the community may be able to. But I really suggest either
an upgrade or seeking JVM support help. (This is not a professional
support alias.)

In general, the GC-use alias is better suited to questions such as this.
GC-dev should be used for GC development questions involving
the main development trunk. Issues with field use of
older versions should be addressed to hotspot-gc-use at o.j.n.
So I've taken the liberty of sending this to hotspot-gc-use at o.j.n.

All the best!
-- ramki

>  Regards,
> Srini.
> 
> --- On *Thu, 9/29/11, Ramki Ramakrishna /<y.s.ramakrishna at oracle.com>/* wrote:
> 
> 
>     From: Ramki Ramakrishna <y.s.ramakrishna at oracle.com>
>     Subject: Re: Intermittent issue with long concurrent marking phase
>     To: "Srini Padman" <srini_was at yahoo.com>
>     Cc: hotspot-gc-dev at openjdk.java.net
>     Date: Thursday, September 29, 2011, 4:24 AM
> 
>     Hi Srini -- (inline below)
> 
>     On 9/28/2011 4:50 AM, Srini Padman wrote:
>> 
>>     Questions:
>>     1\ is it clear based on the description above that the issue is
>>     identical to 6692906 (http://bugs.sun.com/view_bug.do?bug_id=6692906)?
>> 
> 
>     Very likely the same bug.
> 
>>     2\ will we benefit by upgrading to a more recent JRE [1.6.0_26
>>     being the one under consideration]?
>> 
> 
>     Definitely worth trying.
> 
>>     3\ I have seen recommendations to use
>>     "-XX:-CMSConcurrentMTEnabled" on some web forums - but I have
>>     concerns about this; if we don't allow for concurrent marking to
>>     use multiple threads, then isn't there a danger of marking
>>     proceeding so slowly that we might end up running out of memory
>>     [i.e., garbage created much faster than it is collected]?
>> 
> 
>     Your concerns are very legitimate (especially given the length of
>     the concurrent mark phase) and the number of cores you have.
> 
>>     Any help is greatly appreciated. Please let me know if any
>>     additional information is needed at all. I haven't attached the
>>     full GC log (it caused problems with posting) but will gladly send
>>     it directly to anybody who would like.
>> 
> 
>     The long initial mark pause is definitely concerning -- does it show
>     up regularly in the GC logs or is the snippet above an anomaly?
>     Curiously, as the process time shows, the user and system time are
>     both low but the elapsed time is very large. That looks like a total
>     stall of the process, and I have no conjectures based on available
>     data.
> 
>     I suggest talking with your Java support folk if you reproduce this
>     after upgrading to 6u28 (or whatever).
> 
>     best regards.
>     -- ramki
> 
>> 
>>     Regards,
>>     Srini.
>> 


From kirk at kodewerk.com  Fri Sep 30 10:55:56 2011
From: kirk at kodewerk.com (Charles K Pepperdine)
Date: Fri, 30 Sep 2011 19:55:56 +0200
Subject: Intermittent issue with long concurrent marking phase
In-Reply-To: <1317387913.58856.YahooMailClassic@web162018.mail.bf1.yahoo.com>
References: <1317387913.58856.YahooMailClassic@web162018.mail.bf1.yahoo.com>
Message-ID: <F8EC47AA-2774-48CA-9898-3FCF4D6D8B10@kodewerk.com>

Hi,

Apparent GC stalls can be caused by a number of conditions, which makes them almost impossible to diagnose over a web forum like this. To get through this you'd need a thorough examination of the environment. You're using Windows Server 2003? Well, thread scheduling sucks in that environment. Add in virtualization, where thread scheduling can do funny things, and you've got a sucky situation on top of another sucky situation. Are you using SANs? How many VMs are running on the box? Thread pool sizes? Sources of unbounded thread pooling? Not enough memory? The list of possibilities goes on before you even start thinking about the JVM. Hitting a list with fragments of information simply isn't a substitute for a thorough architectural review. Yes, I do these, and sorry, I've no time to take on another one this year.

Kind regards,
Kirk Pepperdine
