CMS collection keep working during holiday

Wed Oct 8 07:15:26 UTC 2008

Hi Ken --

You do see of course that in the gc.zip that you made available,
there were 4 promotion failures, followed by compacting collections,
as listed below in chronological order:-

163116.821: [GC 163116.821: [ParNew: 873856K->174720K(873856K), 0.4830286 secs] 1508484K->961405K(4019584K), 0.4831137 secs] [Times: user=1.59 sys=0.01, real=0.
49 secs]
...
165310.759: [GC 165310.759: [ParNew (promotion failed): 873856K->873856K(873856K), 0.8919148 secs]165311.651:[CMS: 1911965K->785641K(3145728K), 2.9784355 secs] 2733287K->785641K(4019584K), 3.8704903 secs] [Times: user=4.80 sys=0.16, real=3.88 secs]
...
231411.554: [GC 231411.554: [ParNew: 873856K->174720K(873856K), 0.5915618 secs] 1676990K->1122182K(4019584K), 0.5916525 secs] [Times: user=1.45 sys=0.00, real=0
.59 secs]
...
233946.300: [GC 233946.300: [ParNew (promotion failed): 873856K->873856K(873856K), 0.9428034 secs]233947.243: [CMS: 2172134K->570964K(3145728K), 2.7655578 secs] 2969591K->570964K(4019584K), 3.7084973 secs] [Times: user=5.17 sys=0.09, real=3.70 secs]
...
247864.029: [GC 247864.029: [ParNew (promotion failed): 873856K->873855K(873856K), 0.8641325 secs]247864.893: [CMS: 1845477K->499108K(3145728K), 2.2010342 secs] 2628538K->499108K(4019584K), 3.0652834 secs] [Times: user=4.30 sys=0.06, real=3.06 secs]
...
248187.856: [GC 248187.856: [ParNew: 699136K->174719K(873856K), 0.3926926 secs] 1050016K->610565K(4019584K), 0.3927797 secs] [Times: user=1.45 sys=0.03, real=0.
39 secs]
...
265611.034: [GC 265611.034: [ParNew (promotion failed): 873856K->873856K(873856K), 0.6459451 secs]265611.680: [CMS: 1854348K->502128K(3145728K), 2.2556597 secs] 2714667K->502128K(4019584K), 2.9017069 secs] [Times: user=4.19 sys=0.08, real=2.89 secs]
...
502712.808: [GC 502712.808: [ParNew: 873856K->174720K(873856K), 1.2972687 secs] 3806995K->3167193K(4019584K), 1.2973536 secs] [Times: user=1.75 sys=0.00, real=1
.30 secs]
...
504379.249: [GC 504379.249: [ParNew: 873856K->873856K(873856K), 0.0000241 secs]504379.249: [CMS504379.358: [CMS-concurrent-abortable-preclean: 0.209/2.471 secs] [Times: user=1.28 sys=0.00, real=2.48 secs] (concurrent mode failure): 2938816K->424257K(3145728K), 5.2334306 secs] 3812672K->424257K(4019584K), 5.2335621 secs] [Times:
 user=5.11 sys=0.02, real=5.23 secs]

In other words, CMS often does not manage to clear the allocated
memory. But the CMS collection preceding the promotion failure
indicates heap usage only a little more than what you see after
the mark-compact collection. There is almost certainly something
fishy going on here. It may be related to Reference objects and their
referents being split across generations, but it's not completely clear.

[Perhaps you can check if your application makes use of Soft/Weak/References
in some funny way, so as to perhaps cause the Reference and referent to be
split between the old and new generations?]

Now one thing we know about the difference between CMS and a full mark-compact
collection is that CMS treats the young gen as a source of roots for the
CMS and perm gen collection, whereas this is not so for the latter.
In pathological cases, a dead cycle spanning the young gen and the
old gen would remain uncollected by CMS but be collected by the mark-compact
collection. Now, I don't think that would be happening in your case, but I
do see that the frequency of ParNew collections in your case is fairly low --
about 1 every 6 minutes or so. With a tenuring threshold of 11, it's possible
for a dead cycle to live for a full hour before it gets promoted into the
old generation whence it would be eligible for collection.

In some cases, the CMS initiating occupancy fraction has not been hit
(and a CMS collection not even attempte) before a promotion failure
occurs. This would appear to call for more frequent CMS collections
following ParNew collections, even something drastic like:-

    CMSInitiatingOccupancyFraction=25

This would of course cause your back to back CMS collections that
would peg your CPU, in order to avoid that, you might disable the
CMS "timeout" via:-

     CMSWaitDuration=1200000

i.e. make it 20 minutes (=120000 ms).

In addition, I find that your scavenges are unnecessarily copying
survivors back-and-forth 11 times or so, I'd just scrunch that down
to:-

      MaxTenuringThreshold=1 (or at most 2)

This may or may not address the problem you are seeing here but
it's at least a start towards eliminating at least some of
the suspects and of course of eliminating the "busy work"
that CMS is doing during the holidays.

-- ramki

----- Original Message -----
From: "Ken-- at newsgroupstats.hk" <dragonken at gmail.com>
Date: Tuesday, October 7, 2008 11:14 pm
Subject: Re: CMS collection keep working during holiday
To: hotspot-gc-dev at openjdk.java.net

> Thanks to your quick response. I will do it now but it require 1 - 2 
> days to
> reproduce the result.
> 
> It seems that my case is very special.
> 
> What the application is simply a stock quote client server and store daily
> stock quote and chart data in memory. The lived objects need around 80M
> memory at day open and 500M memory at day end.
> 
> 
> 
> 
> y.s.ramakrishna wrote:
> > 
> > 
> > Could you run with the additional option +PrintReferenceGC and see if
> > it sheds additional light on the problem? (Please share the resulting
> > GC logs.)
> > 
> > If at all possible, you might try rev'ing down to 6.0 and seeing
> > if the same behaviour reproduces or not?
> > 
> > I understand that some or all of this may not be possible in a
> > production setting.
> > 
> > -- ramki
> > 
> > ----- Original Message -----
> > From: "Ken-- at newsgroupstats.hk" <dragonken at gmail.com>
> > Date: Tuesday, October 7, 2008 10:17 pm
> > Subject: Re: CMS collection keep working during holiday
> > To: hotspot-gc-dev at openjdk.java.net
> > 
> > 
> >> Follow up to the jconsole above. This gif is captured today. You 
> can 
> >> see the
> >> first CMS after a FULL GC is working. The 2nd one is working too 
> but the
> >> memory free from 2nd generation is much less than the first CMS. The
> >> memory
> >> free by 3rd CMS is further decreased.
> >> 
> >> The 5th CMS cannot free enough memory to make occanpcy in old-gen 
> to less
> >> than 70%. So, continue non-stopping CMS happen again until a Full GC
> >> (triggered by Concurrent Mode Failure).
> >> 
> >> http://www.nabble.com/file/p19872463/jconsole_20081008.jpg 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> -- 
> >> View this message in context:
> >> http://www.nabble.com/CMS-collection-keep-working-during-holiday-tp19773575p19872463.html
> >> Sent from the OpenJDK Hotspot Garbage Collection mailing list 
> archive 
> >> at Nabble.com.
> >> 
> > 
> > 
> 
> -- 
> View this message in context: http://www.nabble.com/CMS-collection-keep-working-during-holiday-tp19773575p19872900.html
> Sent from the OpenJDK Hotspot Garbage Collection mailing list archive 
> at Nabble.com.
>