CMS collection keep working during holiday

Wed Oct 8 08:36:40 UTC 2008

Hi Srinivas, Ken,

Nice analysis. Maybe I've missed it but I've not seen anything that 
rules out a simple object leak. Until the leak has been ruled out would 
you prefer to keep objects out of tenured rather than quickly promoting 
them to a space that is already under enormous pressure? Do you see 
extra copying as even being relevant to this problem?

Regards,
Kirk

Y Srinivas Ramakrishna wrote:
> Hi Ken --
>
> You do see of course that in the gc.zip that you made available,
> there were 4 promotion failures, followed by compacting collections,
> as listed below in chronological order:-
>
> 163116.821: [GC 163116.821: [ParNew: 873856K->174720K(873856K), 0.4830286 secs] 1508484K->961405K(4019584K), 0.4831137 secs] [Times: user=1.59 sys=0.01, real=0.
> 49 secs]
> ...
> 165310.759: [GC 165310.759: [ParNew (promotion failed): 873856K->873856K(873856K), 0.8919148 secs]165311.651:[CMS: 1911965K->785641K(3145728K), 2.9784355 secs] 2733287K->785641K(4019584K), 3.8704903 secs] [Times: user=4.80 sys=0.16, real=3.88 secs]
> ...
> 231411.554: [GC 231411.554: [ParNew: 873856K->174720K(873856K), 0.5915618 secs] 1676990K->1122182K(4019584K), 0.5916525 secs] [Times: user=1.45 sys=0.00, real=0
> .59 secs]
> ...
> 233946.300: [GC 233946.300: [ParNew (promotion failed): 873856K->873856K(873856K), 0.9428034 secs]233947.243: [CMS: 2172134K->570964K(3145728K), 2.7655578 secs] 2969591K->570964K(4019584K), 3.7084973 secs] [Times: user=5.17 sys=0.09, real=3.70 secs]
> ...
> 247864.029: [GC 247864.029: [ParNew (promotion failed): 873856K->873855K(873856K), 0.8641325 secs]247864.893: [CMS: 1845477K->499108K(3145728K), 2.2010342 secs] 2628538K->499108K(4019584K), 3.0652834 secs] [Times: user=4.30 sys=0.06, real=3.06 secs]
> ...
> 248187.856: [GC 248187.856: [ParNew: 699136K->174719K(873856K), 0.3926926 secs] 1050016K->610565K(4019584K), 0.3927797 secs] [Times: user=1.45 sys=0.03, real=0.
> 39 secs]
> ...
> 265611.034: [GC 265611.034: [ParNew (promotion failed): 873856K->873856K(873856K), 0.6459451 secs]265611.680: [CMS: 1854348K->502128K(3145728K), 2.2556597 secs] 2714667K->502128K(4019584K), 2.9017069 secs] [Times: user=4.19 sys=0.08, real=2.89 secs]
> ...
> 502712.808: [GC 502712.808: [ParNew: 873856K->174720K(873856K), 1.2972687 secs] 3806995K->3167193K(4019584K), 1.2973536 secs] [Times: user=1.75 sys=0.00, real=1
> .30 secs]
> ...
> 504379.249: [GC 504379.249: [ParNew: 873856K->873856K(873856K), 0.0000241 secs]504379.249: [CMS504379.358: [CMS-concurrent-abortable-preclean: 0.209/2.471 secs] [Times: user=1.28 sys=0.00, real=2.48 secs] (concurrent mode failure): 2938816K->424257K(3145728K), 5.2334306 secs] 3812672K->424257K(4019584K), 5.2335621 secs] [Times:
>  user=5.11 sys=0.02, real=5.23 secs]
>
> In other words, CMS often does not manage to clear the allocated
> memory. But the CMS collection preceding the promotion failure
> indicates heap usage only a little more than what you see after
> the mark-compact collection. There is almost certainly something
> fishy going on here. It may be related to Reference objects and their
> referents being split across generations, but it's not completely clear.
>
> [Perhaps you can check if your application makes use of Soft/Weak/References
> in some funny way, so as to perhaps cause the Reference and referent to be
> split between the old and new generations?]
>
> Now one thing we know about the difference between CMS and a full mark-compact
> collection is that CMS treats the young gen as a source of roots for the
> CMS and perm gen collection, whereas this is not so for the latter.
> In pathological cases, a dead cycle spanning the young gen and the
> old gen would remain uncollected by CMS but be collected by the mark-compact
> collection. Now, I don't think that would be happening in your case, but I
> do see that the frequency of ParNew collections in your case is fairly low --
> about 1 every 6 minutes or so. With a tenuring threshold of 11, it's possible
> for a dead cycle to live for a full hour before it gets promoted into the
> old generation whence it would be eligible for collection.
>
> In some cases, the CMS initiating occupancy fraction has not been hit
> (and a CMS collection not even attempte) before a promotion failure
> occurs. This would appear to call for more frequent CMS collections
> following ParNew collections, even something drastic like:-
>
>     CMSInitiatingOccupancyFraction=25
>
> This would of course cause your back to back CMS collections that
> would peg your CPU, in order to avoid that, you might disable the
> CMS "timeout" via:-
>
>      CMSWaitDuration=1200000
>
> i.e. make it 20 minutes (=120000 ms).
>
> In addition, I find that your scavenges are unnecessarily copying
> survivors back-and-forth 11 times or so, I'd just scrunch that down
> to:-
>
>       MaxTenuringThreshold=1 (or at most 2)
>
> This may or may not address the problem you are seeing here but
> it's at least a start towards eliminating at least some of
> the suspects and of course of eliminating the "busy work"
> that CMS is doing during the holidays.
>
> -- ramki
>
> ----- Original Message -----
> From: "Ken-- at newsgroupstats.hk" <dragonken at gmail.com>
> Date: Tuesday, October 7, 2008 11:14 pm
> Subject: Re: CMS collection keep working during holiday
> To: hotspot-gc-dev at openjdk.java.net
>
>
>   
>> Thanks to your quick response. I will do it now but it require 1 - 2 
>> days to
>> reproduce the result.
>>
>> It seems that my case is very special.
>>
>> What the application is simply a stock quote client server and store daily
>> stock quote and chart data in memory. The lived objects need around 80M
>> memory at day open and 500M memory at day end.
>>
>>
>>
>>
>> y.s.ramakrishna wrote:
>>     
>>> Could you run with the additional option +PrintReferenceGC and see if
>>> it sheds additional light on the problem? (Please share the resulting
>>> GC logs.)
>>>
>>> If at all possible, you might try rev'ing down to 6.0 and seeing
>>> if the same behaviour reproduces or not?
>>>
>>> I understand that some or all of this may not be possible in a
>>> production setting.
>>>
>>> -- ramki
>>>
>>> ----- Original Message -----
>>> From: "Ken-- at newsgroupstats.hk" <dragonken at gmail.com>
>>> Date: Tuesday, October 7, 2008 10:17 pm
>>> Subject: Re: CMS collection keep working during holiday
>>> To: hotspot-gc-dev at openjdk.java.net
>>>
>>>
>>>       
>>>> Follow up to the jconsole above. This gif is captured today. You 
>>>>         
>> can 
>>     
>>>> see the
>>>> first CMS after a FULL GC is working. The 2nd one is working too 
>>>>         
>> but the
>>     
>>>> memory free from 2nd generation is much less than the first CMS. The
>>>> memory
>>>> free by 3rd CMS is further decreased.
>>>>
>>>> The 5th CMS cannot free enough memory to make occanpcy in old-gen 
>>>>         
>> to less
>>     
>>>> than 70%. So, continue non-stopping CMS happen again until a Full GC
>>>> (triggered by Concurrent Mode Failure).
>>>>
>>>> http://www.nabble.com/file/p19872463/jconsole_20081008.jpg 
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> View this message in context:
>>>> http://www.nabble.com/CMS-collection-keep-working-during-holiday-tp19773575p19872463.html
>>>> Sent from the OpenJDK Hotspot Garbage Collection mailing list 
>>>>         
>> archive 
>>     
>>>> at Nabble.com.
>>>>
>>>>         
>>>       
>> -- 
>> View this message in context: http://www.nabble.com/CMS-collection-keep-working-during-holiday-tp19773575p19872900.html
>> Sent from the OpenJDK Hotspot Garbage Collection mailing list archive 
>> at Nabble.com.
>>
>>     
>
>