CMS collection keep working during holiday

kirk kirk.pepperdine at gmail.com
Wed Oct 8 09:30:32 UTC 2008


Y Srinivas Ramakrishna wrote:
> Hi Kirk --
>
>   
>> Nice analysis. Maybe I've missed it but I've not seen anything that
>> rules out a simple object leak. Until the leak has been ruled out, would
>> you prefer to keep objects out of tenured rather than quickly promoting
>> them to a space that is already under enormous pressure? Do you see
>> extra copying as even being relevant to this problem?
>>     
>
> The extra copying is relevant only in as much as avoiding it will
> reduce the scavenge pauses to approximately 1/5 (or lower) of
> their current value. However, you are right that this will not
> change by much the promotion _rate_ into the tenured gen.
> (it will increase by a tad, but not by much.) Thus, tenuring
> after aging for a single scavenge appears to be a win-win.
>   
OK, that makes sense.
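(And the numbers back it up: the logs show scavenges re-copying roughly
170 MB of survivors each time; tenuring after a single aging cycle would
cut most of that repeated copying, which is presumably where your
factor-of-five pause estimate comes from.)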
> Now the reason I suggested the quick tenuring was because
> I suspected that the difference between CMS and mark-compact
> collections is showing up because the new gen is treated
> as a source of roots into the old gen, and perhaps the quicker
> tenuring will reduce the number of "false roots", if you will
> (see my reference to Reference objects and to cycles
> spanning across the generations; the idea is to avoid
> carrying around such potential garbage).
>   
OK, but the jconsole snapshots show that the old gen is quite full and isn't emptying...
> You are right though that perhaps it's a simple object leak.
> What makes it strange, and somewhat unlikely to be a simple
> memory leak, is how a mark-compact collection reclaims the
> space whereas a concurrent collection does not (even accounting
> for some small amount of floating garbage because of concurrent
> marking -- of which there should be little in the present case
> of long inter-scavenge durations and frequent back-to-back CMS
> collections).
>   
Ok, this is the bit that I missed. So this actually appears to be strong 
evidence that CMS is failing. Same application only two different 
implementations of memory management, one reclaims and the other 
doesn't. I was thinking in relation to your comment on false roots was 
that forcing more young collections might actually result in more being 
released if your hypothesis is correct in this case.
> As regards pressure on the tenured gen: the logs actually indicate
> that the promotion rate into the old gen is not that great (note
> that each scavenge promotes only about 10-11 MB or so of data
> into the old gen and scavenges happen only about once every 6 minutes,
> making for a promotion rate of about 30 KB/s, which is fairly low).
> As you will have noted, the CMS cycles are essentially "busy work"
> cycles (in that the work is unnecessary and does not accomplish
> anything, and thus must not be mistaken for a genuine need for a
> lot of work). It is true, however, that there is some "pressure"
> in the sense that the occupancy, as described by Ken, keeps
> increasing over the course of the day ...
>   
Right, I wasn't concerned with the rate, only with occupancy: throwing
stuff into tenured when it is mostly full...
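(Your arithmetic checks out, for what it's worth: ~10.5 MB promoted per
scavenge, at one scavenge every ~6 minutes, is 10.5 MB / 360 s, or about
30 KB/s.)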
What do you think the effect of creating a large survivor space might have
on the problem? I'm wondering if a limited Eden would trigger more
scavenges, which wouldn't have to deal with "false" roots. Of course this
would increase the overall copying cost, but I have managed to trick JVMs
into being stable for long periods of time using this technique in the
past. No matter, it would be interesting to bang through a day's worth of
activity with the VM under a scope.
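For example, something along these lines (sizes purely illustrative and
untested against this app, "QuoteServer" standing in for the real main
class; this also folds in the flags you suggest below):

    java -Xmx4g -Xmn768m -XX:SurvivorRatio=2 \
         -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=25 \
         -XX:CMSWaitDuration=1200000 \
         -XX:MaxTenuringThreshold=1 \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         QuoteServer

With -Xmn768m and SurvivorRatio=2 that would give a 384m Eden and two 192m
survivor spaces: a much smaller Eden (hence more frequent scavenges) and
roomier survivors than the current setup.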

Regards,
Kirk
> -- ramki
>
>   
>> Regards,
>> Kirk
>>
>> Y Srinivas Ramakrishna wrote:
>>     
>>> Hi Ken --
>>>
>>> You do see of course that in the gc.zip that you made available,
>>> there were 4 promotion failures, followed by compacting collections,
>>> as listed below in chronological order:-
>>>
>>> 163116.821: [GC 163116.821: [ParNew: 873856K->174720K(873856K), 0.4830286 secs] 1508484K->961405K(4019584K), 0.4831137 secs] [Times: user=1.59 sys=0.01, real=0.49 secs]
>>> ...
>>> 165310.759: [GC 165310.759: [ParNew (promotion failed): 873856K->873856K(873856K), 0.8919148 secs]165311.651: [CMS: 1911965K->785641K(3145728K), 2.9784355 secs] 2733287K->785641K(4019584K), 3.8704903 secs] [Times: user=4.80 sys=0.16, real=3.88 secs]
>>> ...
>>> 231411.554: [GC 231411.554: [ParNew: 873856K->174720K(873856K), 0.5915618 secs] 1676990K->1122182K(4019584K), 0.5916525 secs] [Times: user=1.45 sys=0.00, real=0.59 secs]
>>> ...
>>> 233946.300: [GC 233946.300: [ParNew (promotion failed): 873856K->873856K(873856K), 0.9428034 secs]233947.243: [CMS: 2172134K->570964K(3145728K), 2.7655578 secs] 2969591K->570964K(4019584K), 3.7084973 secs] [Times: user=5.17 sys=0.09, real=3.70 secs]
>>> ...
>>> 247864.029: [GC 247864.029: [ParNew (promotion failed): 873856K->873855K(873856K), 0.8641325 secs]247864.893: [CMS: 1845477K->499108K(3145728K), 2.2010342 secs] 2628538K->499108K(4019584K), 3.0652834 secs] [Times: user=4.30 sys=0.06, real=3.06 secs]
>>> ...
>>> 248187.856: [GC 248187.856: [ParNew: 699136K->174719K(873856K), 0.3926926 secs] 1050016K->610565K(4019584K), 0.3927797 secs] [Times: user=1.45 sys=0.03, real=0.39 secs]
>>> ...
>>> 265611.034: [GC 265611.034: [ParNew (promotion failed): 873856K->873856K(873856K), 0.6459451 secs]265611.680: [CMS: 1854348K->502128K(3145728K), 2.2556597 secs] 2714667K->502128K(4019584K), 2.9017069 secs] [Times: user=4.19 sys=0.08, real=2.89 secs]
>>> ...
>>> 502712.808: [GC 502712.808: [ParNew: 873856K->174720K(873856K), 1.2972687 secs] 3806995K->3167193K(4019584K), 1.2973536 secs] [Times: user=1.75 sys=0.00, real=1.30 secs]
>>> ...
>>> 504379.249: [GC 504379.249: [ParNew: 873856K->873856K(873856K), 0.0000241 secs]504379.249: [CMS504379.358: [CMS-concurrent-abortable-preclean: 0.209/2.471 secs] [Times: user=1.28 sys=0.00, real=2.48 secs] (concurrent mode failure): 2938816K->424257K(3145728K), 5.2334306 secs] 3812672K->424257K(4019584K), 5.2335621 secs] [Times: user=5.11 sys=0.02, real=5.23 secs]
>>>
>>> In other words, CMS often does not manage to clear the allocated
>>> memory. But the CMS collection preceding the promotion failure
>>> indicates heap usage only a little more than what you see after
>>> the mark-compact collection. There is almost certainly something
>>> fishy going on here. It may be related to Reference objects and their
>>> referents being split across generations, but it's not completely clear.
>>>
>>> [Perhaps you can check whether your application makes use of Soft/Weak
>>> References in some funny way, so as to cause the Reference and referent
>>> to be split between the old and new generations?]
>>>
>>> Now one thing we know about the difference between CMS and a full mark-compact
>>> collection is that CMS treats the young gen as a source of roots for the
>>> CMS and perm gen collection, whereas this is not so for the latter.
>>> In pathological cases, a dead cycle spanning the young gen and the
>>> old gen would remain uncollected by CMS but be collected by the mark-compact
>>> collection. Now, I don't think that would be happening in your case, but I
>>> do see that the frequency of ParNew collections in your case is fairly low --
>>> about 1 every 6 minutes or so. With a tenuring threshold of 11, it's possible
>>> for a dead cycle to live for a full hour before it gets promoted into the
>>> old generation, whence it would be eligible for collection.
>>>
>>> In some cases, the CMS initiating occupancy fraction has not been hit
>>> (and a CMS collection not even attempted) before a promotion failure
>>> occurs. This would appear to call for more frequent CMS collections
>>> following ParNew collections, even something drastic like:-
>>>
>>>     CMSInitiatingOccupancyFraction=25
>>>
>>> This would of course cause back-to-back CMS collections that
>>> would peg your CPU; in order to avoid that, you might disable the
>>> CMS "timeout" via:-
>>>
>>>      CMSWaitDuration=1200000
>>>
>>> i.e. make it 20 minutes (= 1200000 ms).
>>>
>>> In addition, I find that your scavenges are unnecessarily copying
>>> survivors back and forth 11 times or so; I'd just scrunch that down
>>> to:-
>>>
>>>       MaxTenuringThreshold=1 (or at most 2)
>>>
>>> This may or may not address the problem you are seeing here, but
>>> it's at least a start towards eliminating some of the suspects,
>>> and of course towards eliminating the "busy work" that CMS is
>>> doing during the holidays.
>>>
>>> -- ramki
>>>
>>> ----- Original Message -----
>>> From: "Ken-- at newsgroupstats.hk" <dragonken at gmail.com>
>>> Date: Tuesday, October 7, 2008 11:14 pm
>>> Subject: Re: CMS collection keep working during holiday
>>> To: hotspot-gc-dev at openjdk.java.net
>>>
>>>
>>>   
>>>       
>>>> Thanks for your quick response. I will do it now, but it will take 1 - 2
>>>> days to reproduce the result.
>>>>
>>>> It seems that my case is very special.
>>>>
>>>> The application is simply a stock quote client/server that stores daily
>>>> stock quote and chart data in memory. The live objects need around 80M
>>>> of memory at day open and 500M at day end.
>>>>
>>>>
>>>>
>>>>
>>>> y.s.ramakrishna wrote:
>>>>     
>>>>         
>>>>> Could you run with the additional option +PrintReferenceGC and see if
>>>>> it sheds additional light on the problem? (Please share the resulting
>>>>> GC logs.)
>>>>>
>>>>> If at all possible, you might try rev'ing down to 6.0 and seeing
>>>>> if the same behaviour reproduces or not?
>>>>>
>>>>> I understand that some or all of this may not be possible in a
>>>>> production setting.
>>>>>
>>>>> -- ramki
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Ken-- at newsgroupstats.hk" <dragonken at gmail.com>
>>>>> Date: Tuesday, October 7, 2008 10:17 pm
>>>>> Subject: Re: CMS collection keep working during holiday
>>>>> To: hotspot-gc-dev at openjdk.java.net
>>>>>
>>>>>
>>>>>       
>>>>>           
>>>>>> Follow-up to the jconsole screenshot above. This GIF was captured
>>>>>> today. You can see the first CMS after a full GC is working. The 2nd
>>>>>> one is working too, but the memory freed from the 2nd generation is
>>>>>> much less than by the first CMS. The memory freed by the 3rd CMS is
>>>>>> further decreased.
>>>>>>
>>>>>> The 5th CMS cannot free enough memory to bring occupancy in the old
>>>>>> gen below 70%, so continuous non-stop CMS cycles happen again until a
>>>>>> full GC (triggered by a concurrent mode failure).
>>>>>>
>>>>>> http://www.nabble.com/file/p19872463/jconsole_20081008.jpg 
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> View this message in context:
>>>>>> http://www.nabble.com/CMS-collection-keep-working-during-holiday-tp19773575p19872463.html
>>>>>> Sent from the OpenJDK Hotspot Garbage Collection mailing list archive at Nabble.com.
>>>>
>>>> -- 
>>>> View this message in context: http://www.nabble.com/CMS-collection-keep-working-during-holiday-tp19773575p19872900.html
>>>> Sent from the OpenJDK Hotspot Garbage Collection mailing list archive at Nabble.com.
>
>   




More information about the hotspot-gc-dev mailing list