Perplexing GC Time Growth
Jason Vasquez
jason at mugfu.com
Thu Jan 3 02:12:58 PST 2008
One other thing of interest to note: beyond the classloading/
reflection work, we do quite a bit of JNI interaction as well. I'm not
sure how that would affect GC times, but I wanted to make sure that
information was in the mix. I have done some cursory audits of that
code, and it appears that all heap-allocated objects are being freed
properly from the native side of things, so I think we should be OK in
that regard.
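
For what it's worth, a minimal sketch of the pattern we audit for (the
class and method names here are invented purely for illustration):

// Illustrative only: every native allocation is paired with an explicit
// release, so nothing malloc'd on the C heap outlives its Java-side owner.
public final class NativeCodec {
    private long handle;                        // opaque pointer handed back by the JNI code

    private static native long allocate();      // native side malloc()s a struct, returns its address
    private static native void release(long h); // native side free()s it

    public NativeCodec() {
        handle = allocate();
    }

    public void close() {
        if (handle != 0) {
            release(handle);                    // the paired free for allocate() above
            handle = 0;
        }
    }
}

// Callers wrap usage in try/finally so the native memory is released even
// when an exception is thrown:
//
//     NativeCodec codec = new NativeCodec();
//     try { /* ... use codec ... */ } finally { codec.close(); }
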
-jason
On Jan 2, 2008, at 17:58, Y.S.Ramakrishna at Sun.COM wrote:
> Jason Vasquez wrote:
>> I've enabled the flag and am starting the test now. It takes a
>> few hours before we can see definite trending, but I'll let you
>> know what we see.
>> And yes, we do a fair bit of classloading/reflection work in the
>> application.
>
> On the hunch that the root cause of the trouble here is the
> accumulation of classes in the perm gen and the system dictionary
> (which the explicit class unloading will now hopefully take care of --
> recall that CMS does not unload classes by default, in order to keep
> CMS remark pause times low), you might want to monitor the size of the
> perm gen (with and without the CMSClassUnloadingEnabled option) by
> means of "jstat -gc ...", which shows perm gen utilization under the
> "PU" column.
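>
> For example, sampling every 10 seconds with something like the
> following (the exact header layout may vary between JDK builds, but
> PU is the column to watch across CMS cycles, with and without the
> flag):
>
>   $ jstat -gc <pid> 10s
>    S0C    S1C    S0U    S1U      EC       EU        OC        OU       PC     PU    YGC    YGCT   FGC   FGCT     GCT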
>
> This (the lack of class unloading with CMS unless explicitly
> requested) will hopefully become much less of an issue going forward
> as a result of work currently being done towards 6543076 (some of
> which occurred under 6634032, recently discussed on this list); until
> then you might find the feature in 6541037 (available as of 6u4 in a
> production JVM) useful.
>
> I look forward to the results of your test; thanks!
> -- ramki
>
>> Thanks,
>> jason
>> On Jan 2, 2008, at 17:02, Y.S.Ramakrishna at Sun.COM wrote:
>>> Very cool log, showing near-linear, monotonic growth not only in
>>> scavenge times but also in initial mark and remark times; see the
>>> attached plots produced by the GC histogram tool (from Tony
>>> Printezis, John Coomes et al.). Looking at the log file as well, one
>>> sees that the live data following a GC is also increasing
>>> monotonically.
>>>
>>> While we think of potential reasons for this, or mull over
>>> appropriate sensors that could lay bare the root cause here, could
>>> you, on a hunch, run a quick experiment and tell me whether adding
>>> the option -XX:+CMSClassUnloadingEnabled makes any difference to the
>>> observations above? [Does your application do lots of class loading,
>>> reflection, or string interning?]
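>>>
>>> Concretely, that is just one more line added to the option list from
>>> your original post, e.g. (only the adjacent options shown):
>>>
>>>   -XX:+UseConcMarkSweepGC \
>>>   -XX:+CMSClassUnloadingEnabled \
>>>   ...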
>>>
>>> thanks.
>>> -- ramki
>>>
>>> Jason Vasquez wrote:
>>>> I've attached a representative garbage log. To my eye, I don't
>>>> see anything specific that seems to indicate an event, but
>>>> hopefully more experienced eyes will see something different :)
>>>> Thanks,
>>>> -jason
>>>> On Jan 1, 2008, at 15:46, Y Srinivas Ramakrishna wrote:
>>>>> It's probably a combination of card-scanning times and allocation
>>>>> slow-down (but probably more of the former).
>>>>>
>>>>> We've had some internal instrumentation of card-scanning times in
>>>>> the JVM which unfortunately has not made it into the JVM code
>>>>> proper, because the instrumentation is not lightweight enough to be
>>>>> enabled in production. Perhaps a spin on a test system with the
>>>>> card-scanning times explicitly called out might shed light.
>>>>>
>>>>> Basically what happens with CMS is that allocation is from free
>>>>> lists, and lacking something like big-bag-of-pages (BBOP)
>>>>> allocation, this has traditionally tended to scatter the allocated
>>>>> objects over a large number of pages. This increases card-scanning
>>>>> times, although one would normally expect that this would
>>>>> eventually stabilize.
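>>>>>
>>>>> To put rough numbers on it for your configuration (assuming the
>>>>> usual 512-byte card size): the old gen is 512 MB - 128 MB = 384 MB,
>>>>> or about 384*1024*1024/512 = 786,432 cards. The more widely the
>>>>> promoted objects are scattered across those pages, the more of
>>>>> those cards are dirty at each scavenge and have to be scanned.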
>>>>>
>>>>> Do the scavenge times increase suddenly after a specific event, or
>>>>> do they just creep up slowly after each scavenge? The complete GC
>>>>> log would be useful in answering that question.
>>>>>
>>>>> -- ramki
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Jason Vasquez <jason at mugfu.com>
>>>>> Date: Monday, December 31, 2007 10:21 am
>>>>> Subject: Perplexing GC Time Growth
>>>>> To: hotspot-gc-use at openjdk.java.net
>>>>>
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm having a perplexing problem -- the garbage collector appears
>>>>>> to be functioning well, with a nice object/garbage lifecycle, yet
>>>>>> minor GC times increase inexplicably over the life of the process.
>>>>>> We are working with telephony hardware in this application, so
>>>>>> keeping GC pauses very low is quite important (staying well below
>>>>>> 100 ms would be ideal).
>>>>>>
>>>>>> Here is the current configuration we are using:
>>>>>>
>>>>>> -server \
>>>>>> -Xloggc:garbage.log \
>>>>>> -XX:+PrintGCDetails \
>>>>>> -Dsun.rmi.dgc.server.gcInterval=3600000 \
>>>>>> -Dsun.rmi.dgc.client.gcInterval=3600000 \
>>>>>> -XX:ParallelGCThreads=8 \
>>>>>> -XX:+UseParNewGC \
>>>>>> -XX:+UseConcMarkSweepGC \
>>>>>> -XX:+PrintGCTimeStamps \
>>>>>> -XX:-TraceClassUnloading \
>>>>>> -XX:+AggressiveOpts \
>>>>>> -Xmx512M \
>>>>>> -Xms512M \
>>>>>> -Xmn128M \
>>>>>> -XX:MaxTenuringThreshold=6 \
>>>>>> -XX:+ExplicitGCInvokesConcurrent
>>>>>>
>>>>>> Many of our larger objects live for approximately 4-5 minutes,
>>>>>> hence the larger young generation and tenuring threshold. This
>>>>>> seems to be successful in filtering most objects before they reach
>>>>>> the tenured gen. (8-core x86 server, running 1.6.0_03-b05 on
>>>>>> 32-bit Linux, kernel rev 2.6.18)
>>>>>>
>>>>>> Here is a representative snippet of our garbage log:
>>>>>>
>>>>>> 487.135: [GC 487.135: [ParNew: 112726K->7290K(118016K), 0.0218110
>>>>>> secs] 134494K->29058K(511232K), 0.0220520 secs]
>>>>>> 557.294: [GC 557.294: [ParNew: 112250K->7976K(118016K), 0.0204220
>>>>>> secs] 134018K->29744K(511232K), 0.0206690 secs]
>>>>>> 607.025: [GC 607.025: [ParNew: 112936K->7831K(118016K), 0.0231230
>>>>>> secs] 134704K->30003K(511232K), 0.0233670 secs]
>>>>>> 672.522: [GC 672.522: [ParNew: 112791K->7361K(118016K), 0.0253620
>>>>>> secs] 134963K->29533K(511232K), 0.0256080 secs]
>>>>>> ...
>>>>>> 4006.635: [GC 4006.635: [ParNew: 112983K->7386K(118016K), 0.0385960
>>>>>> secs] 141969K->36608K(511232K), 0.0388460 secs]
>>>>>> 4083.066: [GC 4083.066: [ParNew: 112346K->8439K(118016K), 0.0365940
>>>>>> secs] 141568K->37661K(511232K), 0.0368340 secs]
>>>>>> 4158.457: [GC 4158.457: [ParNew: 113399K->7152K(118016K), 0.0360540
>>>>>> secs] 142621K->36374K(511232K), 0.0362990 secs]
>>>>>> 4228.312: [GC 4228.313: [ParNew: 112112K->8738K(118016K), 0.0362510
>>>>>> secs] 141334K->38083K(511232K), 0.0365050 secs]
>>>>>> 4293.800: [GC 4293.800: [ParNew: 113698K->8294K(118016K), 0.0368700
>>>>>> secs] 143043K->37917K(511232K), 0.0371160 secs]
>>>>>> ...
>>>>>> 10489.555: [GC 10489.556: [ParNew: 112701K->7770K(118016K), 0.0639770
>>>>>> secs] 151966K->47156K(511232K), 0.0642210 secs]
>>>>>> 10562.544: [GC 10562.544: [ParNew: 112730K->9267K(118016K), 0.0625900
>>>>>> secs] 152116K->48772K(511232K), 0.0628470 secs]
>>>>>> 10622.558: [GC 10622.558: [ParNew: 114227K->8361K(118016K), 0.0675730
>>>>>> secs] 153732K->48381K(511232K), 0.0678220 secs]
>>>>>> 10678.842: [GC 10678.842: [ParNew: 113056K->7214K(118016K), 0.0669330
>>>>>> secs] 153076K->47234K(511232K), 0.0671800 secs]
>>>>>> ...
>>>>>> 177939.062: [GC 177939.062: [ParNew: 112608K->8620K(118016K),
>>>>>> 0.7681440 secs] 466132K->362144K(511232K), 0.7684030 secs]
>>>>>> 178005.483: [GC 178005.483: [ParNew: 113449K->7731K(118016K),
>>>>>> 0.7677300 secs] 466973K->361893K(511232K), 0.7679890 secs]
>>>>>> 178069.658: [GC 178069.658: [ParNew: 112670K->6814K(118016K),
>>>>>> 0.7700020 secs] 466832K->360976K(511232K), 0.7702590 secs]
>>>>>> 178133.513: [GC 178133.513: [ParNew: 111717K->7920K(118016K),
>>>>>> 0.7702920 secs] 465879K->362082K(511232K), 0.7705560 secs]
>>>>>>
>>>>>> As you can see, the GC times continue to increase over time, on
>>>>>> the order of about 10-20 ms per hour. CMS runs are spaced very far
>>>>>> apart; in fact, since most objects die before reaching the tenured
>>>>>> generation, CMS is triggered more by RMI DGC runs than by heap
>>>>>> growth. (We were getting serial GCs, apparently due to RMI DGC,
>>>>>> before adding -XX:+ExplicitGCInvokesConcurrent.)
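>>>>>>
>>>>>> (Rough arithmetic from the snippets above: the ParNew pause grows
>>>>>> from ~22 ms at t=487 s to ~770 ms at t=178,133 s, i.e. about
>>>>>> 748 ms over roughly 49 hours, or ~15 ms per hour.)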
>>>>>>
>>>>>> Here's some representative output from `jstat -gcutil -t -h10
>>>>>> <pid> 2s`:
>>>>>>
>>>>>> Timestamp     S0     S1      E      O      P   YGC    YGCT  FGC    FGCT     GCT
>>>>>>   11067.6  55.74   0.00  89.32   9.73  59.84   168   7.471    8   0.280   7.751
>>>>>>   11069.6  55.74   0.00  93.65   9.73  59.84   168   7.471    8   0.280   7.751
>>>>>>   11071.6  55.74   0.00  99.34   9.73  59.84   168   7.471    8   0.280   7.751
>>>>>>   11073.5   0.00  62.22   2.89   9.76  59.84   169   7.537    8   0.280   7.816
>>>>>>   11075.6   0.00  62.22   4.80   9.76  59.84   169   7.537    8   0.280   7.816
>>>>>>
>>>>>> Survivor spaces continue to sit at about 50-65% occupancy, which
>>>>>> seems fairly good to my eye. Eden fills approximately every 70
>>>>>> seconds, triggering minor GCs.
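>>>>>>
>>>>>> (If that holds, and the tenuring threshold stays at 6, an object
>>>>>> has to survive roughly 6 x 70 s = ~7 minutes of scavenges before
>>>>>> it is promoted, so our 4-5 minute objects should indeed die in the
>>>>>> survivor spaces rather than being tenured.)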
>>>>>>
>>>>>>
>>>>>> Any ideas? This is becoming quite frustrating for us -- our
>>>>>> application uptime is pretty horrible with the too-frequent
>>>>>> scheduled restarts we are being forced to run.
>>>>>>
>>>>>> Thanks for any assistance you might be able to offer,
>>>>>> -Jason Vasquez
>>>>>> _______________________________________________
>>>>>> hotspot-gc-use mailing list
>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>> <scavenge.png><cms.png>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>