Perplexing GC Time Growth

Y.S.Ramakrishna at Sun.COM
Wed Jan 2 14:58:32 PST 2008


Jason Vasquez wrote:
> I've enabled the flag and am starting the test now.  It takes a few  
> hours before we can see definite trending, but I'll let you know what  
> we see.
> 
> And yes, we do a fair bit of classloading/reflection work in the  
> application.

On the hunch that the accumulation of classes in the perm gen and the
system dictionary is the root cause of the trouble here (which the
explicit class unloading will now hopefully take care of -- recall
that, to keep remark pause times low, CMS does not unload classes by
default), you might want to monitor the size of the perm gen (with and
without the CMSClassUnloadingEnabled option) by means of
"jstat -gc ...", which shows perm gen utilization under the "PU"
column.
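
A minimal monitoring command along those lines (the pid and the
10-second sampling interval are placeholders):

  # sample the GC counters every 10 seconds; compare the PU (perm gen
  # utilization, KB) and PC (perm gen capacity, KB) columns across the
  # runs with and without CMSClassUnloadingEnabled
  jstat -gc -t <pid> 10s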

This (the lack of class unloading with CMS unless explicitly
requested) will hopefully be much less of an issue going forward as a
result of work currently being done towards 6543076 (some of which
occurred under 6634032, recently discussed on this list); until then
you might find the feature in 6541037 (available as of 6u4 in a
production JVM) useful.
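
Concretely, the experiment amounts to adding one flag to the launch
command quoted further below (a sketch only; depending on the exact
6ux release, a companion perm-gen-sweeping flag may or may not also be
required):

  # have CMS unload classes and clean up the perm gen at remark,
  # at some cost in remark pause time
  -XX:+CMSClassUnloadingEnabled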

I look forward to the results of your test; thanks!
-- ramki

> 
> Thanks,
> jason
> 
> 
> On Jan 2, 2008, at 17:02, Y.S.Ramakrishna at Sun.COM wrote:
> 
>> Very cool log, showing a near-linear and monotonic growth not only
>> in scavenge times, but also in initial-mark and remark times; see
>> the attached plots produced by the GC histogram tool (from Tony
>> Printezis, John Coomes et al.). Looking at the log file itself, one
>> also sees that the amount of live data following each GC is
>> increasing monotonically.
>>
>> While we think of potential reasons for this, or mull over
>> appropriate sensors that can lay bare the root cause here, could
>> you, on a hunch, run a quick experiment and tell me whether adding
>> the option -XX:+CMSClassUnloadingEnabled makes any difference to the
>> observations above? [Does your application do lots of class loading,
>> reflection, or string interning?]
>>
>> thanks.
>> -- ramki
>>
>> Jason Vasquez wrote:
>>> I've attached a representative garbage log.   To my eye, I don't  
>>> see anything specific that seems to indicate an event, but  
>>> hopefully more experienced eyes will see something different :)
>>> Thanks,
>>> -jason
>>> On Jan 1, 2008, at 15:46, Y Srinivas Ramakrishna wrote:
>>>> It's probably a combination of card-scanning times and allocation  
>>>> slow-down
>>>> (but probably more of the former).
>>>>
>>>> We've had some internal instrumentation of card-scanning times in
>>>> the JVM which, unfortunately, has not made it into the JVM code
>>>> proper because it is not lightweight enough to be enabled in
>>>> production. Perhaps a spin on a test system with the card-scanning
>>>> times explicitly called out might shed some light.
>>>>
>>>> Basically, what happens with CMS is that allocation is from free
>>>> lists, and, lacking something like big-bag-of-pages (BBOP)
>>>> allocation, this has traditionally tended to scatter the allocated
>>>> objects over a large number of pages. This increases card-scanning
>>>> times, although one would normally expect them to eventually
>>>> stabilize.
>>>>
>>>> Do the scavenge times increase suddenly after a specific event, or
>>>> do they just creep up slowly after each scavenge? The complete GC
>>>> log would be useful for answering that question.
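
A log detailed enough to answer that question can be produced with
flags along the following lines (these are the same logging options
that already appear in the configuration quoted further below):

  # timestamped, detailed GC log written to garbage.log
  -Xloggc:garbage.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps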
>>>>
>>>> -- ramki
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Jason Vasquez <jason at mugfu.com>
>>>> Date: Monday, December 31, 2007 10:21 am
>>>> Subject: Perplexing GC Time Growth
>>>> To: hotspot-gc-use at openjdk.java.net
>>>>
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm having a perplexing problem -- the garbage collector appears to be
>>>>> functioning well, with a nice object/garbage lifecycle, yet minor GC
>>>>> times increase over the life of the process inexplicably.  We are
>>>>> working with telephony hardware with this application, so keeping GC
>>>>> pauses very low is quite important.  (keeping well below 100 ms would
>>>>> be ideal)
>>>>>
>>>>> Here is the current configuration we are using:
>>>>>
>>>>> -server \
>>>>> -Xloggc:garbage.log \
>>>>> -XX:+PrintGCDetails \
>>>>> -Dsun.rmi.dgc.server.gcInterval=3600000 \
>>>>> -Dsun.rmi.dgc.client.gcInterval=3600000 \
>>>>> -XX:ParallelGCThreads=8 \
>>>>> -XX:+UseParNewGC \
>>>>> -XX:+UseConcMarkSweepGC \
>>>>> -XX:+PrintGCTimeStamps \
>>>>> -XX:-TraceClassUnloading \
>>>>> -XX:+AggressiveOpts \
>>>>> -Xmx512M \
>>>>> -Xms512M \
>>>>> -Xmn128M \
>>>>> -XX:MaxTenuringThreshold=6 \
>>>>> -XX:+ExplicitGCInvokesConcurrent
>>>>>
>>>>> A large number of our bigger objects, size-wise, live for approximately
>>>>> 4-5 minutes, thus the larger young generation and tenuring threshold.
>>>>> This seems to be successful in filtering most objects before they
>>>>> reach the tenured gen.  (8 core x86 server, running 1.6.0_03-b05 on
>>>>> 32-bit Linux, kernel rev 2.6.18)
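
One way to verify that objects really are dying in the survivor spaces
before reaching the tenuring threshold is to log the survivor age
histogram at each minor GC; a minimal sketch, using a standard HotSpot
flag that is not part of the configuration above:

  # print the object-age distribution in the survivor spaces at each
  # scavenge, so the effect of MaxTenuringThreshold=6 shows up in the log
  -XX:+PrintTenuringDistribution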
>>>>>
>>>>> Here is a representative snippet of our garbage log:
>>>>>
>>>>> 487.135: [GC 487.135: [ParNew: 112726K->7290K(118016K), 0.0218110 secs] 134494K->29058K(511232K), 0.0220520 secs]
>>>>> 557.294: [GC 557.294: [ParNew: 112250K->7976K(118016K), 0.0204220 secs] 134018K->29744K(511232K), 0.0206690 secs]
>>>>> 607.025: [GC 607.025: [ParNew: 112936K->7831K(118016K), 0.0231230 secs] 134704K->30003K(511232K), 0.0233670 secs]
>>>>> 672.522: [GC 672.522: [ParNew: 112791K->7361K(118016K), 0.0253620 secs] 134963K->29533K(511232K), 0.0256080 secs]
>>>>> ...
>>>>> 4006.635: [GC 4006.635: [ParNew: 112983K->7386K(118016K), 0.0385960 secs] 141969K->36608K(511232K), 0.0388460 secs]
>>>>> 4083.066: [GC 4083.066: [ParNew: 112346K->8439K(118016K), 0.0365940 secs] 141568K->37661K(511232K), 0.0368340 secs]
>>>>> 4158.457: [GC 4158.457: [ParNew: 113399K->7152K(118016K), 0.0360540 secs] 142621K->36374K(511232K), 0.0362990 secs]
>>>>> 4228.312: [GC 4228.313: [ParNew: 112112K->8738K(118016K), 0.0362510 secs] 141334K->38083K(511232K), 0.0365050 secs]
>>>>> 4293.800: [GC 4293.800: [ParNew: 113698K->8294K(118016K), 0.0368700 secs] 143043K->37917K(511232K), 0.0371160 secs]
>>>>> ...
>>>>> 10489.555: [GC 10489.556: [ParNew: 112701K->7770K(118016K), 0.0639770 secs] 151966K->47156K(511232K), 0.0642210 secs]
>>>>> 10562.544: [GC 10562.544: [ParNew: 112730K->9267K(118016K), 0.0625900 secs] 152116K->48772K(511232K), 0.0628470 secs]
>>>>> 10622.558: [GC 10622.558: [ParNew: 114227K->8361K(118016K), 0.0675730 secs] 153732K->48381K(511232K), 0.0678220 secs]
>>>>> 10678.842: [GC 10678.842: [ParNew: 113056K->7214K(118016K), 0.0669330 secs] 153076K->47234K(511232K), 0.0671800 secs]
>>>>> ...
>>>>> 177939.062: [GC 177939.062: [ParNew: 112608K->8620K(118016K), 0.7681440 secs] 466132K->362144K(511232K), 0.7684030 secs]
>>>>> 178005.483: [GC 178005.483: [ParNew: 113449K->7731K(118016K), 0.7677300 secs] 466973K->361893K(511232K), 0.7679890 secs]
>>>>> 178069.658: [GC 178069.658: [ParNew: 112670K->6814K(118016K), 0.7700020 secs] 466832K->360976K(511232K), 0.7702590 secs]
>>>>> 178133.513: [GC 178133.513: [ParNew: 111717K->7920K(118016K), 0.7702920 secs] 465879K->362082K(511232K), 0.7705560 secs]
>>>>>
>>>>> As you can see, the GC times continue to increase over time, on the
>>>>> order of about 10-20 ms per hour.  CMS runs are spaced very far apart;
>>>>> in fact, since most objects die before reaching the tenured
>>>>> generation, the CMS is triggered more by RMI DGC runs than by heap
>>>>> growth.  (We were getting serial GCs, apparently due to RMI DGC,
>>>>> before adding -XX:+ExplicitGCInvokesConcurrent.)
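
For reference, the settings involved in that interaction are the two
RMI DGC interval properties and the explicit-GC flag from the
configuration above (a recap of those lines, with their intent spelled
out; 3600000 ms is one hour):

  # have RMI distributed GC request a collection at most once per hour,
  # and have ExplicitGCInvokesConcurrent turn the resulting System.gc()
  # calls into concurrent CMS cycles rather than stop-the-world full GCs
  -Dsun.rmi.dgc.server.gcInterval=3600000 \
  -Dsun.rmi.dgc.client.gcInterval=3600000 \
  -XX:+ExplicitGCInvokesConcurrent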
>>>>>
>>>>> Here's some representative output from `jstat -gcutil -t -h10 <pid> 2s`:
>>>>>
>>>>> Timestamp     S0     S1      E      O      P    YGC    YGCT  FGC   FGCT    GCT
>>>>>   11067.6  55.74   0.00  89.32   9.73  59.84    168   7.471    8  0.280  7.751
>>>>>   11069.6  55.74   0.00  93.65   9.73  59.84    168   7.471    8  0.280  7.751
>>>>>   11071.6  55.74   0.00  99.34   9.73  59.84    168   7.471    8  0.280  7.751
>>>>>   11073.5   0.00  62.22   2.89   9.76  59.84    169   7.537    8  0.280  7.816
>>>>>   11075.6   0.00  62.22   4.80   9.76  59.84    169   7.537    8  0.280  7.816
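
For readers less familiar with -gcutil, the command and its columns,
annotated (pid is a placeholder):

  # -gcutil reports occupancy as a percentage of each space's capacity:
  #   S0, S1  survivor spaces    E  eden    O  old (tenured)    P  perm gen
  #   YGC, YGCT  minor GC count and cumulative seconds
  #   FGC, FGCT  full GC count and cumulative seconds
  #   GCT        total GC seconds
  # -t prepends a timestamp; -h10 repeats the header every 10 samples;
  # 2s takes a sample every 2 seconds
  jstat -gcutil -t -h10 <pid> 2s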
>>>>>
>>>>> Survivor spaces continue to sit at about 50-65% occupancy, which seems
>>>>> fairly good to my eye.  Eden fills approximately every 70 seconds,
>>>>> triggering minor GCs.
>>>>>
>>>>>
>>>>> Any ideas?  This is becoming quite frustrating for us -- our
>>>>> application uptime is pretty horrible with the too-frequent scheduled
>>>>> restarts we are being forced to run.
>>>>>
>>>>> Thanks for any assistance you might be able to offer,
>>>>> -Jason Vasquez
>> [attached plots: scavenge.png, cms.png]
> 
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use



