Perplexing GC Time Growth

Jason Vasquez jason at mugfu.com
Wed Jan 2 22:22:43 UTC 2008


I've enabled the flag and am starting the test now.  It takes a few
hours before we can see a definite trend, but I'll let you know what
we see.

And yes, we do a fair bit of classloading/reflection work in the  
application.
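
For reference, the test run keeps the option set from my original mail
(quoted below) and just adds the one flag alongside the CMS options,
roughly:

-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSClassUnloadingEnabled \
(all other flags unchanged)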

Thanks,
jason


On Jan 2, 2008, at 17:02, Y.S.Ramakrishna at Sun.COM wrote:

> Very cool log, showing near-linear, monotonic growth not only in
> scavenge times but also in initial-mark and remark times; see the
> attached plots produced by the gc histogram tool (from Tony Printezis,
> John Coomes et al). Looking at the log file itself, one also sees that
> the amount of live data following each GC is increasing monotonically.
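>
> (If it's helpful, the same trend can be pulled straight out of the log
> without the histogram tool -- just a rough sketch against the ParNew
> line format quoted below, printing the timestamp and the post-GC heap
> occupancy in KB:
>
> grep ParNew garbage.log | awk '
>     { sub(/:$/, "", $1)                          # strip ":" from the timestamp
>       occ = $8                                   # e.g. "134494K->29058K(511232K),"
>       sub(/.*->/, "", occ); sub(/K.*/, "", occ)  # keep only the post-GC KB figure
>       print $1, occ }'
>
> The plotted result should show the same monotonic climb.)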
>
> While we think of potential reasons for this, or mull over
> appropriate sensors that could lay bare the root cause, could you,
> on a hunch, run a quick experiment and tell me whether adding the
> option -XX:+CMSClassUnloadingEnabled makes any difference to the
> observations above? [Does your application do a lot of class
> loading, reflection, or string interning?]
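>
> (One quick way to watch for that, just as a sketch: track the permanent
> generation occupancy and the loaded-class counts over time, e.g.
>
> jstat -gcutil -t <pid> 10s    # the P column is perm gen occupancy (%)
> jstat -class -t <pid> 10s     # loaded / unloaded class counts and bytes
>
> If either keeps ratcheting upward between CMS cycles, class unloading
> is likely part of the story.)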
>
> thanks.
> -- ramki
>
> Jason Vasquez wrote:
>> I've attached a representative garbage log.   To my eye, I don't  
>> see anything specific that seems to indicate an event, but  
>> hopefully more experienced eyes will see something different :)
>> Thanks,
>> -jason
>>> On Jan 1, 2008, at 15:46, Y Srinivas Ramakrishna wrote:
>>>
>>> It's probably a combination of card-scanning times and allocation
>>> slow-down (but probably more of the former).
>>>
>>> We've had some internal instrumentation of card-scanning times in the
>>> JVM which unfortunately has not made it into the JVM code proper,
>>> because the instrumentation is not lightweight enough to be enabled in
>>> production. Perhaps a run on a test system with the card-scanning
>>> times explicitly called out might shed some light.
>>>
>>> Basically, what happens with CMS is that old-generation allocation is
>>> from free lists and, lacking something like big-bag-of-pages (BBOP)
>>> allocation, has traditionally tended to scatter the allocated objects
>>> over a large number of pages. That increases card-scanning times,
>>> although one would normally expect them to stabilize eventually.
>>>
>>> Do the scavenge times increase suddenly after a specific event, or do
>>> they just creep up slowly after each scavenge? The complete GC log
>>> would be useful for answering that question.
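>>>
>>> (As a quick sketch, once the full log is in hand, the per-scavenge
>>> pause times can be pulled out for plotting with something like
>>>
>>> grep ParNew garbage.log | awk '{ sub(/:$/, "", $1); print $1, $6 }'
>>>
>>> which prints "timestamp pause-in-seconds" pairs, assuming lines of the
>>> form shown in the snippet further below.)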
>>>
>>> -- ramki
>>>
>>>
>>> ----- Original Message -----
>>> From: Jason Vasquez <jason at mugfu.com>
>>> Date: Monday, December 31, 2007 10:21 am
>>> Subject: Perplexing GC Time Growth
>>> To: hotspot-gc-use at openjdk.java.net
>>>
>>>
>>>> Hi all,
>>>>
>>>> I'm having a perplexing problem -- the garbage collector appears to
>>>> be functioning well, with a nice object/garbage lifecycle, yet minor
>>>> GC times increase inexplicably over the life of the process.  We are
>>>> working with telephony hardware in this application, so keeping GC
>>>> pauses very low is quite important.  (Keeping well below 100 ms would
>>>> be ideal.)
>>>>
>>>> Here is the current configuration we are using:
>>>>
>>>> -server \
>>>> -Xloggc:garbage.log \
>>>> -XX:+PrintGCDetails \
>>>> -Dsun.rmi.dgc.server.gcInterval=3600000 \
>>>> -Dsun.rmi.dgc.client.gcInterval=3600000 \
>>>> -XX:ParallelGCThreads=8 \
>>>> -XX:+UseParNewGC \
>>>> -XX:+UseConcMarkSweepGC \
>>>> -XX:+PrintGCTimeStamps \
>>>> -XX:-TraceClassUnloading \
>>>> -XX:+AggressiveOpts \
>>>> -Xmx512M \
>>>> -Xms512M \
>>>> -Xmn128M \
>>>> -XX:MaxTenuringThreshold=6 \
>>>> -XX:+ExplicitGCInvokesConcurrent
>>>>
>>>> A large number of our bigger objects (size-wise) live for
>>>> approximately 4-5 minutes, hence the larger young generation and
>>>> tenuring threshold.  This seems to be successful in filtering out
>>>> most objects before they reach the tenured gen.  (8-core x86 server,
>>>> running 1.6.0_03-b05 on 32-bit Linux, kernel 2.6.18)
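>>>>
>>>> (For what it's worth, one extra flag -- just a sketch, not something
>>>> we've run yet -- would make it easy to verify that claim:
>>>>
>>>> -XX:+PrintTenuringDistribution \
>>>>
>>>> added to the options above prints the per-age survivor occupancy at
>>>> each scavenge, showing whether objects really do die before reaching
>>>> age 6.)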
>>>>
>>>> Here is a representative snippet of our garbage log:
>>>>
>>>> 487.135: [GC 487.135: [ParNew: 112726K->7290K(118016K), 0.0218110 secs] 134494K->29058K(511232K), 0.0220520 secs]
>>>> 557.294: [GC 557.294: [ParNew: 112250K->7976K(118016K), 0.0204220 secs] 134018K->29744K(511232K), 0.0206690 secs]
>>>> 607.025: [GC 607.025: [ParNew: 112936K->7831K(118016K), 0.0231230 secs] 134704K->30003K(511232K), 0.0233670 secs]
>>>> 672.522: [GC 672.522: [ParNew: 112791K->7361K(118016K), 0.0253620 secs] 134963K->29533K(511232K), 0.0256080 secs]
>>>> ...
>>>> 4006.635: [GC 4006.635: [ParNew: 112983K->7386K(118016K), 0.0385960 secs] 141969K->36608K(511232K), 0.0388460 secs]
>>>> 4083.066: [GC 4083.066: [ParNew: 112346K->8439K(118016K), 0.0365940 secs] 141568K->37661K(511232K), 0.0368340 secs]
>>>> 4158.457: [GC 4158.457: [ParNew: 113399K->7152K(118016K), 0.0360540 secs] 142621K->36374K(511232K), 0.0362990 secs]
>>>> 4228.312: [GC 4228.313: [ParNew: 112112K->8738K(118016K), 0.0362510 secs] 141334K->38083K(511232K), 0.0365050 secs]
>>>> 4293.800: [GC 4293.800: [ParNew: 113698K->8294K(118016K), 0.0368700 secs] 143043K->37917K(511232K), 0.0371160 secs]
>>>> ...
>>>> 10489.555: [GC 10489.556: [ParNew: 112701K->7770K(118016K), 0.0639770 secs] 151966K->47156K(511232K), 0.0642210 secs]
>>>> 10562.544: [GC 10562.544: [ParNew: 112730K->9267K(118016K), 0.0625900 secs] 152116K->48772K(511232K), 0.0628470 secs]
>>>> 10622.558: [GC 10622.558: [ParNew: 114227K->8361K(118016K), 0.0675730 secs] 153732K->48381K(511232K), 0.0678220 secs]
>>>> 10678.842: [GC 10678.842: [ParNew: 113056K->7214K(118016K), 0.0669330 secs] 153076K->47234K(511232K), 0.0671800 secs]
>>>> ...
>>>> 177939.062: [GC 177939.062: [ParNew: 112608K->8620K(118016K), 0.7681440 secs] 466132K->362144K(511232K), 0.7684030 secs]
>>>> 178005.483: [GC 178005.483: [ParNew: 113449K->7731K(118016K), 0.7677300 secs] 466973K->361893K(511232K), 0.7679890 secs]
>>>> 178069.658: [GC 178069.658: [ParNew: 112670K->6814K(118016K), 0.7700020 secs] 466832K->360976K(511232K), 0.7702590 secs]
>>>> 178133.513: [GC 178133.513: [ParNew: 111717K->7920K(118016K), 0.7702920 secs] 465879K->362082K(511232K), 0.7705560 secs]
>>>>
>>>> As you can see, the GC times continue to increase over time, on the
>>>> order of about 10-20 ms per hour.  CMS runs are in fact spaced very
>>>> far apart; since most objects die before reaching the tenured
>>>> generation, CMS is triggered more by RMI DGC runs than by heap
>>>> growth.  (We were getting serial GCs, apparently due to RMI DGC,
>>>> before adding -XX:+ExplicitGCInvokesConcurrent.)
>>>>
>>>> Here's some representative output from `jstat -gcutil -t -h10 <pid> 2s`:
>>>>
>>>> Timestamp     S0     S1      E     O      P   YGC    YGCT  FGC   FGCT    GCT
>>>>   11067.6  55.74   0.00  89.32  9.73  59.84   168   7.471    8  0.280  7.751
>>>>   11069.6  55.74   0.00  93.65  9.73  59.84   168   7.471    8  0.280  7.751
>>>>   11071.6  55.74   0.00  99.34  9.73  59.84   168   7.471    8  0.280  7.751
>>>>   11073.5   0.00  62.22   2.89  9.76  59.84   169   7.537    8  0.280  7.816
>>>>   11075.6   0.00  62.22   4.80  9.76  59.84   169   7.537    8  0.280  7.816
>>>>
>>>> Survivor spaces continue to sit at about 50-65% occupancy, which
>>>> seems fairly good to my eye.  Eden fills approximately every 70
>>>> seconds, triggering minor GCs.
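>>>>
>>>> (Back-of-the-envelope: the 118016K young-gen capacity above is eden
>>>> plus a single survivor space of 131072K (-Xmn128M) - 118016K = 13056K,
>>>> so eden is roughly 104960K, and filling that every ~70 seconds works
>>>> out to an allocation rate of about 1.5 MB/s -- e.g.
>>>>
>>>> echo "scale=2; (118016 - 13056) / 1024 / 70" | bc   # => 1.46 (MB/s)
>>>>
>>>> which, for what it's worth, is a fairly modest allocation rate.)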
>>>>
>>>>
>>>> Any ideas?  This is becoming quite frustrating for us -- our
>>>> application uptime is pretty horrible with the too-frequent scheduled
>>>> restarts we are being forced to run.
>>>>
>>>> Thanks for any assistance you might be able to offer,
>>>> -Jason Vasquez
>
> [Attachments: scavenge.png, cms.png -- plots from the gc histogram tool]

_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use


