CMS Promotion Failures

Tue Nov 16 02:31:04 UTC 2010

  Brian,

I think Ramki and Shaun have addressed most of your questions.
I just wanted to know what type of platform (how many hardware
threads) you're using.  Also, what is CMS doing when the
promotion failures happen (concurrent marking,
precleaning, or sweeping)?

Jon
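
For reference, a minimal sketch of how the platform question could be answered from inside the running JVM. The class name PlatformInfo is hypothetical; the calls are the standard java.lang and java.lang.management APIs. The collector names it prints depend on the JVM version and the GC flags in use, so no particular output is assumed here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of the original thread.
public class PlatformInfo {
    // Number of hardware threads (logical processors) visible to this JVM.
    static int hardwareThreads() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Hardware threads visible to the JVM: " + hardwareThreads());
        // List the collectors that are active, with cumulative counts and times.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total time (ms)=" + gc.getCollectionTime());
        }
    }
}
```

The phase CMS was in at the time of a promotion failure still has to be read from the GC log; this sketch only reports thread count and cumulative collector totals.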

On 11/15/2010 11:29 AM, Brian Williams wrote:
> Greetings,
> We'd like some pointers on how to tune to avoid (or, more realistically, delay as long as possible) promotion failures with CMS.  Our server maintains an in-memory database cache that, on appropriate hardware, could take up over 100GB of RAM.
>
> Through what we've been able to find online, and lots of experimentation, we've made a lot of progress in tuning GC to work well for us.  We have the same problem that others with similar access patterns have--no matter what, we eventually seem to hit a promotion failure, which triggers a STW serial collection.
>
> Here are the general principles that we've arrived at to delay the promotion failure:
>
> 1.  Limit how much data is promoted to just what is actually old garbage.  This can be done by having a large new size, survivor size, and tenuring threshold.
>
> 2.  Use as large a heap as possible, regardless of the size of the database cache that's needed.
>
> 3.  If possible, fully preload the database cache into memory at startup, and then perform a System.gc() to fully compact the old generation.  This will start things off with as little fragmentation as possible.
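
A minimal sketch of principle 3 above, assuming a simple in-process map stands in for the real database cache. The class name, the CACHE field, and the entry sizes are all hypothetical. The flags in the comment are standard HotSpot options that correspond to principles 1 and 2, with sizes deliberately left as placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the server's startup path, not the poster's code.
// One possible launch line (sizes are placeholders to be tuned):
//   java -XX:+UseConcMarkSweepGC -Xms<heap> -Xmx<heap> -Xmn<young>
//        -XX:SurvivorRatio=<n> -XX:MaxTenuringThreshold=<n> PreloadThenCompact
public class PreloadThenCompact {
    // Assumption: a plain map plays the role of the database cache.
    static final Map<Integer, byte[]> CACHE = new HashMap<>();

    // Load every entry up front so the long-lived data is allocated together.
    static int preload(int entries, int entrySize) {
        for (int i = 0; i < entries; i++) {
            CACHE.put(i, new byte[entrySize]);
        }
        return CACHE.size();
    }

    public static void main(String[] args) {
        preload(10000, 1024);
        // Request a full collection once the cache is warm. Note that
        // -XX:+ExplicitGCInvokesConcurrent turns this into a concurrent
        // (non-compacting) cycle, and -XX:+DisableExplicitGC makes it a no-op.
        System.gc();
        System.out.println("Preloaded " + CACHE.size() + " entries");
    }
}
```

System.gc() is only a request to the JVM, so it is worth confirming in the GC log that a full, compacting collection actually ran after the preload.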
>
> A few questions:
>
> 1.  Is it better to have CMSInitiatingOccupancyFraction set closer to the amount of live data in the server so that CMS runs more frequently or to set this value as high as possible without running into a concurrent mode failure?
>
> 2.  Would running with -XX:+AlwaysPreTouch make any difference?
>
> 3.  We've seen mentioned on this list that there are additional things that can be done to tune against promotion failures, e.g. "As regards fragmentation, it can be tricky to tune against, but we can try once we understand a bit more about the object sizes and demographics."  But we haven't seen any pointers for how to go about this.  Can you point us in the right direction?
>
> 4.  Would changing any of the PLAB/TLAB settings make a difference?
>
> 5.  What are the main factors that affect the duration of a promotion failure?  Is it the amount of live data in bytes, the number of live objects, the total size of the heap, etc.?
>
> 6.  Are there any other JVM settings that we should try, or any other advice?
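
As background for question 1, a minimal sketch of estimating how full each heap pool is after its most recent collection, which gives a rough lower bound to compare against a candidate CMSInitiatingOccupancyFraction. The class name and the percent helper are hypothetical; the MXBean calls are the standard java.lang.management API. Pool names vary by collector, and the post-collection figure can include floating garbage, so treat it as an approximation of live data rather than an exact measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical monitoring helper, not part of the original thread.
public class OldGenOccupancy {
    // Percentage of 'capacity' taken by 'used'; -1 when capacity is unknown.
    static double percent(long used, long capacity) {
        return capacity > 0 ? 100.0 * used / capacity : -1.0;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            // Usage measured right after the last collection of this pool;
            // null if the JVM does not support it for the pool.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (afterGc == null) {
                continue;
            }
            System.out.printf("%s: %.1f%% of committed space in use after last GC%n",
                    pool.getName(), percent(afterGc.getUsed(), afterGc.getCommitted()));
        }
    }
}
```

If the old generation's post-collection occupancy sits close to the initiating fraction, CMS would be expected to run almost back to back; a large gap leaves more headroom but lets more garbage accumulate between cycles.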
>
> By the way, we have given G1 a try, but we're still getting full GCs pretty frequently.
>
> Sorry for all of the questions.  We definitely appreciate any help you can offer.
>
> Brian
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use