CMS Promotion Failures

Mon Nov 15 14:16:05 PST 2010

On 11/15/10 11:29, Brian Williams wrote:
> Greetings,
> We'd like some pointers on how to tune to avoid (or, more realistically, delay as long as possible) promotion failures with CMS.  Our server maintains an in-memory database cache that, on appropriate hardware, could take up over 100GB of RAM.
> 
> Through what we've been able to find online, and lots of experimentation, we've made a lot of progress in tuning GC to work well for us.  We have the same problem that others with similar access patterns have--no matter what, we eventually seem to hit a promotion failure, which triggers a STW serial collection.
> 
> Here are the general principles that we've arrived at to delay the promotion failure:
> 
> 1.  Limit how much data is promoted to just what is actually old garbage.  This can be done by having a large new size, survivor size, and tenuring threshold.
> 
> 2.  Use as large a heap as possible regardless of the size of the database cache that's needed.
> 
> 3.  If possible, fully preload the database cache into memory at startup, and then perform a System.gc() to fully compact the old generation.  This will start things off with as little fragmentation as possible.
> 
> A few questions
> 
> 1.  Is it better to have CMSInitiatingOccupancyFraction set closer to the amount of live data in the server so that CMS runs more frequently or to set this value as high as possible without running into a concurrent mode failure?
> 

Somewhere in between. My experience has been that you want your CMS cycles to be
neither too frequent nor too infrequent.
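
One way to find the lower end of that range is to look at how full the old
generation is right after a collection, since the initiating occupancy has to
sit above the live data with some headroom. A minimal sketch, assuming the
standard java.lang.management API; the class name OldGenOccupancy and the
helper percent are made up for illustration, and the pool names it prints
vary by collector and JVM version:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenOccupancy {
    // Percentage of a pool in use; returns -1 when the pool has no defined maximum.
    static double percent(long used, long max) {
        return max > 0 ? 100.0 * used / max : -1.0;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage now = pool.getUsage();
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
            if (now == null) {
                continue; // pool is no longer valid
            }
            System.out.printf("%s: now %.1f%%", pool.getName(),
                    percent(now.getUsed(), now.getMax()));
            if (afterGc != null) {
                // Usage measured right after the most recent collection of this pool.
                System.out.printf(", after last GC %.1f%%",
                        percent(afterGc.getUsed(), afterGc.getMax()));
            }
            System.out.println();
        }
    }
}
```

The "after last GC" figure for the old generation pool approximates the live
data; a CMSInitiatingOccupancyFraction at or below it would keep CMS running
back to back.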

> 2.  Would running with -XX:+AlwaysPreTouch make any difference?

Only initially, until all of the old gen pages have had objects promoted into them.
On Solaris, at least, there is sometimes a cost from first touch, especially
if using very large pages. Pre-touching moves that cost out of the scavenges
and into the start-up phase.

> 
> 3.  We've seen mentioned on this list that there are additional things that can be done to tune against promotion failures, e.g. "As regards fragmentation, it can be tricky to tune against, but we can try once we understand a bit more about the object sizes and demographics."  But we haven't seen any pointers for how to go about this.  Can you point us in the right direction?
> 

The basic idea is, as you say in (1), to promote only medium- and long-lived data.
In other words, never promote any short-lived data, even under sudden load
spikes.
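
To see why the tenuring threshold matters here, consider a toy model, not a
measurement: it assumes every scavenge keeps a fixed fraction of a cohort
alive and that the survivor spaces never overflow. Both are simplifications,
and the 0.5 survival rate and the class name PromotionModel are hypothetical.

```java
public class PromotionModel {
    // Fraction of a freshly allocated cohort that is still alive, and therefore
    // promoted, after `threshold` scavenges at a constant per-scavenge survival rate.
    static double promotedFraction(double survivalPerScavenge, int threshold) {
        return Math.pow(survivalPerScavenge, threshold);
    }

    public static void main(String[] args) {
        double survival = 0.5; // hypothetical: half of the cohort dies at each scavenge
        for (int threshold : new int[] {1, 4, 8}) {
            System.out.printf("threshold %d -> %.5f of the cohort promoted%n",
                    threshold, promotedFraction(survival, threshold));
        }
    }
}
```

Real workloads do not have a constant survival rate, and objects that do not
fit in a survivor space are promoted regardless of age, so compare against
the actual age table printed by -XX:+PrintTenuringDistribution.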

> 4.  Would changing any of the PLAB/TLAB settings make a difference?

These are sized adaptively by the JVM, and it's unlikely that a static setting
will outperform the adaptive sizing, especially if you do not have steady loads.

> 
> 5.  What are the main factors that affect the duration of a promotion failure?  Is it the amount of live data in bytes, the number of live objects, the total size of the heap, etc?
> 

Yes. :-)

(More seriously: the cost is proportional to the amount copied, i.e. the live data, and to
the size of the heap, i.e. the dead data as well; the overhead is also slightly higher if you
have many small objects as opposed to a few large ones.)

> 6.  Are there any other JVM settings that we should try, other advice?

Controlling promotion rate and avoiding premature promotion of short-lived data
is the most important piece of advice.

> 
> By the way, we have given G1 a try, but we're still getting full GCs pretty frequently.

Try giving G1 a bit more heap, and instead of constraining generation sizes to what
worked best for CMS, just specify a pause-time (start higher and slowly iterate
lower) and let G1's autonomics find an optimal partitioning of the heap.
There are probably a few sharp corners of G1 that we do not yet know about; if you
bring them to our attention, we can try to fix them. One current disadvantage of G1,
which we plan to fix soon, is that it does not deal with Reference
objects during scavenges. This can put G1 at a great disadvantage, because it
ends up carrying a lot more garbage if your application happens to use
Reference objects (perhaps under the covers, via the JDK libraries
that you are using).

Look at the GC tuning talk by Charlie Hunt and Tony Printezis at this year's
JavaOne for some good advice on GC tuning in general and CMS tuning in particular.
Hopefully they will also include G1 tuning in such a talk next year :-)

best.
-- ramki

> 
> Sorry for all of the questions.  We definitely appreciate any help you can offer.
> 
> Brian
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use