Rapid fluctuation of tenuring threshold

Y Srinivas Ramakrishna Y.S.Ramakrishna at Sun.COM
Fri Jan 11 14:51:11 PST 2008


Hi James -- thanks for all of the data; the graphs as well as the logs.
Basically, the adaptive tenuring policy (for fixed survivor spaces)
appears to be working pretty well here.
When there is room in the survivor spaces to accommodate the survivors,
the threshold is relaxed; as soon as there is a spike in survivors, the
threshold is tightened. Interestingly, what appears to be happening is
that your program occasionally goes through what I shall call
"survival spikes", where the death rate of very recently allocated
objects falls sharply. (You see this as a sudden spike in the population
of age-1 objects in the PrintTenuringDistribution output.)
However, this quickly passes and those objects then
die fairly soon thereafter. When the death rate falls suddenly in this
manner, you see the adaptive tenuring policy clamp down the threshold
in an effort to prevent survivor space overflow (which can cause premature
promotion and nepotism). When this happens there is also an increase in
the promotion rate (which you can observe in your plot at a fine scale).
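
(As an aside, the knob that governs how aggressively the policy clamps
down is the target survivor occupancy. If you want to experiment with
giving the policy a bit more headroom before it tightens the threshold,
the relevant flag is, if I remember correctly, along the lines of:

  -XX:TargetSurvivorRatio=50

where 50 is, I believe, the default, i.e. the policy tries to keep
post-scavenge survivor occupancy at or below roughly 50% of the survivor
space. A somewhat higher value trades away some safety margin against
overflow in exchange for fewer threshold clamp-downs, so treat it as an
experiment rather than a recommendation.)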

My guess is that on your system some threads occasionally get stalled
or temporarily starved. When this happens, the objects being processed
by those threads have their lifetimes elongated, which causes a
temporary fall in the death rate. The threads presumably then get to
run, quickly do the processing that renders those objects dead, and
things return to normal. Either that, or there is some other burstiness
in the processing of objects that gives them these bursty dynamics. I
suspect that the "streaming" is not streaming as smoothly as you expect
and the streams may be getting occasionally stalled, so that is
something to check in your system.

As far as tuning GC to avoid any unintended performance consequences
from this, what you want to do is prevent survivor space overflow under
any circumstances, because overflow can lead to premature promotion,
which in turn can lead to nepotism, so that there is a sudden spike
in promotions, and hence in scavenge pause times, as well as more
pressure on the concurrent collector (just as you noticed). The increased
activity of the concurrent collector can in turn stall the streaming threads
some more. (Which makes me ask what your platform is -- how many CPUs --
and whether you have tried incremental CMS to see if it makes the
behaviour less bursty and smoother.)
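
(If you do want to try incremental CMS, the flags to experiment with on
1.5 are, if memory serves, something like:

  -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

i.e. the concurrent phases are done in small increments interleaved with
the application, with the duty cycle adjusted automatically, so the
collector's CPU use is spread more evenly and is less likely to starve
your mutator threads on a machine with few CPUs. Worth trying mainly if
you find you are CPU-constrained during concurrent collection cycles.)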

From the logs, I see that some age-1 objects (i.e. objects that have
been allocated since the last scavenge) are getting tenured. This
clearly indicates that the survivor spaces are under-sized for the
occasional heavier load (or stall of the processing threads).
I would recommend lowering the max tenuring threshold to 4
(this avoids the unnecessary copying of longer-lived objects that
you are seeing, which inflates scavenge pause times) and
increasing the survivor space size by decreasing the survivor
ratio to something like 4 (this would avoid the occasional
survivor space overflow, which can be very bad, as I described
above). To compensate for the smaller survivor ratio (so as
to leave the size of eden, and hence your inter-scavenge
period, unchanged), I would make a commensurate
increase in NewSize. To leave CMS cycles running
at the same rate as before, you may also need to add space
to the old generation (by increasing -Xmx and -Xms).
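
To make that concrete, here is a rough back-of-the-envelope using your
current settings (my arithmetic here is approximate; please double-check
the actual generation sizes against your PrintHeapAtGC output):

  current:    -XX:NewSize=1536M -XX:MaxNewSize=1536M -XX:SurvivorRatio=10
              => eden ~ 1536M * 10/12 = 1280M, each survivor ~ 128M
  suggested:  -XX:NewSize=1920M -XX:MaxNewSize=1920M -XX:SurvivorRatio=4
              -XX:MaxTenuringThreshold=4
              => eden ~ 1920M * 4/6 = 1280M (unchanged), each survivor ~ 320M

The young generation grows by about 384M, so to keep the old generation
(and hence the CMS cycle frequency) roughly the same, you would add that
much to the total heap as well, i.e. something like -Xms5554m -Xmx5554m
in place of your current 5170m.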

That's all for now from me, but I am sure others on the alias
will be able to provide further tips (or alternative theories for the
apparent burstiness in behaviour). [Note, however, that the tenuring
threshold oscillating from 16 to 1 and back to 16 is more or less
normal in this case, especially the transition from 16 to 1, given the
suddenness with which the new-object death rate falls -- it is this
suddenness that makes me suspect some kind of stall in
one thread, or a group of threads, that would normally be processing
these surviving objects. Do you have any application-level
queue length or other metrics that would tell you whether
something like this was happening?]

How many threads do you have in your application, how many
CPUs are on your platform, and do you observe any fluctuation in
the CPU consumed by various threads -- in particular, are the
downward spikes in tenuring threshold preceded by the stalling of
some heavily working threads in the application?  That's the
direction in which I would point your investigation.
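
A quick way to check on Linux, if you haven't already: watch per-thread
CPU with something like

  top -H -p <jvm pid>

and, when you see a downward spike in the tenuring threshold in the GC
log, take a thread dump at about the same time (kill -QUIT <pid>, or
jstack if it is available on your JDK/platform) and see whether the
normally busy threads are blocked or waiting rather than running.
Correlating the two by timestamp should tell you fairly quickly whether
the stall theory holds water.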

PS: I am attaching the survivor space occupancy percentage plot from
your most recent email, since it is a nice illustration of how the
reduction in tenuring threshold is still occasionally unable to avoid
survivor space overflow (to the extent of 1 in 9.5 scavenges on
average, according to a rough calculation).

-- ramki

----- Original Message -----
From: James Nichols <jamesnichols3 at gmail.com>
Date: Friday, January 11, 2008 11:11 am
Subject: Rapid fluctuation of tenuring threshold
To: hotspot-gc-use at openjdk.java.net


> Hello,
> 
> My application has a fairly high streaming dataflow that creates a lot
> of short-lived objects.  I've made the young generation fairly large
> and have a pretty big survivor space, but I still have a # of objects
> that end up in the old generation.  As a result of this, my old
> generation size ramps up slowly over time until these objects are
> dead, then they all get cleaned up and there is a big drop in old
> generation usage, from about 2.5GBs to about 1GB.  Most of the time it
> takes about 80 minutes to get up to 2.5GBs, but occasionally it
> happens much more rapidly, as fast as every 8 minutes.  My workload
> does fluctuate over time, but not enough to explain this change in
> garbage collection behavior.  I'm seeing some very odd behavior in the
> JVM's tuning of the tenuring threshold.
> 
> Attached is a chart that plots the old generation over time (the blue
> line, with a moving average in yellow).  Each dot is a garbage
> collection (usually the young generation).  On this chart, I also
> plotted the tenuring threshold in red, with a moving average in black.
> Each dot represents what the threshold was for that particular garbage
> collection.  You can see around time 11,500 minutes the peaks/valleys
> become much more frequent as more data is ending up in the old
> generation.  During this time, the moving average of the tenuring
> threshold drops substantially.  This also happens during other periods
> where the old generation fills up quickly.  I'm not sure if this is a
> cause or an effect.  I'm leaning towards the tenuring threshold being
> the cause, because the objects ending up in the old generation are
> typically reclaimed in a very short period of time.
> 
> I've analyzed the distribution of the tenuring threshold over a
> substantial period of time in my application.  It appears that 25% of
> the time the tenuring threshold is 1, 60% of the time it's 16, and the
> other 15% is distributed somewhere in the middle.  I'm a bit puzzled
> why the threshold would change so rapidly between 1 and 16, and wonder
> if I'm getting hosed by the auto-tuning of this threshold.  Looking at
> the log, it rapidly goes from 1 to 16, very rarely stopping at
> intermediate values.  Is it reasonable to think that something is
> wrong with the auto-tuning that makes it stay at 1 and causes the
> behavior I noted above?
> 
> I'm running JDK 1.5.0_12 on RedHat Linux and my application server is
> JBoss 4.0.5.  My GC settings are below... I can send the gc.dat log if
> you want it; it's pretty big, so I won't email it to the list.  I've
> signed up to the mailing list, but please CC me on any replies.
> 
> Thanks, James
> 
> -server -Xms5170m -Xmx5170m
> -XX:NewSize=1536M -XX:MaxNewSize=1536M
>  -XX:PermSize=512M -XX:MaxPermSize=512M
> -XX:MaxTenuringThreshold=30 -XX:SurvivorRatio=10
> -XX:+ScavengeBeforeFullGC
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC -XX:ParallelGCThreads=3
> -XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled
> -XX:+CMSPermGenSweepingEnabled -XX:+DisableExplicitGC
> -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
> -Xloggc:gc.dat


