Rapid fluctuation of tenuring threshold

James Nichols jamesnichols3 at gmail.com
Tue Jan 15 10:27:46 PST 2008


Hello Ramki,

Thank you very much for your response.  Your insights are very useful.

I am actively digging into the system behavior during these periods when the
death rate falls suddenly.  I have several hundred threads in the
application.  200 or so of these threads make outbound webservice calls
every 5 minutes or so.   I suspect that something is going on with these
threads which is causing these issues.  Many of these remote web service
endpoints are on the same network, so it's likely that something is occurring
which stalls many threads at the same time.  I don't have a very good
application-layer metric to track this, but am working to implement one.  I
have a 4CPU machine running the application and haven't tried running the
incremental collector.  The CPU usage on the machine isn't very high, as
most of the threads are network bound, but I do see occasional spikes.  I
haven't had much luck correlating these spikes with the promotion issues,
but will try.

I've been testing out the new survivor-space-related tunings, but it will be
several days until I'm ready to deploy these settings to the production
environment.  I will report back if they produce better results.  Initial
results in a test environment look very promising.

Thanks,
Jim


On 1/11/08, Y Srinivas Ramakrishna <Y.S.Ramakrishna at sun.com> wrote:
>
>
> Hi James -- thanks for all of the data; the graphs as well as the logs.
> Basically, the adaptive tenuring policy (for fixed survivor spaces)
> appears to be working pretty well here. When there is space in the
> survivor spaces to accommodate the survivors, the threshold is loosened,
> and as soon as there is a spike in the survivors the threshold is
> tightened down. Interestingly, what appears to be happening is that your
> program occasionally goes through what I shall call "survival spikes",
> where the death rate of very recently allocated objects falls sharply.
> (You see this as a sudden spike in the population of age 1 objects in the
> PrintTenuringDistribution output.) However, this quickly passes and these
> objects then die fairly quickly thereafter. When the death rate falls
> suddenly in this manner, you see the adaptive tenuring policy clamp down
> the threshold in an effort to prevent survivor space overflow (which can
> cause premature promotion and nepotism). When this happens there is also
> an increase in promotion rate (which you can observe in your plot at a
> fine scale).
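>
> (For reference, the computation at each scavenge is roughly as follows --
> this is illustrative Java-style pseudocode, not the actual HotSpot source,
> and it assumes the default TargetSurvivorRatio of 50: walk the age table
> from the youngest age up, and clamp at the first age at which the
> accumulated survivor bytes would overflow the target fraction of one
> survivor space.
>
>   class TenuringThresholdSketch {
>       // bytesSurvivingAtAge[a] = bytes of live objects of age a at this scavenge
>       static int computeTenuringThreshold(long survivorCapacityBytes,
>                                           long[] bytesSurvivingAtAge,
>                                           int maxTenuringThreshold) {
>           long desired = survivorCapacityBytes * 50 / 100;  // TargetSurvivorRatio=50
>           long total = 0;
>           int age = 1;
>           while (age < bytesSurvivingAtAge.length) {
>               total += bytesSurvivingAtAge[age];
>               if (total > desired) break;  // would overflow the target: clamp here
>               age++;
>           }
>           return Math.min(age, maxTenuringThreshold);
>       }
>   }
>
> A sudden jump in the age-1 bucket therefore drives the computed threshold
> straight down toward 1, and once the spike passes it climbs back up toward
> the maximum.)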
>
> My guess is that you have a system on which occasionally some threads
> are getting stalled or are getting temporarily starved. When this
> happens the objects being processed by these threads tend to have their
> lifetimes elongated and this causes a temporary fall in the death rate.
> These threads probably then get to execute and quickly do the
> processing that makes these objects dead and things then return to
> normal. Either this, or there is some other burstiness in the processing
> of objects that gives them these bursty dynamics. I suspect that the
> "streaming" is not streaming as well as you expect and that the
> streams may occasionally be getting stalled. So that's something
> you want to check in your system.
>
> As far as tuning GC so as to avoid any unintended performance consequences
> from this, what you want to do is to prevent survivor space overflow under
> any circumstances, because overflow can lead to premature promotion, and
> that in turn can lead to nepotism, so that there will be a sudden spike
> in promotions, and thence in scavenge pause times, as well as more
> pressure on the concurrent collector (just as you noticed). The increased
> activity of the concurrent collector can in turn stall the streaming
> threads some more. (Which makes me ask what your platform is -- #cpus --
> and whether you have tried incremental CMS to see if it makes the
> behaviour less bursty and smoother.)
>
> From the logs, I see that some age 1 objects (i.e. objects that have been
> allocated since the last scavenge) are getting tenured. This clearly
> indicates that the survivor spaces are under-sized to deal with the
> occasional heavier load (or stall of the processing threads).
> I would recommend lowering the max tenuring threshold to 4
> (this avoids the unnecessary copying of long-lived objects that
> you see, which increases scavenge pause times) and
> increasing the survivor space size by decreasing the survivor
> ratio to something like 4 (this would avoid the occasional
> survivor space overflow, which can be very bad, as I described
> above). In order to compensate for the smaller survivor ratio (so as
> to leave the size of Eden, and hence your inter-scavenge period,
> unchanged), I would make a commensurate
> increase in NewSize. In order to leave CMS cycles running
> at the same rate as before, you may also need to add space
> appropriately to the old generation (by increasing -Xmx and -Xms).
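>
> (To make that concrete -- purely as an illustration, not a prescription --
> your current -XX:NewSize=1536M with -XX:SurvivorRatio=10 gives each
> survivor space 1536/(10+2) = 128M and Eden 1280M. Keeping Eden at 1280M
> with a survivor ratio of 4 would mean survivor spaces of 1280/4 = 320M
> each and a NewSize of 1280 + 2*320 = 1920M; bumping -Xms/-Xmx by the same
> 384M keeps the old generation size unchanged. Something along these
> lines, with the exact numbers to be validated in your test environment:
>
>   -server -Xms5554m -Xmx5554m
>   -XX:NewSize=1920M -XX:MaxNewSize=1920M
>   -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=4
>
> with the rest of your flags left as they are.)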
>
> That's all for now from me, but I am sure others on the alias
> will be able to provide further tips (or alternate theories for the
> apparent burstiness in behaviour). [Note however that the tenuring
> threshold oscillating from 16 to 1 and back to 16 is more or less
> normal in this case, especially the transition from 16 to 1 given the
> suddenness with which the new object death rate falls -- it is this
> suddenness that makes me suspect some kind of a stall in
> one or a group of threads that would normally be processing
> these surviving objects. Do you have any application level
> queue length or other metrics that would tell you whether
> something like this was happening?]
>
> How many threads do you have in your application, how many
> CPUs on your platform, and do you observe any fluctuation in
> the CPU consumed by various threads -- in particular, are the
> downward spikes in tenuring threshold preceded by stalling of
> some heavily working threads in the application?  That's the
> direction in which I would point your investigation.
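>
> (If you don't already have such a metric, one low-tech option -- just a
> sketch of the idea using the standard java.lang.management API, not
> something tuned for your setup -- is to run a daemon thread inside the
> application that samples per-thread CPU time every few seconds and logs
> the delta; a normally busy thread that suddenly shows a near-zero delta
> around the time of a tenuring-threshold dip would support the stall
> theory:
>
>   import java.lang.management.ManagementFactory;
>   import java.lang.management.ThreadInfo;
>   import java.lang.management.ThreadMXBean;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   // Run inside the application as a daemon thread: every 10 seconds,
>   // print how much CPU each thread used since the previous sample, so a
>   // normally-busy thread that stalls shows up as a near-zero delta.
>   public class ThreadCpuSampler implements Runnable {
>       public void run() {
>           ThreadMXBean mx = ManagementFactory.getThreadMXBean();
>           Map<Long, Long> last = new HashMap<Long, Long>();
>           try {
>               while (true) {
>                   for (long id : mx.getAllThreadIds()) {
>                       long cpu = mx.getThreadCpuTime(id); // nanos; -1 if unavailable
>                       ThreadInfo info = mx.getThreadInfo(id);
>                       if (cpu < 0 || info == null) continue;
>                       Long prev = last.put(id, cpu);
>                       long deltaMs = (prev == null) ? 0 : (cpu - prev) / 1000000;
>                       System.out.println(info.getThreadName() + " cpuDeltaMs=" + deltaMs);
>                   }
>                   Thread.sleep(10000);
>               }
>           } catch (InterruptedException e) {
>               // stop sampling when interrupted
>           }
>       }
>   }
>
> started via new Thread(new ThreadCpuSampler()) with setDaemon(true)
> called before start().)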
>
> PS: I am attaching the survivor space occupancy percentage plot
> from your most recent email, since
> it is a nice illustration of how the reduction in tenuring threshold
> is still occasionally unable to avoid survivor space overflow
> (to the extent of 1 in 9.5 scavenges on average, according to
> a rough calculation).
>
> -- ramki
>
> ----- Original Message -----
> From: James Nichols <jamesnichols3 at gmail.com>
> Date: Friday, January 11, 2008 11:11 am
> Subject: Rapid fluctuation of tenuring threshold
> To: hotspot-gc-use at openjdk.java.net
>
>
> > Hello,
> >
> > My application has a fairly high streaming dataflow that creates a lot
> > of short-lived objects.  I've made the young generation fairly large and
> > have a pretty big survivor space, but I still have a number of objects
> > that end up in the old generation.  As a result of this, my old
> > generation size ramps up slowly over time until these objects are dead,
> > then they all get cleaned up and there is a big drop in old generation
> > usage, from about 2.5GB to about 1GB.  Most of the time it takes about
> > 80 minutes to get up to 2.5GB, but occasionally it happens much more
> > rapidly, as fast as every 8 minutes.  My workload does fluctuate over
> > time, but not enough to explain this change in garbage collection
> > behavior.  I'm seeing some very odd behavior in the JVM's tuning of the
> > tenuring threshold.
> >
> > Attached is a chart that plots the old generation over time (the blue
> > line, with a moving average in yellow).  Each dot is a garbage
> > collection (usually of the young generation).  On this chart, I also
> > plotted the tenuring threshold in red, with a moving average in black.
> > Each dot represents what the threshold was for that particular garbage
> > collection.  You can see that around time 11,500 minutes the
> > peaks/valleys become much more frequent as more data ends up in the old
> > generation.  During this time, the moving average of the tenuring
> > threshold drops substantially.  This also happens during other periods
> > where the old generation fills up quickly.  I'm not sure if this is a
> > cause or an effect.  I'm leaning towards the tenuring threshold being
> > the cause, because the objects ending up in the old generation are
> > typically reclaimed in a very short period of time.
> >
> > I've analyzed the distribution of the tenuring threshold over a
> > substantial period of time in my application.  It appears that 25% of
> > the time the tenuring threshold is 1, 60% of the time it's 16, and the
> > other 15% is distributed somewhere in the middle.  I'm a bit puzzled why
> > the threshold would change so rapidly between 1 and 16, and wonder if
> > I'm getting hosed by the auto-tuning of this threshold.  Looking at the
> > log, it rapidly goes from 1 to 16, very rarely stopping in between.  Is
> > it reasonable to think that something is wrong with the auto-tuning that
> > makes it stay at 1 and causes the behavior I noted above?
> >
> > I'm running JDK 1.5.0_12 on Red Hat Linux and my application server is
> > JBoss 4.0.5.  My GC settings are below.  I can send the gc.dat log if
> > you want it; it's pretty big, so I won't email it to the list.  I've
> > signed up to the mailing list, but please CC me on any replies.
> >
> > Thanks, James
> >
> > -server -Xms5170m -Xmx5170m
> > -XX:NewSize=1536M -XX:MaxNewSize=1536M
> > -XX:PermSize=512M -XX:MaxPermSize=512M
> > -XX:MaxTenuringThreshold=30 -XX:SurvivorRatio=10
> > -XX:+ScavengeBeforeFullGC
> > -XX:+UseConcMarkSweepGC
> > -XX:+UseParNewGC -XX:ParallelGCThreads=3
> > -XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled
> > -XX:+CMSPermGenSweepingEnabled -XX:+DisableExplicitGC
> > -XX:+PrintTenuringDistribution -XX:+PrintGCDetails
> > -XX:+PrintGCTimeStamps
> > -XX:+PrintHeapAtGC
> > -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
> > -Xloggc:gc.dat
>

