Excessive concurrent-abortable-preclean wall time?
Y.S.Ramakrishna at Sun.COM
Thu Oct 15 22:20:07 UTC 2009
Hi Jacob --
On 10/15/09 13:43, Eagle Kessler wrote:
...
> Looking at the logs, it seemed that each young generation caught
> around 1MB of live transient data and promoted it, and those
> promotions eventually caused enough fragmentation to force a
> compacting collection. The old generation was definitely not growing
> between collections or anything like that, but fragmentation as the
> cause is a guess on my part.
From your description I would tend to agree that that looks
like a likely explanation.
>
> We asked ops to add -XX:MaxTenuringThreshold=2 and
> -XX:SurvivorRatio=128, in the hopes that the addition of the tenuring
> threshold would prevent the live transient data from being promoted.
Of course, you sized SurvivorRatio so that all of that transient data
(and then some) fitted in the survivor spaces (as your tenuring distribution
data below showed).
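(For concreteness -- and assuming the young gen capacity implied by
the initial-mark line below, i.e. 3137664K total heap minus 2097152K
old gen, counted as eden plus a single survivor space -- a
SurvivorRatio of 128 would give survivor spaces of roughly

    (3137664K - 2097152K) / (128 + 1) = 1040512K / 129 ~= 8066K, i.e. ~8MB each,

which leaves plenty of headroom for the ~1MB of transient data you see
surviving each scavenge over the two ages allowed by
MaxTenuringThreshold=2.)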
> When that change was applied, though, we began seeing constant, and
> extremely poorly-performing, CMS collections:
>
> 778.842: [GC [1 CMS-initial-mark: 1403910K(2097152K)]
> 1955787K(3137664K), 0.3734370 secs]
The occupancy of the old gen here is ~67% (1403910K of 2097152K),
which is quite close to the occupancy of the old gen following
a collection (i.e. your program's "footprint").
For some reason CMS seems to have decided
that 67% is the right level at which to initiate a
CMS cycle, so you started seeing these back-to-back
cycles.
I am not sure why CMS ergonomics decided that that was
the right triggering level (clearly you have managed to
run successfully at a higher, manually set initiating
occupancy fraction of 85%, so perhaps the ergonomic trigger
is being overly conservative). My guess is that for some
reason the CMS ergonomics is doing the wrong thing here.
Roughly speaking, at a high level, here's how it works:
(1) it looks at how much space is free in the old gen at
    the end of each scavenge;
(2) it tracks roughly what our recent promotion rate has been;
(3) based on these two numbers, it estimates roughly how long
    we are likely to last before we run out of space in the old
    gen (possibly with some safety factor applied);
(4) it compares that estimate with how long our recent CMS cycles
    have lasted, and kicks off a CMS collection early enough to
    finish in time, or as soon as it finds that the
    CMSInitiatingOccupancyFraction setting has been exceeded.
(I believe that fraction defaults to 92% in 5u14; so I must
conclude that it is not CMSInitiatingOccupancyFraction but the
ergonomic triggering logic described above that is causing
collections to kick off too early.)
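To make the above concrete, here is a back-of-the-envelope sketch of
that decision in Java. This is emphatically not the actual HotSpot
code (that lives in the C++ collector sources); the field names, the
safety factor, and the units are invented purely for illustration:

    // Illustrative sketch only -- not the HotSpot implementation.
    class CmsTriggerSketch {
        double oldGenCapacityKB;           // total old gen capacity
        double oldGenFreeAfterScavengeKB;  // (1) free space in old gen after a scavenge
        double promotionRateKBPerSec;      // (2) recent promotion rate into the old gen
        double recentCmsCycleSecs;         // (4) duration of recent CMS cycles
        double safetyFactor = 1.1;         // hypothetical padding on the estimate
        double initiatingOccupancy = 0.92; // default CMSInitiatingOccupancyFraction (5u14)

        boolean shouldStartCmsCycle() {
            // (3) estimate how long until the old gen runs out of space
            double secsUntilOldGenFull =
                oldGenFreeAfterScavengeKB / promotionRateKBPerSec;
            // would a cycle started now finish before the old gen fills up?
            boolean mightRunOut =
                secsUntilOldGenFull <= recentCmsCycleSecs * safetyFactor;
            // has the initiating occupancy fraction been exceeded?
            double occupancy =
                (oldGenCapacityKB - oldGenFreeAfterScavengeKB) / oldGenCapacityKB;
            return mightRunOut || occupancy > initiatingOccupancy;
        }
    }

If the estimate in step (3) turns out pessimistic for whatever reason,
a cycle can fire too eagerly, which would produce exactly the sort of
back-to-back cycles you observed.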
The workaround you used, of explicitly setting CMSInitiatingOccupancyFraction
and asking to use that and only that as the trigger for CMS collections
via -XX:+UseCMSInitiatingOccupancyOnly (thus short-circuiting the
ergonomic trigger), was indeed the right thing to do here.
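In other words, something along the lines of the following (using the
85% figure you mentioned; adjust to taste):

    -XX:CMSInitiatingOccupancyFraction=85 -XX:+UseCMSInitiatingOccupancyOnly

With +UseCMSInitiatingOccupancyOnly in effect, CMS cycles start when
(and only when) old gen occupancy crosses that fraction, and the
ergonomic estimate sketched above is no longer consulted.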
Unless someone else on this list knows the reason for this
behaviour or has other suggestions, I'd suggest the following:
(a) try the latest 5uXX (5u21, I believe), although it's unlikely its
behaviour will be any different. (Aside: note that
5uXX is approaching EOSL at the end of this month, so make sure
to check http://www.sun.com/software/javaforbusiness/support.jsp
regarding support of that product going forward.)
(b) try the latest 6uXX (6u16), and the latest HotSpot (hs17), to
see if the behaviour is similar. If the behaviour with hs17 etc.
is good, we are all happy. If not, file a bug as in (c) below.
(c) file a bug if this behaviour is reproducible with (b) above,
or with (a) if you have appropriate formal support.
As regards:
> 2) Why did CMS-concurrent-abortable-preclean begin taking so much wall
> time when CMS was running constantly?
A measure of the efficacy of that phase is whether it succeeds in
its objective of placing the CMS remark pause roughly mid-way between
scavenges -- this reduces the occurrence of (possibly back-to-back)
STW pauses and also tends to balance the work done at the CMS remark.
In your log the remark pauses all seem to be scheduled pretty well,
occurring when the young gen is roughly 50% full. The long abortable-preclean
phase simply reflects the long inter-scavenge durations seen in this log.
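(If I recall correctly, that ~50% target corresponds to the
-XX:CMSScheduleRemarkEdenPenetration setting, and the abortable preclean
phase gives up waiting for a scavenge after -XX:CMSMaxAbortablePrecleanTime
milliseconds; you should not normally need to touch either of these.)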
As the following indicates:
> 2409.982: [CMS-concurrent-abortable-preclean: 13.555/159.013 secs]
CMS precleaning here runs at a duty cycle of less than 10% (about
13.555 seconds of concurrent work spread over 159.013 seconds of
elapsed time, i.e. roughly 8.5%), so it's something that you can
basically ignore. (I am curious, though, about what you meant when
you said that the new settings made the long abortable preclean go
away; the real test of the efficacy of that phase is how well it
spaces apart the STW pauses and whether it's able to keep the
CMS remark pauses under control.)
all the best!
-- ramki
_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use