Young generation configuration

Mon Sep 14 14:52:41 PDT 2009

Thanks for all the information, Ramki.

I had to lower my YG to 1G in order to reduce my typical YG GC to under
one second, and under 0.5 sec for many GCs.  I'm now playing with
initiating occupancy fraction settings to avoid the CMS failures I'm
getting.  But it's looking so much better.

Our app's login and logout produce truckloads of garbage, so figuring
out the right initiating occupancy fraction setting is a bit tricky.

Everything is definitely much clearer now.

Thanks!
Jeff

-----Original Message-----
From: Y.S.Ramakrishna at Sun.COM [mailto:Y.S.Ramakrishna at Sun.COM] 
Sent: Friday, September 11, 2009 6:20 PM
To: Jeff Lloyd
Cc: hotspot-gc-use at openjdk.java.net
Subject: Re: Young generation configuration

Hi Jeff --

On 09/11/09 14:06, jeff.lloyd at algorithmics.com wrote:
> Hi Ramki,
> 
> I did not know that lower pause times and higher throughput were
> generally incompatible.  Good to know - it makes sense too.
> 
> I'm trying to find out how long "too long" is.  Bankers can be fickle.
> :-)  Honestly, I think "too long" constitutes a noticeable pause in GUI
> interactions.

So, maybe around one 200 ms pause per second or so at the most?
(If you think that is not suitable, come up with a comparable figure.)
That would give us the requisite pause time budget and implicitly
define a GC overhead budget of 200/1000 = 20%. That is actually
quite high, but still lower than the overhead I saw in some
of your logs from a quick browse. As Tony pointed out, that overhead
comes from the excessive copying you were doing of relatively
long-lived data, which you may be better off tenuring more quickly
and letting the concurrent collector deal with. The caveat, per your
and Tony's earlier remarks, is the slightly increased pressure on
the concurrent collector (see below); that is probably unavoidable
if you are to meet your pause time goals.
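The budget arithmetic can be sketched as below, assuming pause durations have already been pulled out of a GC log by some other means. The class and method names are hypothetical, and the pause values in `main` are placeholders, not data from this thread. It only checks the overhead half of the budget; the per-pause limit (no single pause over 200 ms) is a separate comparison.

```java
/** Back-of-envelope GC budget arithmetic (hypothetical helper, not a JDK API). */
final class GcBudget {
    // Overhead ceiling implied by allowing pauseMs of GC pause per intervalMs of wall-clock time.
    static double overheadCeiling(double pauseMs, double intervalMs) {
        if (pauseMs < 0 || intervalMs <= 0) {
            throw new IllegalArgumentException("need pauseMs >= 0 and intervalMs > 0");
        }
        return pauseMs / intervalMs;
    }

    // Observed overhead: total pause time measured over an observation window.
    static double observedOverhead(double[] pausesMs, double windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        double total = 0;
        for (double p : pausesMs) {
            total += p;
        }
        return total / windowMs;
    }

    public static void main(String[] args) {
        double budget = overheadCeiling(200, 1000); // 200 ms per second -> 0.20
        // Hypothetical pauses (ms) read from a GC log over a 3 s window.
        double observed = observedOverhead(new double[] {150, 180, 120}, 3000);
        System.out.printf("budget %.0f%%, observed %.0f%%, within budget: %b%n",
                budget * 100, observed * 100, observed <= budget);
    }
}
```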

> 
> How did you measure the proportion of short-lived and medium-lived
> objects?

Oh, I was playing somewhat fast and loose. I was taking the ratio
of (age 1 survivors) : (Eden size) to get a rough read on the
short : (not short) split. I sampled a single GC from one of your log
files, but that would be the way to figure this out (while averaging
over a sufficiently large set of samples). Of course, "long" and
"short" are relative, and age 1 just tells you what survived of the
data allocated in the last GC epoch. If GCs happen more frequently,
less data would die and more would qualify as "not short" by that
kind of loose definition, so my "long" and "short" were relative to
the given GC period.
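Averaging that ratio over many scavenges might look like the sketch below. It assumes the classic HotSpot `-XX:+PrintTenuringDistribution` line format exactly as quoted later in this thread (JDK 9+ unified logging adds prefixes this regex will not match), and that Eden sizes are known separately, e.g. from the heap configuration. Class and method names are hypothetical; apart from the one age line copied from the thread, the sample values in `main` are placeholders.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShortLivedRatio {
    // One line of -XX:+PrintTenuringDistribution output,
    // e.g. "- age   1: 2151744736 bytes, 2151744736 total".
    private static final Pattern AGE_LINE =
            Pattern.compile("^-\\s*age\\s+(\\d+):\\s+(\\d+) bytes,\\s+(\\d+) total");

    // Age-1 survivor bytes from one scavenge's distribution, or -1 if no age-1 line is present.
    static long age1Bytes(List<String> distributionLines) {
        for (String line : distributionLines) {
            Matcher m = AGE_LINE.matcher(line.trim());
            if (m.find() && Integer.parseInt(m.group(1)) == 1) {
                return Long.parseLong(m.group(2));
            }
        }
        return -1;
    }

    // Mean of (age-1 survivors / Eden size) over several scavenges: a rough
    // "not short-lived" fraction, relative to the current GC period.
    static double meanNotShortFraction(long[] age1, long[] edenBytes) {
        if (age1.length == 0 || age1.length != edenBytes.length) {
            throw new IllegalArgumentException("need equal, non-empty sample arrays");
        }
        double sum = 0;
        for (int i = 0; i < age1.length; i++) {
            sum += (double) age1[i] / edenBytes[i];
        }
        return sum / age1.length;
    }

    public static void main(String[] args) {
        // The age line is from the thread; the other samples and all Eden sizes are hypothetical.
        long a1 = age1Bytes(Arrays.asList("- age   1: 2151744736 bytes, 2151744736 total"));
        long[] age1Samples = {a1, 1_900_000_000L, 2_300_000_000L};
        long[] edenSamples = {12_000_000_000L, 12_000_000_000L, 12_000_000_000L};
        System.out.printf("approx. not-short fraction: %.3f%n",
                meanNotShortFraction(age1Samples, edenSamples));
    }
}
```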

> 
> We typically expect a "session" to be live for most of the day, and

How much typical session data do you have? What is the rate at
which sessions get created? Does this happen perhaps mostly at the
start of the day? (In which case you would see lots of promotion
activity at the start of the day, but not so much later in the day?)
Or is the session creation rate uniform through the typical day?

> multiple reports of seconds or minutes in duration executed within that
> session.  So yes, I am seeing my "steady state" continue for a long

Let's say 1 minute. So during that 1 minute, how much data do you
produce, and of that, how much needs to be saved into the session
in the form of the "result" from that report? It looks like that
result would constitute data that you want to tenure sooner
rather than later. Depending on how long the intermediate
results used to generate the final result are needed (you
mentioned large trees of intermediate objects, I think, in an
earlier email), you may want to copy them through the survivor
spaces or, if that data is so large as to cost excessive
copying time, just promote that too. Luckily, in typical
cases, if data wants to be large, it also wants to live
long.

> time, with blips of activity throughout the day.  We cache a lot of
> results, which can lead to a general upward trend, but it doesn't seem
> to be our current source of object volume.

The cached data will tenure. Best to tenure it soon, if the
proportion of cached data is large. (I am guessing that
if you cache, you probably find it saves computation later --
so it also saves allocation later; thus I might naively
expect that you will initially tenure lots of data as your
caches fill, and later in steady state tenure less as well
as perhaps allocate less.)

If I look at one random tenuring distribution sample out of your
logs, I see:

- age   1: 2151744736 bytes, 2151744736 total
- age   2:  897330448 bytes, 3049075184 total
- age   3: 1274314280 bytes, 4323389464 total
- age   4: 1351603024 bytes, 5674992488 total
- age   5: 1529394376 bytes, 7204386864 total
- age   6: 1219001160 bytes, 8423388024 total

which is very flat, indicating that anything that survives
a scavenge appears to live on for quite a while (lots of
assumptions about steady loads and so on). Experimenting
with an MTT (MaxTenuringThreshold) of 1 or 2 might be useful, cf.
your previous emails with Tony et al. (Yes, you will want to increase
your OG size, as you noted, but no, it will not fill up much faster:
the rate at which you promote will be nearly the same, because most
data that survives a single scavenge here tends to live, as above,
for at least 6 scavenges, after which it promotes anyway. You are
just promoting that same data a bit sooner without wasting effort
copying it back and forth. It is true that some small amount
of intermediate data will promote, but that's probably OK.)
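For many samples, reading distributions like the one above by eye gets tedious; a small parser can do the bookkeeping. This is a minimal sketch, assuming one scavenge's worth of lines in the classic `-XX:+PrintTenuringDistribution` format quoted above (unified logging in JDK 9+ prints differently). The names are hypothetical. It checks each running total, and reports the bytes at age 2 and older, i.e. survivors that have already been copied at least twice and that a lower MTT would have promoted earlier instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TenuringDistribution {
    private static final Pattern AGE_LINE =
            Pattern.compile("^-\\s*age\\s+(\\d+):\\s+(\\d+) bytes,\\s+(\\d+) total");

    // The sample quoted in the thread.
    static final List<String> SAMPLE = Arrays.asList(
            "- age   1: 2151744736 bytes, 2151744736 total",
            "- age   2:  897330448 bytes, 3049075184 total",
            "- age   3: 1274314280 bytes, 4323389464 total",
            "- age   4: 1351603024 bytes, 5674992488 total",
            "- age   5: 1529394376 bytes, 7204386864 total",
            "- age   6: 1219001160 bytes, 8423388024 total");

    // Parses ONE scavenge's distribution into age -> bytes, verifying each running total.
    static SortedMap<Integer, Long> parse(List<String> lines) {
        SortedMap<Integer, Long> perAge = new TreeMap<>();
        long running = 0;
        for (String line : lines) {
            Matcher m = AGE_LINE.matcher(line.trim());
            if (!m.find()) {
                continue; // not an age line
            }
            long bytes = Long.parseLong(m.group(2));
            running += bytes;
            if (running != Long.parseLong(m.group(3))) {
                throw new IllegalArgumentException("running total mismatch at: " + line);
            }
            perAge.put(Integer.parseInt(m.group(1)), bytes);
        }
        return perAge;
    }

    // Bytes at age >= 2: copied at least twice (out of Eden, then between survivor spaces).
    static long bytesAtAgeTwoOrOlder(SortedMap<Integer, Long> perAge) {
        long sum = 0;
        for (long b : perAge.tailMap(2).values()) {
            sum += b;
        }
        return sum;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Long> perAge = parse(SAMPLE);
        long age1 = perAge.get(1);
        for (Map.Entry<Integer, Long> e : perAge.entrySet()) {
            System.out.printf("age %d: %5.1f%% of age 1%n", e.getKey(), 100.0 * e.getValue() / age1);
        }
        System.out.println("bytes at age >= 2: " + bytesAtAgeTwoOrOlder(perAge));
    }
}
```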

You will then want to play with the initiating occupancy fraction
once you get an idea of the rate at which the old gen fills
up, versus the rate at which CMS is able to collect it, versus
the effect on scavenges of letting the CMS gen fill up more
before collecting, versus the effect of doing more frequent
or less frequent CMS cycles (and its effect on mutator throughput
and available CPU and memory bandwidth).
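One way to get a first number before experimenting is a headroom check: the old gen must have room for everything promoted while a CMS cycle is running. The sketch below is a simplifying heuristic, not how HotSpot decides when to start CMS. It ignores fragmentation, floating garbage, and bursty promotion such as login/logout spikes, and the inputs in `main` are hypothetical; real promotion rates and cycle durations would come from your GC logs. The resulting percentage is what one would try with `-XX:CMSInitiatingOccupancyFraction=N`, usually together with `-XX:+UseCMSInitiatingOccupancyOnly` so the JVM does not adapt the threshold on its own.

```java
final class InitiatingOccupancy {
    // Largest initiating occupancy (whole percent of old-gen capacity) that still leaves
    // room for the bytes expected to be promoted while one CMS cycle runs, inflated by
    // safetyMargin (0.25 = 25% extra). Clamped to [0, 100]; 0 means the expected
    // promotions during one cycle exceed the entire old gen.
    static int maxInitiatingPercent(double oldGenBytes, double promotionBytesPerSec,
                                    double cmsCycleSec, double safetyMargin) {
        if (oldGenBytes <= 0 || promotionBytesPerSec < 0 || cmsCycleSec < 0 || safetyMargin < 0) {
            throw new IllegalArgumentException("old gen must be positive, other inputs non-negative");
        }
        double headroom = promotionBytesPerSec * cmsCycleSec * (1.0 + safetyMargin);
        double fraction = Math.max(0.0, Math.min(1.0, 1.0 - headroom / oldGenBytes));
        return (int) Math.floor(fraction * 100.0);
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 8 GiB old gen, 50 MiB/s promotion, 20 s CMS cycle, 25% margin.
        int pct = maxInitiatingPercent(8L << 30, 50L << 20, 20, 0.25);
        System.out.println("-XX:CMSInitiatingOccupancyFraction=" + pct);
    }
}
```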

Yes, as Paul noted, definitely -XX:+UseCompressedOops to relieve
heap pressure (reduce GC overhead) and speed up mutators
by improving cache efficiency.
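After changing flags, it can help to confirm what the running JVM actually uses. A minimal sketch, assuming a 64-bit HotSpot JVM on Java 7 or later (`ManagementFactory.getPlatformMXBean` did not exist in the JDK 6 era of this thread) and the HotSpot-specific `com.sun.management.HotSpotDiagnosticMXBean`. The class name is hypothetical. The CMS flags could be queried the same way on JDKs that still ship CMS (it was removed in JDK 14); asking for an option the VM does not have throws `IllegalArgumentException`. Note also that compressed oops only apply when the maximum heap is below roughly 32 GB.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class GcFlagCheck {
    // Current value of a HotSpot VM option as a string, e.g. "true" or "6".
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UseCompressedOops", "MaxTenuringThreshold"}) {
            System.out.println(name + " = " + vmOption(name));
        }
    }
}
```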

-- ramki
