G1 performance

Jon Masamitsu jon.masamitsu at oracle.com
Fri May 10 22:20:28 UTC 2013


On 5/9/2013 10:56 AM, Srinivas Ramakrishna wrote:
> On Thu, May 9, 2013 at 8:09 AM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>> On 5/9/13 12:22 AM, Srinivas Ramakrishna wrote:
> ...
>>> As regards ParallelGC's adaptive sizing, I have recently found that
>>> the filter that maintains the estimate of the promotion volume (which
>>> serves, among other things, as input to the decision whether a
>>> scavenge might fail) produces estimates that are far too pessimistic
>>> and, in a heap-constrained situation, can end up locking the JVM into
>>> what appears to be a self-sustaining suboptimal state from which it
>>> will not easily be dislodged.  A smallish tweak to the filter will
>>> produce much better behaviour.  I will submit a patch when I have had
>>> a chance to test its performance.
>>
>> Ok. You have my attention.  What's the nature of the tweak?
> :-)
>
> The padded promotion average calculation adds an absolute-deviation
> estimate to the mean estimate.  There are three issues:
> . the absolute deviation is (naturally) positive even when the raw
>    deviation is negative (i.e. when promotion volume is declining
>    relative to the prediction), which increases the padding, which
>    increases the deviation, which ....  A fix is to use a deviation of
>    0 when it's negative (the case above).  This reduces the pessimism
>    in the prediction.

That seems right.
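[A minimal sketch of the clamping tweak described above, loosely modeled
on HotSpot's padded-average filter; the class and parameter names here
are illustrative, not the actual HotSpot implementation:]

```java
// Exponentially weighted "padded average": the padded value is the mean
// estimate plus `padding` times a smoothed deviation estimate.
class PaddedAverage {
    private final double weight;   // weight given to each new sample (0..1)
    private final double padding;  // number of deviations to pad with
    private double average;
    private double deviation;

    PaddedAverage(double weight, double padding) {
        this.weight = weight;
        this.padding = padding;
    }

    void sample(double value) {
        double delta = value - average;
        average += weight * delta;
        // Proposed tweak: treat a negative deviation (sample below the
        // prediction) as 0, so a declining promotion volume no longer
        // inflates the pad.  The original behaviour would use
        // Math.abs(delta) here.
        double dev = Math.max(delta, 0.0);
        deviation += weight * (dev - deviation);
    }

    double paddedAverage() {
        return average + padding * deviation;
    }
}
```

With the clamp, a spike followed by declining samples lets the padded
estimate come back down instead of feeding on its own deviation.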

> . the filter gives much more weight to the historical estimate than to
> the current value.  This can be good since it prevents oscillation or
>    over-reaction to small short-term changes, acting as a low-pass
> filter, but the weighting is so skewed that the reaction becomes
> extremely sluggish.  Tuning the
>    weight to favour past and present equally seems to work well (of
> course this needs to be tested over a range of benchmarks to see if
> that
>    value is generally good).
GC ergo has always been sold (at least by me) as a solution for
applications at a steady state.  Response to changing application
behavior was secondary.  I can see adding some flexibility for
quicker reactions but I'd prefer it based on some knobs.
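[The sluggishness being discussed can be seen in the step response of an
exponentially weighted moving average; the weights below are made-up
illustration values, not HotSpot's actual defaults:]

```java
// Compare how quickly two EWMA weightings track a step change in the
// input signal (e.g. a sudden jump in promotion volume).
class FilterResponse {
    // Run an EWMA with the given new-sample weight against a constant
    // step input of `step` for `n` samples, starting from 0.
    static double ewmaAfterStep(double newSampleWeight, double step, int n) {
        double estimate = 0.0;
        for (int i = 0; i < n; i++) {
            estimate += newSampleWeight * (step - estimate);
        }
        return estimate;
    }

    public static void main(String[] args) {
        // Heavily history-weighted: after 5 samples of a step to 100,
        // the estimate has only reached ~41.
        System.out.println(ewmaAfterStep(0.1, 100.0, 5));
        // Past and present weighted equally: the estimate reaches ~97.
        System.out.println(ewmaAfterStep(0.5, 100.0, 5));
    }
}
```

The trade-off Jon raises is visible here too: the 0.5 weighting that
tracks the step quickly also passes more short-term noise through.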

> . when we get stuck in a back-to-back full GC state (because of the
> above two, but mostly because of the first reason), I think there is
> no new
>    promotion estimate telemetry coming into the filter.  This leaves
> the output of the filter locked in the past.  We need to either decay
> the filter output
>    at each GC event (major or minor), or better, just feed it a
> promotion volume estimate computed from comparing the old gen sizes
> before and after
>    the collection (with a floor at 0).

At some point not too distant in the future we're going to make a change
so that the young gen will change at a full GC (currently only the old gen
changes).  This may not address this problem directly but we should see
what it does.
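[The fallback sample Ramki proposes above could be derived as follows;
this is a hypothetical helper, not code from HotSpot:]

```java
// Derive a promotion-volume sample from old-gen occupancy measured
// around a collection, floored at zero as proposed above.
class PromotionSample {
    static long fromOldGenDelta(long oldUsedBefore, long oldUsedAfter) {
        // A full GC shrinks the old gen, which would otherwise yield a
        // negative "promotion" volume; clamp it to zero so the filter
        // keeps receiving fresh (non-stale) telemetry at every GC event.
        return Math.max(0L, oldUsedAfter - oldUsedBefore);
    }
}
```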

>
> I ran a simulation of a range of filters in Excel with the same inputs
> as the telemetric signal we were getting and plotted the outputs.
> I need to recheck the data (as my Excel skills are less than stellar),
> but I am attaching a preliminary plot. (I'll make sure to recheck and
> attach a checked plot to the bug that I am planning to submit once we
> have done some more testing here with the proposed changes.)
> The first is a "long-term" plot of the outputs; the second zooms into
> the first few moments after the spike and its decline.
>
> The outputs are for a series of changes I made to the filter (which
> I'll explain in the bug that I file), showing the effect of each
> change on the filter's response.

I'll wait for the details in the CR before commenting.

Thanks for all the work on this.

Jon

>
> -- ramki



