<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
Yes, I certainly agree that this is just fixing the symptom
short-term. I'd be happy if somebody else would be interested in
picking it up to look at and address the real problem.<br>
<br>
Cheers,<br>
Mikael<br>
<br>
On 2012-04-25 01:36, Srinivas Ramakrishna wrote:
<blockquote
cite="mid:CABzyjym2hCsyuayckxDGK7Z+r69coYy1BytdDrX2CrQFcAWxgw@mail.gmail.com"
type="cite">May be you could deal with high initial variance of a
smaller data set by capping the variance<br>
(or using a "gentler surrogate" such as average absolute
difference) in the padding calculations<br>
until the sample population is sufficiently large (say 10 or so...
100 seems too much to wait for,<br>
even for skewed, non-normal statistics). I know, what is one
ad-hoc'ism versus another here, and<br>
I agree that any such change would have to be empirically tested
for stability and performance etc.<br>
<br>
My remark about this specific fix merely papering over a minor
symptom while the<br>
larger problem of counter overflow remaining in other paces in the
code still stands i think,<br>
and fixing that issue once and for all would likely fix the
symptom that Mikael is trying<br>
to fix here.<br>
<br>
But I haven't looked at all of this code carefully in a long time,
so I'll let you folks deal with this<br>
as you think is correct in the long-term and expedient in the
short-term.<br>
<br>
-- ramki<br>
<br>
<div class="gmail_quote">On Wed, Apr 25, 2012 at 12:44 AM, Jon
Masamitsu <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:jon.masamitsu@oracle.com">jon.masamitsu@oracle.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Ramki,<br>
<br>
I'm having to reconstruct how I arrived at this code (and<br>
the reconstruction may be faulty) but I think the problem<br>
was that at VM initialization there was sometimes<br>
more variance in the pause times and I didn't want<br>
to be weighted toward a pause time that was atypically<br>
smaller or larger. I know I had to do extra work to get the<br>
heap to grow quickly at the beginning in order to<br>
achieve throughput that was close to the throughput<br>
with a tuned heap. This was on client benchmarks<br>
where the run times were short.<br>
<br>
In any event I don't think that we can just reason about<br>
the affects of t his code and then decide to remove<br>
it. We would need to generate some data with<br>
and without the code before making a decision.<br>
<br>
Jon
<div>
<div class="h5"><br>
<br>
<br>
On 4/24/2012 11:26 PM, Srinivas Ramakrishna wrote:<br>
</div>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div class="h5">
Hi Mikael --<br>
<br>
I am not convinced that the code here that is skeptical
about using<br>
weighted averages until<br>
there are sufficiently many samples in the history is
really that useful.<br>
I'd just as soon start using the<br>
original weight straightaway and dispense with the slow
circumspective<br>
march towards its eventual use.<br>
For example, with weight = 75, which would weight the
history 0.75 and the<br>
current sample 0.25 would need<br>
3 or 4 steps before it started using the original
weight. If you agree,<br>
then you can lose that code<br>
and your change to it., and thus avoid the divide by 0
issue that you faced,<br>
<br>
As to the background of that piece of code -- Did the
direct and immediate<br>
use of the weight cause the<br>
control circuits to oscillate initially? Has that been
observed or is it<br>
just a theoretical fear? (Although my<br>
theory says we need not have any such fears, especially
if we wait long<br>
enough before actuating the<br>
control surfaces.)<br>
<br>
Also, it appears as though there would be other serious
problems, see<br>
LinearLeastSquaresFit,<br>
if count() were to overflow. In fact many of the
calculations in<br>
LinearLeastSquareFit (and perhaps elsewhere<br>
where the count is used) would go unstable at such a
transition point. May<br>
be what we really need<br>
is a way to cull the count periodically to avoid its
overflow. I think one<br>
could probably do that without too<br>
much problem by keeping a boundary crossing (say at half
the available<br>
range for count) at which to do so,<br>
by correspondingly scaling the relevant quantities down
appropriately --<br>
sum_x, sum_x^2 etc.<br>
IIRC, this is a technique oftenused in digital filtering
and signal<br>
processing, when implementing IIR filters.<br>
<br>
-- ramki<br>
<br>
<br>
On Tue, Apr 24, 2012 at 11:01 AM, Mikael Vidstedt<<br>
<a moz-do-not-send="true"
href="mailto:mikael.vidstedt@oracle.com"
target="_blank">mikael.vidstedt@oracle.com</a>>
wrote:<br>
<br>
</div>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div class="h5">
Hi all,<br>
<br>
The statistical counters in gcUtil are used to keep
track of historical<br>
information about various key metrics in the garbage
collectors. Built in<br>
to the core AdaptiveWeightedAverage base class is the
concept of aging the<br>
values, essentially treating the first 100 values
differently and putting<br>
more weight on them since there's not yet enough
historical data built up.<br>
<br>
In the class there is a 32-bit counter (_sample_count)
that incremented<br>
for every sample and used to compute scale the weight
of the added value<br>
(see compute_adaptive_average), and the scaling logic
divides 100 by the<br>
count. In the normal case this is not a problem - the
counters are reset<br>
every once in a while and/or grow very slowly. In some
pathological cases<br>
the counter will however continue to increment and
eventually<br>
overflow/wrap, meaning the 32-bit count will go back
to zero and the<br>
division in compute_adaptive_average will lead to a
div-by-zero crash.<br>
<br>
The test case where this is observed is a test that
stress tests<br>
allocation in combination with the GC locker.
Specifically, the test is<br>
</div>
</div>
multi-threaded which pounds on java.util.zip.Deflater.**deflate,
which
<div class="im"><br>
internally uses the GetPrimitiveArrayCritical JNI
function to temporarily<br>
lock out the GC (using the GC locker). The garbage
collector used is in<br>
this case the parallel scavenger and the the counter
that overflows is<br>
_avg_pretenured. _avg_pretenured is incremented/sampled
every time an<br>
allocation is made directly in the old gen, which I
believe is more likely<br>
when the GC locker is active.<br>
<br>
<br>
The suggested fix is to only perform the division in<br>
compute_adaptive_average when it is relevant, which
currently is for the<br>
first 100 values. Once there are more than 100 samples
there is no longer a<br>
need to scale the weight.<br>
<br>
This problem is tracked in 7158457 (stress: jdk7 u4 core
dumps during<br>
megacart stress test run).<br>
<br>
Please review and comment on the webrev below:<br>
<br>
</div>
<a moz-do-not-send="true"
href="http://cr.openjdk.java.net/%7E**mikael/7158457/webrev.00"
target="_blank">http://cr.openjdk.java.net/~**mikael/7158457/webrev.00</a><<a
moz-do-not-send="true"
href="http://cr.openjdk.java.net/%7Emikael/7158457/webrev.00"
target="_blank">http://cr.openjdk.java.net/%7Emikael/7158457/webrev.00</a>><br>
<br>
Thanks,<br>
Mikael<br>
<br>
<br>
</blockquote>
</blockquote>
</blockquote>
</div>
<br>
</blockquote>
<br>
</body>
</html>