<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
Final(?) version, only difference is the removed assert in sample():<br>
<br>
<a class="moz-txt-link-freetext" href="http://cr.openjdk.java.net/~mikael/7158457/webrev.02/">http://cr.openjdk.java.net/~mikael/7158457/webrev.02/</a><br>
<br>
I'm also looking for a sponsor for this fix - any takers?<br>
<br>
Thanks,<br>
Mikael<br>
<br>
On 2012-05-05 05:57, Srinivas Ramakrishna wrote:
<blockquote
cite="mid:CABzyjym-WtPkzokKACr1Arid3kiJXze2VX-8JxsC7hUmF4eY-A@mail.gmail.com"
type="cite">Mikael -- I agree that this fix would be good to get
in, at least to address the immediate crash.<br>
<br>
reviewed.<br>
-- ramki<br>
<br>
<div class="gmail_quote">On Fri, May 4, 2012 at 3:03 AM, Mikael
Vidstedt <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:mikael.vidstedt@oracle.com" target="_blank">mikael.vidstedt@oracle.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Update:<br>
<br>
I'd like to continue working on this issue, and as part of
that look at how the promoted and pretenured counters are
used today and potentially update the logic given the
insights around bytes vs. words etc.<br>
<br>
I think it would be helpful to address the crash problem
short term though, and so I'd like to get feedback on the
updated webrev:<br>
<br>
<a moz-do-not-send="true"
href="http://cr.openjdk.java.net/%7Emikael/7158457/webrev.01/"
target="_blank">http://cr.openjdk.java.net/~mikael/7158457/webrev.01/</a><br>
<br>
It's essentially the same as the previous one with Igor's
suggested _is_old check improvement. Would it be ok to make
this fix and clone the bug to track the follow-up work
cleaning up the actual logic/heuristics?<br>
<br>
Thanks,<br>
Mikael
<div>
<div class="h5"><br>
<br>
On 2012-04-26 03:38, Mikael Vidstedt wrote:
<blockquote type="cite"> <br>
On 2012-04-24 23:50, Igor Veresov wrote:
<blockquote type="cite">
<div>Wasn't it the case that _avg_pretenured value
is formed incorrectly? I don't think it should be
sampled at every allocation, I think it should
measure the amount of data allocated in the old
gen between young collections, also if you
remember we agreed that the dimensions are wrong.
Or were you able to find a better explanation as
opposed to what we discussed before?</div>
</blockquote>
<br>
Thanks for reminding me - I believe you're absolutely
right. For a while I was thinking it actually did make
sense after all, but I just did an experiment to see
what actually happens at runtime:<br>
<br>
The _avg_pretenured counter is indeed sampled every
time an allocation is made directly in old gen. The
actual value reflects the average size *in words* of
the object that was allocated. Put differently, it's
the average size of pretenured objects in words.<br>
<br>
The counter is used in
PSAdaptiveSizePolicy::update_averages to calculate the
average amount of promoted data:<br>
<br>
<font face="Courier New, Courier, monospace">avg_promoted()->sample(promoted
+ _avg_pretenured->padded_average());<br>
</font><br>
The promoted parameter is the number of *bytes* that
were just promoted. To sum it up, that appears to
imply that there are two problems with the above
computation:<br>
<br>
1. It's taking the size of everything that was just
promoted and adds the size of an average prenured
object (as opposed to the total size of all recently
pretenured objects)<br>
2. It's adding variables with different dimensions -
promoted is in bytes, and the _avg_pretenured padded
average is in words<br>
<br>
Since that effectively means _avg_pretenured it's off
by a factor (word_size * number_of_objects) I'm
guessing it will in the common case not really matter
to the calculation of avg_promoted...<br>
<br>
<br>
<blockquote type="cite">
<div><br>
</div>
<div>As for the treatment of the symptom, the code
looks good to me. Do you think it might be
beneficial to check the old value of _is_old
before assigning to it? Would cause less memory
traffic, if increment_count() is called
frequently.</div>
<div>
<pre style="font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:-webkit-auto;text-indent:0px;text-transform:none;word-spacing:0px"><span style="color:blue"> 60 void increment_count() {</span>
<span style="color:blue"> 61 _sample_count++;</span><font color="#0000ff">
</font></pre>
<pre style="font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:-webkit-auto;text-indent:0px;text-transform:none;word-spacing:0px"><span style="font-family:Helvetica;white-space:normal"><pre style="font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:-webkit-auto;text-indent:0px;text-transform:none;word-spacing:0px"><span style="color:blue"> </span><span><font color="#ff0000">62 if (!_is_old && _sample_count > OLD_THRESHOLD) {</font></span></pre></span><span style="color:blue"> 63 _is_old = true;</span>
<span style="color:blue"> 64 }</span>
<span style="color:blue"> 65 }</span></pre>
</div>
</blockquote>
<br>
Good suggestion. I'll update my change with this in
case we need something urgently, but given the above
issues it's likely a good idea to take another pass at
this.<br>
<br>
Thanks,<br>
Mikael<br>
<br>
<blockquote type="cite">
<div>
<div><br>
</div>
</div>
<div>igor</div>
<br>
<div>
<div>On Apr 24, 2012, at 8:01 AM, Mikael Vidstedt
wrote:</div>
<br>
<blockquote type="cite">
<div><br>
Hi all,<br>
<br>
The statistical counters in gcUtil are used to
keep track of historical information about
various key metrics in the garbage collectors.
Built in to the core AdaptiveWeightedAverage
base class is the concept of aging the values,
essentially treating the first 100 values
differently and putting more weight on them
since there's not yet enough historical data
built up.<br>
<br>
In the class there is a 32-bit counter
(_sample_count) that incremented for every
sample and used to compute scale the weight of
the added value (see
compute_adaptive_average), and the scaling
logic divides 100 by the count. In the normal
case this is not a problem - the counters are
reset every once in a while and/or grow very
slowly. In some pathological cases the counter
will however continue to increment and
eventually overflow/wrap, meaning the 32-bit
count will go back to zero and the division in
compute_adaptive_average will lead to a
div-by-zero crash.<br>
<br>
The test case where this is observed is a test
that stress tests allocation in combination
with the GC locker. Specifically, the test is
multi-threaded which pounds on
java.util.zip.Deflater.deflate, which
internally uses the GetPrimitiveArrayCritical
JNI function to temporarily lock out the GC
(using the GC locker). The garbage collector
used is in this case the parallel scavenger
and the the counter that overflows is
_avg_pretenured. _avg_pretenured is
incremented/sampled every time an allocation
is made directly in the old gen, which I
believe is more likely when the GC locker is
active.<br>
<br>
<br>
The suggested fix is to only perform the
division in compute_adaptive_average when it
is relevant, which currently is for the first
100 values. Once there are more than 100
samples there is no longer a need to scale the
weight.<br>
<br>
This problem is tracked in 7158457 (stress:
jdk7 u4 core dumps during megacart stress test
run).<br>
<br>
Please review and comment on the webrev below:<br>
<br>
<a moz-do-not-send="true"
href="http://cr.openjdk.java.net/%7Emikael/7158457/webrev.00"
target="_blank">http://cr.openjdk.java.net/~mikael/7158457/webrev.00</a><br>
<br>
Thanks,<br>
Mikael<br>
<br>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</body>
</html>