RFR(S): 7158457: division by zero in adaptiveweightedaverage
Mikael Vidstedt
mikael.vidstedt at oracle.com
Wed Apr 25 21:51:32 UTC 2012
Yes, I certainly agree that this is just fixing the symptom short-term.
I'd be happy if somebody else would be interested in picking it up to
look at and address the real problem.
Cheers,
Mikael
On 2012-04-25 01:36, Srinivas Ramakrishna wrote:
> Maybe you could deal with the high initial variance of a smaller
> data set by capping the variance (or using a "gentler surrogate"
> such as the average absolute difference) in the padding
> calculations until the sample population is sufficiently large
> (say 10 or so... 100 seems too much to wait for, even for skewed,
> non-normal statistics). I know, what is one ad-hoc'ism versus
> another here, and I agree that any such change would have to be
> empirically tested for stability and performance etc.
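The idea above might be sketched roughly as follows. This is purely illustrative: the function name, parameters (`avg_abs_diff`, `padding`), and the threshold of 10 are all hypothetical stand-ins for whatever the real padding calculation uses, not actual HotSpot code.

```cpp
#include <cstdint>

// Illustrative sketch of the suggestion: until the sample population is
// large enough, substitute a gentler surrogate (the average absolute
// difference) for the noisy early deviation estimate when padding.
// All names here are hypothetical, not the HotSpot API.
double padded_average(double avg, double deviation, double avg_abs_diff,
                      uint32_t sample_count, double padding) {
  const uint32_t MIN_SAMPLES = 10;  // "say 10 or so", per the suggestion
  double dev = (sample_count < MIN_SAMPLES) ? avg_abs_diff : deviation;
  return avg + padding * dev;
}
```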
>
> My remark still stands, I think, that this specific fix merely
> papers over a minor symptom while the larger problem of counter
> overflow remains in other places in the code, and fixing that
> issue once and for all would likely fix the symptom that Mikael
> is trying to fix here.
>
> But I haven't looked at all of this code carefully in a long time, so
> I'll let you folks deal with this
> as you think is correct in the long-term and expedient in the short-term.
>
> -- ramki
>
> On Wed, Apr 25, 2012 at 12:44 AM, Jon Masamitsu
> <jon.masamitsu at oracle.com <mailto:jon.masamitsu at oracle.com>> wrote:
>
> Ramki,
>
> I'm having to reconstruct how I arrived at this code (and
> the reconstruction may be faulty) but I think the problem
> was that at VM initialization there was sometimes
> more variance in the pause times and I didn't want
> to be weighted toward a pause time that was atypically
> smaller or larger. I know I had to do extra work to get the
> heap to grow quickly at the beginning in order to
> achieve throughput that was close to the throughput
> with a tuned heap. This was on client benchmarks
> where the run times were short.
>
> In any event I don't think that we can just reason about
> the effects of this code and then decide to remove
> it. We would need to generate some data with
> and without the code before making a decision.
>
> Jon
>
>
>
> On 4/24/2012 11:26 PM, Srinivas Ramakrishna wrote:
>
> Hi Mikael --
>
> I am not convinced that the code here that is skeptical about
> using weighted averages until there are sufficiently many
> samples in the history is really that useful. I'd just as soon
> start using the original weight straightaway and dispense with
> the slow, circumspect march towards its eventual use.
> For example, with weight = 75, which would weight the history
> 0.75 and the current sample 0.25, it would take 3 or 4 steps
> before the original weight started being used. If you agree,
> then you can lose that code and your change to it, and thus
> avoid the divide-by-zero issue that you faced.
>
> As to the background of that piece of code -- Did the direct
> and immediate
> use of the weight cause the
> control circuits to oscillate initially? Has that been
> observed or is it
> just a theoretical fear? (Although my
> theory says we need not have any such fears, especially if we
> wait long
> enough before actuating the
> control surfaces.)
>
> Also, it appears as though there would be other serious
> problems, see LinearLeastSquareFit, if count() were to
> overflow. In fact many of the calculations in
> LinearLeastSquareFit (and perhaps elsewhere where the count
> is used) would go unstable at such a transition point.
> Maybe what we really need is a way to cull the count
> periodically to avoid its overflow. I think one could
> probably do that without too much trouble by keeping a
> boundary crossing (say at half the available range for
> count) at which to do so, by correspondingly scaling the
> relevant quantities down appropriately -- sum_x, sum_x^2
> etc. IIRC, this is a technique often used in digital
> filtering and signal processing, when implementing IIR
> filters.
>
> -- ramki
>
>
> On Tue, Apr 24, 2012 at 11:01 AM, Mikael Vidstedt<
> mikael.vidstedt at oracle.com
> <mailto:mikael.vidstedt at oracle.com>> wrote:
>
> Hi all,
>
> The statistical counters in gcUtil are used to keep track
> of historical
> information about various key metrics in the garbage
> collectors. Built in
> to the core AdaptiveWeightedAverage base class is the
> concept of aging the
> values, essentially treating the first 100 values
> differently and putting
> more weight on them since there's not yet enough
> historical data built up.
>
> In the class there is a 32-bit counter (_sample_count)
> that is incremented for every sample and used to scale
> the weight of the added value (see
> compute_adaptive_average); the scaling logic divides 100
> by the count. In the normal case this is not a problem -
> the counters are reset every once in a while and/or grow
> very slowly. In some pathological cases, however, the
> counter will continue to increment and eventually
> overflow/wrap, meaning the 32-bit count will go back to
> zero and the division in compute_adaptive_average will
> lead to a div-by-zero crash.
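For reference, the scaling logic being described looks roughly like this. It is a paraphrase of the behavior described above, not the verbatim HotSpot source; note what happens once the 32-bit count wraps to zero.

```cpp
#include <algorithm>
#include <cstdint>

// Rough paraphrase of the aging logic described above (not the verbatim
// HotSpot source). While few samples have been seen, 100/count inflates
// the effective weight so early samples count for more. If the 32-bit
// sample count ever wraps around to 0, the division crashes.
uint32_t adaptive_weight(uint32_t sample_count, uint32_t weight) {
  uint32_t count_weight = 100 / sample_count;  // div-by-zero if count == 0
  return std::max(weight, count_weight);
}
```

With weight = 25, for example, the effective weight over the first samples is 100, 50, 33, 25, 25, ... which is the few-step ramp discussed earlier in the thread.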
>
> The test case where this is observed is a test that
> stresses allocation in combination with the GC locker.
> Specifically, the test is multi-threaded and pounds on
> java.util.zip.Deflater.deflate, which internally uses
> the GetPrimitiveArrayCritical JNI function to
> temporarily lock out the GC (using the GC locker). The
> garbage collector used in this case is the parallel
> scavenger and the counter that overflows is
> _avg_pretenured. _avg_pretenured is incremented/sampled
> every time an allocation is made directly in the old
> gen, which I believe is more likely when the GC locker
> is active.
>
>
> The suggested fix is to only perform the division in
> compute_adaptive_average when it is relevant, which
> currently is for the
> first 100 values. Once there are more than 100 samples
> there is no longer a
> need to scale the weight.
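The guard might look something like this; it is a sketch of the approach described, not the actual webrev.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the suggested fix (illustrative, not the actual webrev):
// only perform the 100/count scaling while the count is inside the
// 100-sample aging window. A count of 0 - including one that has
// wrapped around the 32-bit range - never reaches the division.
uint32_t adaptive_weight_fixed(uint32_t sample_count, uint32_t weight) {
  if (sample_count > 0 && sample_count < 100) {
    return std::max(weight, 100 / sample_count);
  }
  return weight;  // past the aging window (or wrapped): use the raw weight
}
```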
>
> This problem is tracked in 7158457 (stress: jdk7 u4 core
> dumps during
> megacart stress test run).
>
> Please review and comment on the webrev below:
>
> http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>
> Thanks,
> Mikael
>
>
>