RFR(S): 7158457: division by zero in adaptiveweightedaverage

Mikael Vidstedt mikael.vidstedt at oracle.com
Wed Apr 25 21:51:32 UTC 2012


Yes, I certainly agree that this is just fixing the symptom short-term. 
I'd be happy if somebody else would be interested in picking it up to 
look at and address the real problem.

Cheers,
Mikael

On 2012-04-25 01:36, Srinivas Ramakrishna wrote:
> Maybe you could deal with high initial variance of a smaller data set
> by capping the variance
> (or using a "gentler surrogate" such as average absolute difference) 
> in the padding calculations
> until the sample population is sufficiently large (say 10 or so... 100 
> seems too much to wait for,
> even for skewed, non-normal statistics). I know, what is one 
> ad-hoc'ism versus another here, and
> I agree that any such change would have to be empirically tested for 
> stability and performance etc.
>
> My remark that this specific fix merely papers over a minor symptom
> while the larger problem of counter overflow remains in other places
> in the code still stands, I think, and fixing that issue once and
> for all would likely fix the symptom that Mikael is trying to fix
> here.
>
> But I haven't looked at all of this code carefully in a long time, so 
> I'll let you folks deal with this
> as you think is correct in the long-term and expedient in the short-term.
>
> -- ramki
>
> On Wed, Apr 25, 2012 at 12:44 AM, Jon Masamitsu 
> <jon.masamitsu at oracle.com <mailto:jon.masamitsu at oracle.com>> wrote:
>
>     Ramki,
>
>     I'm having to reconstruct how I arrived at this code (and
>     the reconstruction may be faulty) but I think the problem
>     was that at VM initialization there was sometimes
>     more variance in the pause times and I didn't want
>     to be weighted toward a pause time that was atypically
>     smaller or larger.   I know I had to do extra work to get the
>     heap to grow quickly at the beginning in order to
>     achieve throughput that was close to the throughput
>     with a tuned heap.  This was on client benchmarks
>     where the run times were short.
>
>     In any event I don't think that we can just reason about
>     the effects of this code and then decide to remove
>     it.  We would need to generate some data with
>     and without the code before making a decision.
>
>     Jon
>
>
>
>     On 4/24/2012 11:26 PM, Srinivas Ramakrishna wrote:
>
>         Hi Mikael --
>
>         I am not convinced that the code here that is skeptical about
>         using
>         weighted averages until
>         there are sufficiently many samples in the history is really
>         that useful.
>         I'd just as soon start using the
>         original weight straightaway and dispense with the slow
>         circumspective
>         march towards its eventual use.
>         For example, with weight = 75, which would weight the history
>         0.75 and the current sample 0.25, it would need 3 or 4 steps
>         before it started using the original weight. If you agree,
>         then you can lose that code and your change to it, and thus
>         avoid the divide-by-0 issue that you faced.
>
>
>         As to the background of that piece of code -- Did the direct
>         and immediate
>         use of the weight cause the
>         control circuits to oscillate initially? Has that been
>         observed or is it
>         just a theoretical fear? (Although my
>         theory says we need not have any such fears, especially if we
>         wait long
>         enough before actuating the
>         control surfaces.)
>
>         Also, it appears as though there would be other serious
>         problems, see LinearLeastSquareFit, if count() were to
>         overflow. In fact many of the calculations in
>         LinearLeastSquareFit (and perhaps elsewhere where the count
>         is used) would go unstable at such a transition point. Maybe
>         what we really need is a way to cull the count periodically
>         to avoid its overflow. I think one could probably do that
>         without too much trouble by keeping a boundary crossing (say
>         at half the available range for count) at which to do so,
>         by correspondingly scaling the relevant quantities down
>         appropriately -- sum_x, sum_x^2, etc. IIRC, this is a
>         technique often used in digital filtering and signal
>         processing, when implementing IIR filters.
>
>         -- ramki
>
>
>         On Tue, Apr 24, 2012 at 11:01 AM, Mikael Vidstedt<
>         mikael.vidstedt at oracle.com
>         <mailto:mikael.vidstedt at oracle.com>>  wrote:
>
>             Hi all,
>
>             The statistical counters in gcUtil are used to keep track
>             of historical
>             information about various key metrics in the garbage
>             collectors. Built in
>             to the core AdaptiveWeightedAverage base class is the
>             concept of aging the
>             values, essentially treating the first 100 values
>             differently and putting
>             more weight on them since there's not yet enough
>             historical data built up.
>
>             In the class there is a 32-bit counter (_sample_count)
>             that is incremented for every sample and used to scale
>             the weight of the added value (see
>             compute_adaptive_average); the scaling logic divides 100
>             by the count. In the normal case this is not a problem -
>             the counters are reset every once in a while and/or grow
>             very slowly. In some pathological cases the counter will
>             however continue to increment and eventually
>             overflow/wrap, meaning the 32-bit count will go back to
>             zero and the division in compute_adaptive_average will
>             lead to a div-by-zero crash.
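For reference, the aging logic being described can be sketched like this (simplified, with illustrative names; see gcUtil for the real code):

```cpp
#include <cstdint>
#include <algorithm>

// Simplified sketch of the aging in compute_adaptive_average. _weight
// is a percentage: e.g. 25 means a new sample normally contributes 25%.
struct AdaptiveWeightedAverageSketch {
  uint32_t _sample_count = 0;
  float    _average      = 0.0f;
  unsigned _weight;

  explicit AdaptiveWeightedAverageSketch(unsigned weight) : _weight(weight) {}

  void sample(float new_sample) {
    ++_sample_count;  // 32-bit: wraps back to 0 after 2^32 samples...
    // ...at which point this division crashes with div-by-zero:
    unsigned count_weight = 100 / _sample_count;
    // While young, boost the weight so early samples dominate.
    unsigned adaptive_weight = std::max(_weight, count_weight);
    _average = ((100 - adaptive_weight) * _average +
                adaptive_weight * new_sample) / 100.0f;
  }
};
```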
>
>             The test case where this is observed is a test that
>             stresses allocation in combination with the GC locker.
>             Specifically, the test is multi-threaded and pounds on
>             java.util.zip.Deflater.deflate, which internally uses
>             the GetPrimitiveArrayCritical JNI function to temporarily
>             lock out the GC (using the GC locker). The garbage
>             collector used in this case is the parallel scavenger,
>             and the counter that overflows is _avg_pretenured.
>             _avg_pretenured is incremented/sampled every time an
>             allocation is made directly in the old gen, which I
>             believe is more likely when the GC locker is active.
>
>
>             The suggested fix is to only perform the division in
>             compute_adaptive_average when it is relevant, which
>             currently is for the
>             first 100 values. Once there are more than 100 samples
>             there is no longer a
>             need to scale the weight.
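Concretely, the guard could look something like this (a sketch only; the sticky flag is one way to make the guard robust against a later wrap of the counter -- the actual change is in the webrev):

```cpp
#include <cstdint>
#include <algorithm>

// Sketch of the suggested guard: only divide while the average is still
// "young". The sticky _is_old flag (an illustrative addition, not
// necessarily the webrev's exact shape) remembers that 100 samples have
// been seen, so a wrapped count cannot reintroduce the division.
struct GuardedAverageSketch {
  static const uint32_t OLD_THRESHOLD = 100;
  uint32_t _sample_count = 0;
  bool     _is_old       = false;  // stays true even if the count wraps
  float    _average      = 0.0f;
  unsigned _weight;

  explicit GuardedAverageSketch(unsigned weight) : _weight(weight) {}

  void sample(float new_sample) {
    ++_sample_count;
    if (!_is_old && _sample_count >= OLD_THRESHOLD) _is_old = true;
    unsigned adaptive_weight = _weight;
    if (!_is_old && _sample_count > 0) {
      // Division only happens here, and only for a non-zero count.
      adaptive_weight = std::max(_weight, 100 / _sample_count);
    }
    _average = ((100 - adaptive_weight) * _average +
                adaptive_weight * new_sample) / 100.0f;
  }
};
```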
>
>             This problem is tracked in 7158457 (stress: jdk7 u4 core
>             dumps during
>             megacart stress test run).
>
>             Please review and comment on the webrev below:
>
>             http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>
>             Thanks,
>             Mikael
>
>
>



More information about the hotspot-gc-dev mailing list