RFR(S): 7158457: division by zero in adaptiveweightedaverage

Jon Masamitsu jon.masamitsu at oracle.com
Wed Apr 25 07:44:20 UTC 2012


Ramki,

I'm having to reconstruct how I arrived at this code (and
the reconstruction may be faulty) but I think the problem
was that at VM initialization there was sometimes
more variance in the pause times and I didn't want
to be weighted toward a pause time that was atypically
smaller or larger.   I know I had to do extra work to get the
heap to grow quickly at the beginning in order to
achieve throughput that was close to the throughput
with a tuned heap.  This was on client benchmarks
where the run times were short.

In any event I don't think that we can just reason about
the effects of this code and then decide to remove
it.  We would need to generate some data with
and without the code before making a decision.

Jon


On 4/24/2012 11:26 PM, Srinivas Ramakrishna wrote:
> Hi Mikael --
>
> I am not convinced that the code here that is skeptical about using
> weighted averages until
> there are sufficiently many samples in the history is really that useful.
> I'd just as soon start using the
> original weight straightaway and dispense with the slow circumspective
> march towards its eventual use.
> For example, with weight = 75, which weights the history 0.75 and the
> current sample 0.25, the average would need
> 3 or 4 steps before it started using the original weight. If you agree,
> then you can lose that code
> and your change to it, and thus avoid the divide-by-0 issue that you faced.
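>
> (To make the "3 or 4 steps" concrete -- this is my reading of the warm-up
> scaling as effective weight = max(configured weight, 100/count), a sketch
> rather than the exact HotSpot code:)

```cpp
#include <algorithm>

// Hypothetical arithmetic for the warm-up scaling discussed in this
// thread: each new sample gets max(configured_weight, 100/sample_count)
// percent of the average.  With a configured sample weight of 25
// (history 0.75), the first samples get 100, 50, 33, 25, ... -- so the
// configured weight is in effect from roughly the 4th sample on.
unsigned effective_weight(unsigned configured, unsigned sample_count) {
  return std::max(configured, 100u / sample_count);
}
```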
>
> As to the background of that piece of code -- Did the direct and immediate
> use of the weight cause the
> control circuits to oscillate initially? Has that been observed or is it
> just a theoretical fear? (Although my
> theory says we need not have any such fears, especially if we wait long
> enough before actuating the
> control surfaces.)
>
> Also, it appears as though there would be other serious problems, see
> LinearLeastSquaresFit,
> if count() were to overflow. In fact many of the calculations in
> LinearLeastSquareFit (and perhaps elsewhere
> where the count is used) would go unstable at such a transition point. May
> be what we really need
> is a way to cull the count periodically to avoid its overflow. I think one
> could probably do that without too
> much problem by keeping a boundary crossing (say at half the available
> range for count) at which to do so,
> by correspondingly scaling the relevant quantities down appropriately --
> sum_x, sum_x^2 etc.
> IIRC, this is a technique often used in digital filtering and signal
> processing, when implementing IIR filters.
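>
> (A sketch of what I mean -- names are illustrative, not taken from
> LinearLeastSquareFit.  Since the fitted slope and intercept depend only on
> ratios of the count and the running sums, halving them all together at a
> boundary leaves the fit unchanged while keeping the count far from
> overflow:)

```cpp
#include <cstdint>
#include <limits>

// Hypothetical culling scheme: when the sample count crosses half the
// available range, scale the count and all dependent running sums down
// by the same factor.  The least-squares slope
//   (n*sum_xy - sum_x*sum_y) / (n*sum_xx - sum_x^2)
// scales by (1/4)/(1/4) = 1 under halving, so the fit is preserved
// (modulo a rounding step when the count is odd).
struct LeastSquaresState {
  uint32_t count = 0;
  double sum_x = 0, sum_xx = 0, sum_y = 0, sum_xy = 0;

  void add(double x, double y) {
    ++count;
    sum_x += x;  sum_xx += x * x;
    sum_y += y;  sum_xy += x * y;
    // Cull at half the counter's range, well before it can wrap.
    if (count >= std::numeric_limits<uint32_t>::max() / 2) {
      cull();
    }
  }

  // Scale everything down together; ratios (and hence the fit) survive.
  void cull() {
    count  /= 2;
    sum_x  /= 2;  sum_xx /= 2;
    sum_y  /= 2;  sum_xy /= 2;
  }

  double slope() const {
    double n = count;
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x);
  }
};
```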
>
> -- ramki
>
>
> On Tue, Apr 24, 2012 at 11:01 AM, Mikael Vidstedt<
> mikael.vidstedt at oracle.com>  wrote:
>
>> Hi all,
>>
>> The statistical counters in gcUtil are used to keep track of historical
>> information about various key metrics in the garbage collectors. Built in
>> to the core AdaptiveWeightedAverage base class is the concept of aging the
>> values, essentially treating the first 100 values differently and putting
>> more weight on them since there's not yet enough historical data built up.
>>
>> In the class there is a 32-bit counter (_sample_count) that is incremented
>> for every sample and used to scale the weight of the added value
>> (see compute_adaptive_average), and the scaling logic divides 100 by the
>> count. In the normal case this is not a problem - the counters are reset
>> every once in a while and/or grow very slowly. In some pathological cases
>> the counter will however continue to increment and eventually
>> overflow/wrap, meaning the 32-bit count will go back to zero and the
>> division in compute_adaptive_average will lead to a div-by-zero crash.
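>>
>> A minimal sketch of the failure mode (names and arithmetic approximated
>> from the description above, not copied from the HotSpot source):

```cpp
#include <cstdint>

// Simplified reconstruction of the aging logic described in the mail.
class AdaptiveWeightedAverage {
  uint32_t _sample_count;  // 32-bit; wraps to 0 after 2^32 samples
  unsigned _weight;        // percent weight given to the new sample
  float    _average;

public:
  explicit AdaptiveWeightedAverage(unsigned weight)
    : _sample_count(0), _weight(weight), _average(0.0f) {}

  void sample(float new_sample) {
    ++_sample_count;  // after wrap-around this increment yields 0 again
    // Unguarded warm-up scaling: early samples get more weight by
    // dividing 100 by the count.  Once _sample_count has wrapped to
    // zero, this division is the div-by-zero crash.
    unsigned count_weight = 100 / _sample_count;
    unsigned w = count_weight > _weight ? count_weight : _weight;
    if (w > 100) w = 100;
    _average = ((100.0f - w) * _average + w * new_sample) / 100.0f;
  }

  float average() const { return _average; }
};
```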
>>
>> The test case where this is observed is a test that stress tests
>> allocation in combination with the GC locker. Specifically, the test is
>> multi-threaded and pounds on java.util.zip.Deflater.deflate, which
>> internally uses the GetPrimitiveArrayCritical JNI function to temporarily
>> lock out the GC (using the GC locker). The garbage collector used in
>> this case is the parallel scavenger, and the counter that overflows is
>> _avg_pretenured. _avg_pretenured is incremented/sampled every time an
>> allocation is made directly in the old gen, which I believe is more likely
>> when the GC locker is active.
>>
>>
>> The suggested fix is to only perform the division in
>> compute_adaptive_average when it is relevant, which currently is for the
>> first 100 values. Once there are more than 100 samples there is no longer a
>> need to scale the weight.
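>>
>> A sketch of the guarded computation (a simplified reconstruction; the
>> webrev has the actual change):

```cpp
#include <cstdint>
#include <algorithm>

// Sketch of the proposed guard: compute the 100/count scaling only
// while the average is still within its first 100 samples, so a
// wrapped (zero) count can never reach the division.
unsigned adaptive_weight(uint32_t sample_count, unsigned weight) {
  unsigned count_weight = 0;
  if (sample_count > 0 && sample_count <= 100) {
    count_weight = 100 / sample_count;
  }
  return std::max(weight, count_weight);
}
```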
>>
>> This problem is tracked in 7158457 (stress: jdk7 u4 core dumps during
>> megacart stress test run).
>>
>> Please review and comment on the webrev below:
>>
>> http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>>
>> Thanks,
>> Mikael
>>
>>



More information about the hotspot-gc-dev mailing list