RFR(S): 7158457: division by zero in AdaptiveWeightedAverage

Srinivas Ramakrishna ysr1729 at gmail.com
Wed Apr 25 08:36:36 UTC 2012


Maybe you could deal with the high initial variance of a smaller data set
by capping the variance (or using a "gentler surrogate" such as the average
absolute difference) in the padding calculations until the sample population
is sufficiently large (say 10 or so... 100 seems too much to wait for, even
for skewed, non-normal statistics). I know, it's one ad-hoc'ism versus
another here, and I agree that any such change would have to be empirically
tested for stability and performance etc.
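
As a rough illustration of the capping idea, here is a standalone C++ sketch;
the function name, the 10-sample threshold, and the cap of "half the running
average" are illustrative choices, not the actual gcUtil code:

    #include <algorithm>
    #include <cmath>

    // Hypothetical helper: while the sample population is still small, cap
    // the deviation term before it is used to pad the average, so that an
    // early, noisy variance estimate cannot dominate sizing decisions.
    double padded_estimate(double avg, double deviation,
                           unsigned sample_count, double padding) {
      const unsigned min_samples = 10;               // "say 10 or so"
      double dev = deviation;
      if (sample_count < min_samples) {
        dev = std::min(dev, 0.5 * std::fabs(avg));   // cap the early deviation
      }
      return avg + padding * dev;                    // padded value for sizing
    }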

My remark still stands, I think, that this specific fix merely papers over
a minor symptom while the larger problem of counter overflow remains in
other places in the code; fixing that issue once and for all would likely
also fix the symptom that Mikael is trying to fix here.

But I haven't looked at all of this code carefully in a long time, so I'll
let you folks deal with this
as you think is correct in the long-term and expedient in the short-term.

-- ramki

On Wed, Apr 25, 2012 at 12:44 AM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:

> Ramki,
>
> I'm having to reconstruct how I arrived at this code (and
> the reconstruction may be faulty) but I think the problem
> was that at VM initialization there was sometimes
> more variance in the pause times and I didn't want
> to be weighted toward a pause time that was atypically
> smaller or larger.   I know I had to do extra work to get the
> heap to grow quickly at the beginning in order to
> achieve throughput that was close to the throughput
> with a tuned heap.  This was on client benchmarks
> where the run times were short.
>
> In any event I don't think that we can just reason about
> the effects of this code and then decide to remove
> it.  We would need to generate some data with
> and without the code before making a decision.
>
> Jon
>
>
>
> On 4/24/2012 11:26 PM, Srinivas Ramakrishna wrote:
>
>> Hi Mikael --
>>
>> I am not convinced that the code here that is skeptical about using
>> weighted averages until there are sufficiently many samples in the
>> history is really that useful. I'd just as soon start using the original
>> weight straightaway and dispense with the slow, circumspect march towards
>> its eventual use. For example, with weight = 75, which would weight the
>> history 0.75 and the current sample 0.25, it would take 3 or 4 steps
>> before the code started using the original weight. If you agree, then you
>> can lose that code and your change to it, and thus avoid the divide-by-0
>> issue that you faced.
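
For concreteness, the warm-up arithmetic being referred to, assuming the
boosted weight is max(weight, 100 / sample_count) -- consistent with the
"divide 100 by the count" scaling described further down -- and assuming
"weight" here is the percentage given to the new sample (so a 0.75 history
weight corresponds to weight = 25):

    #include <algorithm>
    #include <cstdio>

    int main() {
      const unsigned weight = 25;     // 25% new sample, 75% history
      for (unsigned count = 1; count <= 5; count++) {
        unsigned effective = std::max(weight, 100 / count);
        std::printf("sample %u: effective new-sample weight = %u%%\n",
                    count, effective);
      }
      // Prints 100, 50, 33, 25, 25 -- i.e. the configured weight only takes
      // over after the third or fourth sample.
      return 0;
    }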
>>
>> As to the background of that piece of code -- Did the direct and immediate
>> use of the weight cause the
>> control circuits to oscillate initially? Has that been observed or is it
>> just a theoretical fear? (Although my
>> theory says we need not have any such fears, especially if we wait long
>> enough before actuating the
>> control surfaces.)
>>
>> Also, it appears as though there would be other serious problems, see
>> LinearLeastSquareFit, if count() were to overflow. In fact many of the
>> calculations in LinearLeastSquareFit (and perhaps elsewhere where the
>> count is used) would go unstable at such a transition point. Maybe what
>> we really need is a way to cull the count periodically to avoid its
>> overflow. I think one could probably do that without too much trouble by
>> picking a boundary crossing (say at half the available range for count)
>> at which to scale the relevant quantities down correspondingly --
>> sum_x, sum_x^2, etc. IIRC, this is a technique often used in digital
>> filtering and signal processing, when implementing IIR filters.
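
A minimal sketch of that culling/renormalization idea; the struct and member
names only mimic the kind of sums a least-squares fit keeps (this is not the
actual LinearLeastSquareFit code), and the halving boundary is arbitrary:

    #include <cstdint>

    struct LeastSquaresState {
      uint32_t count = 0;
      double sum_x = 0.0, sum_xx = 0.0, sum_y = 0.0, sum_xy = 0.0;

      void add_sample(double x, double y) {
        count++;
        sum_x += x;  sum_xx += x * x;
        sum_y += y;  sum_xy += x * y;
        const uint32_t boundary = UINT32_MAX / 2;
        if (count >= boundary) {
          // Halve the count and the sums together: quantities such as
          // slope = (count*sum_xy - sum_x*sum_y) / (count*sum_xx - sum_x*sum_x)
          // are unchanged at the moment of scaling, the counter can never
          // wrap to zero, and later samples simply carry relatively more
          // weight -- the usual IIR-style gradual forgetting.
          count  /= 2;
          sum_x  /= 2;  sum_xx /= 2;
          sum_y  /= 2;  sum_xy /= 2;
        }
      }
    };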
>>
>> -- ramki
>>
>>
>> On Tue, Apr 24, 2012 at 11:01 AM, Mikael Vidstedt <mikael.vidstedt at oracle.com> wrote:
>>
>>> Hi all,
>>>
>>> The statistical counters in gcUtil are used to keep track of historical
>>> information about various key metrics in the garbage collectors. Built in
>>> to the core AdaptiveWeightedAverage base class is the concept of aging
>>> the
>>> values, essentially treating the first 100 values differently and putting
>>> more weight on them since there's not yet enough historical data built
>>> up.
>>>
>>> In the class there is a 32-bit counter (_sample_count) that is incremented
>>> for every sample and used to scale the weight of the added value (see
>>> compute_adaptive_average); the scaling logic divides 100 by the count. In
>>> the normal case this is not a problem - the counters are reset every once
>>> in a while and/or grow very slowly. In some pathological cases, however,
>>> the counter will continue to increment and eventually overflow/wrap,
>>> meaning the 32-bit count will go back to zero and the division in
>>> compute_adaptive_average will lead to a div-by-zero crash.
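
Paraphrased, the scaling logic has roughly this shape (a sketch, not the
exact HotSpot source):

    #include <algorithm>
    #include <cstdint>

    // Early samples get extra weight by dividing 100 by the sample count.
    // Once the 32-bit counter has wrapped back around to zero, the division
    // faults.
    unsigned compute_adaptive_weight(unsigned weight, uint32_t sample_count) {
      unsigned count_weight = 100 / sample_count;   // div-by-zero if wrapped
      return std::max(weight, count_weight);
    }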
>>>
>>> The test case where this is observed is a test that stress tests
>>> allocation in combination with the GC locker. Specifically, the test is
>>> multi-threaded and pounds on java.util.zip.Deflater.deflate, which
>>> internally uses the GetPrimitiveArrayCritical JNI function to temporarily
>>> lock out the GC (using the GC locker). The garbage collector used in this
>>> case is the parallel scavenger, and the counter that overflows is
>>> _avg_pretenured. _avg_pretenured is incremented/sampled every time an
>>> allocation is made directly in the old gen, which I believe is more
>>> likely when the GC locker is active.
>>>
>>>
>>> The suggested fix is to only perform the division in
>>> compute_adaptive_average when it is relevant, which currently is for the
>>> first 100 values. Once there are more than 100 samples there is no
>>> longer a
>>> need to scale the weight.
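
A sketch of what such a guard could look like (illustrative only, not the
actual webrev):

    #include <algorithm>
    #include <cstdint>

    // Only the first 100 samples need the boosted weight (beyond that,
    // 100 / count is <= 1 anyway), so the division is skipped entirely for
    // larger -- or wrapped -- counts.
    unsigned compute_adaptive_weight(unsigned weight, uint32_t sample_count) {
      if (sample_count > 0 && sample_count <= 100) {
        unsigned count_weight = 100 / sample_count;
        return std::max(weight, count_weight);
      }
      return weight;
    }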
>>>
>>> This problem is tracked in 7158457 (stress: jdk7 u4 core dumps during
>>> megacart stress test run).
>>>
>>> Please review and comment on the webrev below:
>>>
>>> http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>>>
>>> Thanks,
>>> Mikael
>>>
>>>
>>>