RFR(S): 7158457: division by zero in AdaptiveWeightedAverage

Mikael Vidstedt mikael.vidstedt at oracle.com
Fri May 11 19:13:05 UTC 2012


Final(?) version; the only difference is the removed assert in sample():

http://cr.openjdk.java.net/~mikael/7158457/webrev.02/

I'm also looking for a sponsor for this fix - any takers?

Thanks,
Mikael

On 2012-05-05 05:57, Srinivas Ramakrishna wrote:
> Mikael -- I agree that this fix would be good to get in, at least to 
> address the immediate crash.
>
> reviewed.
> -- ramki
>
> On Fri, May 4, 2012 at 3:03 AM, Mikael Vidstedt 
> <mikael.vidstedt at oracle.com> wrote:
>
>
>     Update:
>
>     I'd like to continue working on this issue, and as part of that
>     look at how the promoted and pretenured counters are used today
>     and potentially update the logic given the insights around bytes
>     vs. words etc.
>
>     I think it would be helpful to address the crash problem short
>     term though, and so I'd like to get feedback on the updated webrev:
>
>     http://cr.openjdk.java.net/~mikael/7158457/webrev.01/
>
>     It's essentially the same as the previous one with Igor's
>     suggested _is_old check improvement. Would it be ok to make this
>     fix and clone the bug to track the follow-up work cleaning up the
>     actual logic/heuristics?
>
>     Thanks,
>     Mikael
>
>
>     On 2012-04-26 03:38, Mikael Vidstedt wrote:
>>
>>     On 2012-04-24 23:50, Igor Veresov wrote:
>>>     Wasn't it the case that the _avg_pretenured value is formed
>>>     incorrectly? I don't think it should be sampled at every
>>>     allocation; I think it should measure the amount of data
>>>     allocated in the old gen between young collections. Also, if
>>>     you remember, we agreed that the dimensions are wrong. Or were
>>>     you able to find a better explanation as opposed to what we
>>>     discussed before?
>>
>>     Thanks for reminding me - I believe you're absolutely right. For
>>     a while I was thinking it actually did make sense after all, but
>>     I just did an experiment to see what actually happens at runtime:
>>
>>     The _avg_pretenured counter is indeed sampled every time an
>>     allocation is made directly in old gen. The value reflects the
>>     average size *in words* of the objects that were allocated. Put
>>     differently, it's the average size of pretenured objects in words.
>>
>>     The counter is used in PSAdaptiveSizePolicy::update_averages to
>>     calculate the average amount of promoted data:
>>
>>     avg_promoted()->sample(promoted + _avg_pretenured->padded_average());
>>
>>     The promoted parameter is the number of *bytes* that were just
>>     promoted. To sum it up, that appears to imply that there are two
>>     problems with the above computation:
>>
>>     1. It's taking the size of everything that was just promoted and
>>     adding the size of a single average pretenured object (as opposed
>>     to the total size of all recently pretenured objects)
>>     2. It's adding variables with different dimensions - promoted is
>>     in bytes, and the _avg_pretenured padded average is in words
>>
>>     Since that effectively means _avg_pretenured is off by a factor
>>     of (word_size * number_of_objects), I'm guessing it will in the
>>     common case not really matter to the calculation of avg_promoted...
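>>
>>     To make the units concrete, a dimension-consistent version of the
>>     computation might look like the sketch below. This is only an
>>     illustration - pretenured_words, a running total of the words
>>     allocated directly in old gen since the last young collection,
>>     does not exist in the current code:
>>
>>     // Convert the pretenured total from words to bytes so that both
>>     // terms of the sum have the same dimension (bytes).
>>     size_t pretenured_bytes = pretenured_words * HeapWordSize;
>>     avg_promoted()->sample(promoted + pretenured_bytes);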
>>
>>
>>>
>>>     As for the treatment of the symptom, the code looks good to me.
>>>     Do you think it might be beneficial to check the old value of
>>>     _is_old before assigning to it? It would cause less memory
>>>     traffic if increment_count() is called frequently.
>>>     void increment_count() {
>>>       _sample_count++;
>>>       if (!_is_old && _sample_count > OLD_THRESHOLD) {
>>>         _is_old = true;
>>>       }
>>>     }
>>
>>     Good suggestion. I'll update my change with this in case we need
>>     something urgently, but given the above issues it's likely a good
>>     idea to take another pass at this.
>>
>>     Thanks,
>>     Mikael
>>
>>>
>>>     igor
>>>
>>>     On Apr 24, 2012, at 8:01 AM, Mikael Vidstedt wrote:
>>>
>>>>
>>>>     Hi all,
>>>>
>>>>     The statistical counters in gcUtil are used to keep track of
>>>>     historical information about various key metrics in the garbage
>>>>     collectors. Built into the core AdaptiveWeightedAverage base
>>>>     class is the concept of aging the values, essentially treating
>>>>     the first 100 values differently and putting more weight on
>>>>     them since there's not yet enough historical data built up.
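>>>>
>>>>     (For reference, the underlying exponential average is of the
>>>>     form new_avg = ((100 - weight) * old_avg + weight * sample) / 100,
>>>>     so a larger weight makes the newest sample count for more.)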
>>>>
>>>>     In the class there is a 32-bit counter (_sample_count) that is
>>>>     incremented for every sample and used to scale the weight of
>>>>     the added value (see compute_adaptive_average); the scaling
>>>>     logic divides 100 by the count. In the normal case this is not
>>>>     a problem - the counters are reset every once in a while and/or
>>>>     grow very slowly. In some pathological cases, however, the
>>>>     counter will continue to increment and eventually overflow/wrap,
>>>>     meaning the 32-bit count will go back to zero and the division
>>>>     in compute_adaptive_average will lead to a div-by-zero crash.
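>>>>
>>>>     A simplified sketch of the failing pattern (not the exact
>>>>     HotSpot code, just its shape; MAX2 is HotSpot's max macro):
>>>>
>>>>     uint32_t _sample_count;  // incremented once per sample
>>>>
>>>>     unsigned scaled_weight(unsigned weight) {
>>>>       // Early samples get a boosted weight of 100/count. After
>>>>       // 2^32 samples _sample_count wraps back to 0 and this
>>>>       // division crashes with an integer div-by-zero.
>>>>       unsigned count_weight = 100 / _sample_count;
>>>>       return MAX2(weight, count_weight);
>>>>     }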
>>>>
>>>>     The test case where this is observed is a test that stress
>>>>     tests allocation in combination with the GC locker.
>>>>     Specifically, the test is multi-threaded and pounds on
>>>>     java.util.zip.Deflater.deflate, which internally uses the
>>>>     GetPrimitiveArrayCritical JNI function to temporarily lock out
>>>>     the GC (using the GC locker). The garbage collector used in
>>>>     this case is the parallel scavenger, and the counter that
>>>>     overflows is _avg_pretenured. _avg_pretenured is
>>>>     incremented/sampled every time an allocation is made directly
>>>>     in the old gen, which I believe is more likely when the GC
>>>>     locker is active.
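>>>>
>>>>     For reference, the JNI pattern involved looks roughly like the
>>>>     sketch below (an illustration, not Deflater's actual native
>>>>     code):
>>>>
>>>>     // While the critical section is held the GC locker blocks
>>>>     // collections, so allocations that cannot wait for a GC are
>>>>     // more likely to be satisfied directly from old gen - and
>>>>     // each such allocation samples _avg_pretenured.
>>>>     jbyte* buf = (jbyte*) env->GetPrimitiveArrayCritical(array, NULL);
>>>>     // ... deflate using buf ...
>>>>     env->ReleasePrimitiveArrayCritical(array, buf, 0);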
>>>>
>>>>
>>>>     The suggested fix is to only perform the division in
>>>>     compute_adaptive_average when it is relevant, which currently
>>>>     is for the first 100 values. Once there are more than 100
>>>>     samples there is no longer a need to scale the weight.
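>>>>
>>>>     In other words, the guard looks roughly like this (again a
>>>>     sketch of the idea rather than the exact change - see the
>>>>     webrev, where OLD_THRESHOLD is the 100-sample limit):
>>>>
>>>>     unsigned scaled_weight(unsigned weight) {
>>>>       if (!is_old()) {
>>>>         // Only the first OLD_THRESHOLD samples are scaled, so the
>>>>         // divisor can never be a wrapped-to-zero count.
>>>>         unsigned count_weight = 100 / _sample_count;
>>>>         return MAX2(weight, count_weight);
>>>>       }
>>>>       return weight;
>>>>     }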
>>>>
>>>>     This problem is tracked in 7158457 (stress: jdk7 u4 core dumps
>>>>     during megacart stress test run).
>>>>
>>>>     Please review and comment on the webrev below:
>>>>
>>>>     http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>>>>
>>>>     Thanks,
>>>>     Mikael
>>>>
>>>
>>
>
>


