RFR(S): 7158457: division by zero in AdaptiveWeightedAverage

Srinivas Ramakrishna ysr1729 at gmail.com
Sat May 5 09:57:29 UTC 2012


Mikael -- I agree that this fix would be good to get in, at least to
address the immediate crash.

Reviewed.
-- ramki

On Fri, May 4, 2012 at 3:03 AM, Mikael Vidstedt
<mikael.vidstedt at oracle.com>wrote:

>
> Update:
>
> I'd like to continue working on this issue, and as part of that look at
> how the promoted and pretenured counters are used today and potentially
> update the logic given the insights around bytes vs. words etc.
>
> I think it would be helpful to address the crash problem short term
> though, and so I'd like to get feedback on the updated webrev:
>
> http://cr.openjdk.java.net/~mikael/7158457/webrev.01/
>
> It's essentially the same as the previous one with Igor's suggested
> _is_old check improvement. Would it be ok to make this fix and clone the
> bug to track the follow-up work cleaning up the actual logic/heuristics?
>
> Thanks,
> Mikael
>
>
> On 2012-04-26 03:38, Mikael Vidstedt wrote:
>
>
> On 2012-04-24 23:50, Igor Veresov wrote:
>
> Wasn't it the case that the _avg_pretenured value is formed incorrectly? I
> don't think it should be sampled at every allocation; I think it should
> measure the amount of data allocated in the old gen between young
> collections. Also, if you remember, we agreed that the dimensions are wrong.
> Or were you able to find a better explanation as opposed to what we
> discussed before?
>
>
> Thanks for reminding me - I believe you're absolutely right. For a while I
> was thinking it actually did make sense after all, but I just did an
> experiment to see what actually happens at runtime:
>
> The _avg_pretenured counter is indeed sampled every time an allocation is
> made directly in old gen. The actual value reflects the average size *in
> words* of the object that was allocated. Put differently, it's the average
> size of pretenured objects in words.
>
> The counter is used in PSAdaptiveSizePolicy::update_averages to calculate
> the average amount of promoted data:
>
> avg_promoted()->sample(promoted + _avg_pretenured->padded_average());
>
> The promoted parameter is the number of *bytes* that were just promoted.
> To sum it up, that appears to imply that there are two problems with the
> above computation:
>
> 1. It takes the size of everything that was just promoted and adds the
> size of a single average pretenured object (as opposed to the total size
> of all recently pretenured objects)
> 2. It's adding variables with different dimensions - promoted is in bytes,
> and the _avg_pretenured padded average is in words
>
> Since that effectively means the _avg_pretenured term is off by a factor
> of (word_size * number_of_objects), I'm guessing that in the common case
> it will not really matter much to the calculation of avg_promoted...
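>
> For illustration only, here is a rough sketch of what a dimensionally
> consistent version could look like (the accumulator name and the reset
> point below are hypothetical, not the current hotspot code; HeapWordSize
> is the usual bytes-per-word constant):
>
>   // Total size of objects allocated directly in old gen since the last
>   // young collection, converted from words to bytes up front.
>   size_t _pretenured_bytes_since_gc = 0;
>
>   void sample_pretenured_allocation(size_t word_size) {
>     _pretenured_bytes_since_gc += word_size * HeapWordSize;
>   }
>
>   void update_averages(size_t promoted_bytes) {
>     // Both terms are now in bytes, and the pretenured term covers all
>     // objects pretenured since the last collection rather than the size
>     // of a single average object.
>     avg_promoted()->sample(promoted_bytes + _pretenured_bytes_since_gc);
>     _pretenured_bytes_since_gc = 0;
>   }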
>
>
>
>  As for the treatment of the symptom, the code looks good to me. Do you
> think it might be beneficial to check the old value of _is_old before
> assigning to it? It would cause less memory traffic if increment_count() is
> called frequently.
>
>   void increment_count() {
>     _sample_count++;
>     if (!_is_old && _sample_count > OLD_THRESHOLD) {
>       _is_old = true;
>     }
>   }
>
>
> Good suggestion. I'll update my change with this in case we need something
> urgently, but given the above issues it's likely a good idea to take
> another pass at this.
>
> Thanks,
> Mikael
>
>
>  igor
>
>  On Apr 24, 2012, at 8:01 AM, Mikael Vidstedt wrote:
>
>
> Hi all,
>
> The statistical counters in gcUtil are used to keep track of historical
> information about various key metrics in the garbage collectors. Built into
> the core AdaptiveWeightedAverage base class is the concept of aging the
> values, essentially treating the first 100 values differently and putting
> more weight on them since there's not yet enough historical data built up.
>
> In the class there is a 32-bit counter (_sample_count) that is incremented
> for every sample and used to scale the weight of the added value
> (see compute_adaptive_average); the scaling logic divides 100 by the
> count. In the normal case this is not a problem - the counters are reset
> every once in a while and/or grow very slowly. In some pathological cases,
> however, the counter will keep incrementing and eventually
> overflow/wrap, meaning the 32-bit count will go back to zero and the
> division in compute_adaptive_average will crash with a division by zero.
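>
> To make the failure mode concrete, the scaling in compute_adaptive_average
> looks roughly like this (a paraphrase, not a verbatim copy of gcUtil;
> OLD_THRESHOLD is 100):
>
>   // Boost the weight of early samples: effectively 100/1, 100/2, ...
>   // until OLD_THRESHOLD samples have been seen.
>   unsigned count_weight    = OLD_THRESHOLD / count();  // div-by-zero once the
>                                                        // 32-bit count wraps to 0
>   unsigned adaptive_weight = MAX2(weight(), count_weight);
>   float    new_avg         = exp_avg(average(), new_sample, adaptive_weight);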
>
> The test case where this is observed is a test that stress tests
> allocation in combination with the GC locker. Specifically, the test is
> a multi-threaded test that pounds on java.util.zip.Deflater.deflate, which
> internally uses the GetPrimitiveArrayCritical JNI function to temporarily
> lock out the GC (using the GC locker). The garbage collector used in
> this case is the parallel scavenger, and the counter that overflows is
> _avg_pretenured. _avg_pretenured is incremented/sampled every time an
> allocation is made directly in the old gen, which I believe is more likely
> when the GC locker is active.
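>
> For reference, the pattern Deflater ends up relying on internally looks
> roughly like the schematic JNI example below (not the actual Deflater
> sources); while a thread sits between the Get and Release calls the GC
> locker keeps collections from starting:
>
>   #include <jni.h>
>
>   extern "C" JNIEXPORT void JNICALL
>   Java_Example_fill(JNIEnv* env, jobject, jbyteArray buf) {
>     // Entering the critical section engages the GC locker: no GC can run
>     // until the matching release below.
>     jbyte* p = static_cast<jbyte*>(env->GetPrimitiveArrayCritical(buf, NULL));
>     if (p != NULL) {
>       // ... work on the array contents directly; keep this section short
>       // and avoid other JNI calls in between ...
>       env->ReleasePrimitiveArrayCritical(buf, p, 0);
>     }
>   }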
>
>
> The suggested fix is to only perform the division in
> compute_adaptive_average when it is relevant, which currently is for the
> first 100 values. Once there are more than 100 samples there is no longer a
> need to scale the weight.
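>
> In code the idea is roughly the following (a sketch only, reusing the
> _is_old flag from the increment_count snippet quoted above; accessor names
> approximate - see the webrev for the actual change):
>
>   unsigned count_weight = 0;
>   // Only scale the weight while the average is still "young"; once
>   // OLD_THRESHOLD samples have been seen the regular weight is used and
>   // no division is performed at all, so a wrapped count cannot hurt us.
>   if (!is_old()) {
>     count_weight = OLD_THRESHOLD / count();
>   }
>   unsigned adaptive_weight = MAX2(weight(), count_weight);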
>
> This problem is tracked in 7158457 (stress: jdk7 u4 core dumps during
> megacart stress test run).
>
> Please review and comment on the webrev below:
>
> http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>
> Thanks,
> Mikael
>
>
>
>
>

