RFR(S): 7158457: division by zero in adaptiveweightedaverage
Mikael Vidstedt
mikael.vidstedt at oracle.com
Fri May 11 19:13:05 UTC 2012
Final(?) version, only difference is the removed assert in sample():
http://cr.openjdk.java.net/~mikael/7158457/webrev.02/
I'm also looking for a sponsor for this fix - any takers?
Thanks,
Mikael
On 2012-05-05 05:57, Srinivas Ramakrishna wrote:
> Mikael -- I agree that this fix would be good to get in, at least to
> address the immediate crash.
>
> reviewed.
> -- ramki
>
> On Fri, May 4, 2012 at 3:03 AM, Mikael Vidstedt
> <mikael.vidstedt at oracle.com <mailto:mikael.vidstedt at oracle.com>> wrote:
>
>
> Update:
>
> I'd like to continue working on this issue, and as part of that
> look at how the promoted and pretenured counters are used today
> and potentially update the logic given the insights around bytes
> vs. words etc.
>
> I think it would be helpful to address the crash problem short
> term though, and so I'd like to get feedback on the updated webrev:
>
> http://cr.openjdk.java.net/~mikael/7158457/webrev.01/
> <http://cr.openjdk.java.net/%7Emikael/7158457/webrev.01/>
>
> It's essentially the same as the previous one with Igor's
> suggested _is_old check improvement. Would it be ok to make this
> fix and clone the bug to track the follow-up work cleaning up the
> actual logic/heuristics?
>
> Thanks,
> Mikael
>
>
> On 2012-04-26 03:38, Mikael Vidstedt wrote:
>>
>> On 2012-04-24 23:50, Igor Veresov wrote:
>>> Wasn't it the case that _avg_pretenured value is formed
>>> incorrectly? I don't think it should be sampled at every
>>> allocation, I think it should measure the amount of data
>>> allocated in the old gen between young collections, also if you
>>> remember we agreed that the dimensions are wrong. Or were you
>>> able to find a better explanation as opposed to what we
>>> discussed before?
>>
>> Thanks for reminding me - I believe you're absolutely right. For
>> a while I was thinking it actually did make sense after all, but
>> I just did an experiment to see what actually happens at runtime:
>>
>> The _avg_pretenured counter is indeed sampled every time an
>> allocation is made directly in old gen. The actual value reflects
>> the average size *in words* of the object that was allocated. Put
>> differently, it's the average size of pretenured objects in words.
>>
>> The counter is used in PSAdaptiveSizePolicy::update_averages to
>> calculate the average amount of promoted data:
>>
>> avg_promoted()->sample(promoted + _avg_pretenured->padded_average());
>>
>> The promoted parameter is the number of *bytes* that were just
>> promoted. To sum it up, that appears to imply that there are two
>> problems with the above computation:
>>
>> 1. It's taking the size of everything that was just promoted and
>> adds the size of an average prenured object (as opposed to the
>> total size of all recently pretenured objects)
>> 2. It's adding variables with different dimensions - promoted is
>> in bytes, and the _avg_pretenured padded average is in words
>>
>> Since that effectively means _avg_pretenured it's off by a factor
>> (word_size * number_of_objects) I'm guessing it will in the
>> common case not really matter to the calculation of avg_promoted...
>>
>>
>>>
>>> As for the treatment of the symptom, the code looks good to me.
>>> Do you think it might be beneficial to check the old value of
>>> _is_old before assigning to it? Would cause less memory traffic,
>>> if increment_count() is called frequently.
>>> 60 void increment_count() {
>>> 61 _sample_count++;
>>> 62 if (!_is_old&& _sample_count> OLD_THRESHOLD) {
>>> 63 _is_old = true;
>>> 64 }
>>> 65 }
>>
>> Good suggestion. I'll update my change with this in case we need
>> something urgently, but given the above issues it's likely a good
>> idea to take another pass at this.
>>
>> Thanks,
>> Mikael
>>
>>>
>>> igor
>>>
>>> On Apr 24, 2012, at 8:01 AM, Mikael Vidstedt wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>> The statistical counters in gcUtil are used to keep track of
>>>> historical information about various key metrics in the garbage
>>>> collectors. Built in to the core AdaptiveWeightedAverage base
>>>> class is the concept of aging the values, essentially treating
>>>> the first 100 values differently and putting more weight on
>>>> them since there's not yet enough historical data built up.
>>>>
>>>> In the class there is a 32-bit counter (_sample_count) that
>>>> incremented for every sample and used to compute scale the
>>>> weight of the added value (see compute_adaptive_average), and
>>>> the scaling logic divides 100 by the count. In the normal case
>>>> this is not a problem - the counters are reset every once in a
>>>> while and/or grow very slowly. In some pathological cases the
>>>> counter will however continue to increment and eventually
>>>> overflow/wrap, meaning the 32-bit count will go back to zero
>>>> and the division in compute_adaptive_average will lead to a
>>>> div-by-zero crash.
>>>>
>>>> The test case where this is observed is a test that stress
>>>> tests allocation in combination with the GC locker.
>>>> Specifically, the test is multi-threaded which pounds on
>>>> java.util.zip.Deflater.deflate, which internally uses the
>>>> GetPrimitiveArrayCritical JNI function to temporarily lock out
>>>> the GC (using the GC locker). The garbage collector used is in
>>>> this case the parallel scavenger and the the counter that
>>>> overflows is _avg_pretenured. _avg_pretenured is
>>>> incremented/sampled every time an allocation is made directly
>>>> in the old gen, which I believe is more likely when the GC
>>>> locker is active.
>>>>
>>>>
>>>> The suggested fix is to only perform the division in
>>>> compute_adaptive_average when it is relevant, which currently
>>>> is for the first 100 values. Once there are more than 100
>>>> samples there is no longer a need to scale the weight.
>>>>
>>>> This problem is tracked in 7158457 (stress: jdk7 u4 core dumps
>>>> during megacart stress test run).
>>>>
>>>> Please review and comment on the webrev below:
>>>>
>>>> http://cr.openjdk.java.net/~mikael/7158457/webrev.00
>>>> <http://cr.openjdk.java.net/%7Emikael/7158457/webrev.00>
>>>>
>>>> Thanks,
>>>> Mikael
>>>>
>>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20120511/465a9ed3/attachment.htm>
More information about the hotspot-gc-dev
mailing list