Request for review: 8008546: G1: Wrong G1CONFIDENCEPERCENT results in GUARANTEE(VARIANCE() > -1.0) FAILED
Bengt Rutisson
bengt.rutisson at oracle.com
Fri Feb 22 07:16:11 UTC 2013
John, Vladimir,
Thanks for the very detailed analysis of the issue! :)
Vladimir, I agree with John. Your change is fine, so consider it
reviewed and go ahead and push it.
John, I really like how well you described what is going on here.
Unfortunately I think I still don't have enough knowledge to answer the
question of whether we should just make sure to never get a negative
target step time or if we should avoid adding large step times all
together. For me it is not too important to sort this out right now, but
it might be good to create a bug and add your excellent analysis. What
do you think?
Thanks,
Bengt
On 2/21/13 10:43 PM, John Cuthbertson wrote:
> Hi Bengt, Vladimir,
>
> Bumping up the value of G1ConfidencePercent eventually caused the
> guarantee to fail. I believe I know what's going on here now.
>
> The following is tracing and instrumentation from
> CMTask::do_marking_step(), from the marking cycle where it starts to
> go wrong...
>
> 3.788: [GC concurrent-mark-start]
> [0] >>>>>>>>>> START, call = 1,
> _num = 5, _sum = 1.180, _sum_of_squares = 0.432
> _davg = 0.275, _dvariance = 0.023, _alpha = 0.700
> _length = 10, _next = 5
>
> [0]= 0.500 [1]= 0.137 [2]= 0.126 [3]= 0.036 [4]=
> 0.382
> [5]= 0.000 [6]= 0.000 [7]= 0.000 [8]= 0.000 [9]=
> 0.000
> davg: 0.2747654, sig: 3.0000000, dsd: 0.1517040, first: 0.7298773
> davg: 0.2747654, num: 5, cf: 1.0000000, second: 0.2747654
> Result: 0.7298773
> [0] >>>>>>>>>> START, call = 1, target = 9.27ms >>>>>>>>>>
> [0] Adding 0.1885053 to counter
> [0] <<<<<<<<<< ABORTING, target = 9.27ms, elapsed = 9.46ms <<<<<<<<<<
>
> [0] >>>>>>>>>> START, call = 2,
> _num = 6, _sum = 1.369, _sum_of_squares = 0.467
> _davg = 0.249, _dvariance = 0.017, _alpha = 0.700
> _length = 10, _next = 6
>
> [0]= 0.500 [1]= 0.137 [2]= 0.126 [3]= 0.036 [4]=
> 0.382
> [5]= 0.189 [6]= 0.000 [7]= 0.000 [8]= 0.000 [9]=
> 0.000
> davg: 0.2488873, sig: 3.0000000, dsd: 0.1311627, first: 0.6423756
> davg: 0.2488873, num: 6, cf: 1.0000000, second: 0.2488873
> Result: 0.6423756
> [0] >>>>>>>>>> START, call = 2, target = 9.36ms >>>>>>>>>>
> [0] Adding 0.2263276 to counter
> [0] <<<<<<<<<< ABORTING, target = 9.36ms, elapsed = 9.58ms <<<<<<<<<<
>
> [0] >>>>>>>>>> START, call = 3,
> _num = 7, _sum = 1.595, _sum_of_squares = 0.519
> _davg = 0.242, _dvariance = 0.012, _alpha = 0.700
> _length = 10, _next = 7
>
> [0]= 0.500 [1]= 0.137 [2]= 0.126 [3]= 0.036 [4]=
> 0.382
> [5]= 0.189 [6]= 0.226 [7]= 0.000 [8]= 0.000 [9]=
> 0.000
> davg: 0.2421194, sig: 3.0000000, dsd: 0.1100790, first: 0.5723563
> davg: 0.2421194, num: 7, cf: 1.0000000, second: 0.2421194
> Result: 0.5723563
> [0] >>>>>>>>>> START, call = 3, target = 9.43ms >>>>>>>>>>
> [0] Adding 16.8998093 to counter
> [0] <<<<<<<<<< ABORTING, target = 9.43ms, elapsed = 26.33ms <<<<<<<<<<
>
> The step time difference prediction we obtain for call #1 was 0.73ms,
> resulting in a target step time of 9.27ms. The marking step aborts
> after 9.46ms and adds 0.189ms to the number sequence.
>
> This causes the step time difference prediction we obtain for call #2
> to be calculated as 0.642ms; target step time of 9.36ms. The marking
> step aborts after an elapsed 9.58ms. The difference (9.58 - 9.36)
> 0.226ms is added to the number sequence.
>
> This causes the step time difference prediction we obtain for call #3
> to be calculated 0.572ms; the target step time of 9.43. The marking
> step aborts after an elapsed time of 26.33ms. We add the difference
> (26.33 - 9.43) 16.90 to the number sequence. This is where it gets
> interesting. :)
>
> In call 4:
>
> [0] >>>>>>>>>> START, call = 4,
> _num = 8, _sum = 18.495, _sum_of_squares = 286.122
> _davg = 5.239, _dvariance = 40.798, _alpha = 0.700
> _length = 10, _next = 8
>
> [0]= 0.500 [1]= 0.137 [2]= 0.126 [3]= 0.036 [4]=
> 0.382
> [5]= 0.189 [6]= 0.226 [7]= 16.900 [8]= 0.000 [9]=
> 0.000
> davg: 5.2394264, sig: 3.0000000, dsd: 6.3873188, first: 24.4013827
> davg: 5.2394264, num: 8, cf: 1.0000000, second: 5.2394264
> Result: 24.4013827
> [0] >>>>>>>>>> START, call = 4, target = -14.40ms >>>>>>>>>>
> [0] Adding 14.5933617 to counter
> [0] <<<<<<<<<< ABORTING, target = -14.40ms, elapsed = 0.19ms <<<<<<<<<<
>
> the predicted step time difference (because of the large value added
> to the number sequence the previous call) is calculated as 24.40ms.
> This gives a NEGATIVE target step time of -14.40ms. The marking step
> aborts after an extremely short 0.19ms. We add the difference (0.19 -
> (-14.40)) 14.59 to the number sequence.
>
> From this point on the predicted step time difference gets larger and
> larger until we start seeing infinities for dsd, sum of squares, etc.
> This gives a negative infinity as the time step target and we attempt
> to add infinity into the number sequence which results in a variance
> value of infinity.
>
> Why was the elapsed time for call #3 much longer than the target step
> time? My guess is that the marking step claimed a large enough object
> that it just took that amount of time to scan it. (Note this was with
> debug binary).
>
> Why does a higher G1ConfidencePercent value cause the problem? The
> confidence percent is used to calculate the value of
> G1CollectorPolicy::sigma() which is used in the expression that
> predicts a new value from a given number sequence:
>
> return MAX2(seq->davg() + sigma() * seq->dsd(),
> seq->davg() * confidence_factor(seq->num()));
>
> it's a multiplier of the decaying standard deviation. So a value
> larger than 100 will result in larger and larger values in the LHS of
> the MAX2.
>
> Vladimir: your fix is good and a valid fix for this issue but I wonder
> if we should also be preventing the target step time from becoming
> negative, e.g.:
>
> [0] >>>>>>>>>> START, call = 17,
> _num = 10, _sum = 1.816, _sum_of_squares = 0.554
> _davg = 0.177, _dvariance = 0.009, _alpha = 0.700
> _length = 10, _next = 7
>
> [0]= 0.057 [1]= 0.541 [2]= 0.245 [3]= 0.084 [4]=
> 0.248
> [5]= 0.231 [6]= 0.085 [7]= 0.030 [8]= 0.035 [9]=
> 0.258
> davg: 0.1774224, sig: 3.0000000, dsd: 0.0971153, first: 0.4687682
> davg: 0.1774224, num: 10, cf: 1.0000000, second: 0.1774224
> Result: 0.4687682
> [0] >>>>>>>>>> START, call = 17, target = 9.53ms >>>>>>>>>>
> [0] Adding 20.1697202 to counter
> [0] <<<<<<<<<< ABORTING, target = 9.53ms, elapsed = 29.70ms <<<<<<<<<<
>
> [0] >>>>>>>>>> START, call = 18,
> _num = 10, _sum = 21.955, _sum_of_squares = 407.371
> _davg = 6.175, _dvariance = 58.761, _alpha = 0.700
> _length = 10, _next = 8
>
> [0]= 0.057 [1]= 0.541 [2]= 0.245 [3]= 0.084 [4]=
> 0.248
> [5]= 0.231 [6]= 0.085 [7]= 20.170 [8]= 0.035 [9]=
> 0.258
> davg: 6.1751117, sig: 3.0000000, dsd: 7.6655934, first: 29.1718919
> davg: 6.1751117, num: 10, cf: 1.0000000, second: 6.1751117
> Result: 29.1718919
> [0] >>>>>>>>>> START, call = 18, target = 0.00ms >>>>>>>>>>
> [0] Adding 0.1926280 to counter
> [0] <<<<<<<<<< ABORTING, target = 0.00ms, elapsed = 0.19ms <<<<<<<<<<
>
> This should prevent the values in the sequence diverging and reading
> Inf. Or perhaps we shouldn't even add large step time differences to
> the number seq. Whatever we decide to do (if anything) should be done
> as separate CR.
>
> Regards,
>
> JohnC
>
> On 2/21/2013 10:50 AM, John Cuthbertson wrote:
>> Hi Bengt,
>>
>> Thanks for the stack trace. It helps. The counter is obviously
>> _marking_step_diffs_ms and the confidence value feeds into this via
>> G1CollectorPolicy::get_new_prediction(). I ran jbb2005 with the flags
>> you suggest and still no failure. I'm still a little confused though.
>> If we look at TruncatedSeq:add() and AbsSeq::variance():
>>
>> void TruncatedSeq::add(double val) {
>> AbsSeq::add(val);
>>
>> // get the oldest value in the sequence...
>> double old_val = _sequence[_next];
>> // ...remove it from the sum and sum of squares
>> _sum -= old_val;
>> _sum_of_squares -= old_val * old_val;
>>
>> // ...and update them with the new value
>> _sum += val;
>> _sum_of_squares += val * val;
>>
>> // now replace the old value with the new one
>> _sequence[_next] = val;
>> _next = (_next + 1) % _length;
>>
>> // only increase it if the buffer is not full
>> if (_num < _length)
>> ++_num;
>>
>> guarantee( variance() > -1.0, "variance should be >= 0" );
>> }
>>
>> the guarantee can only trip if the result of variance() is -1.0 or
>> less. Correct? What value is being returned by variance()?
>>
>> Now look at AbsSeq::variance():
>>
>> double AbsSeq::variance() const {
>> if (_num <= 1)
>> return 0.0;
>>
>> double x_bar = avg();
>> double result = _sum_of_squares / total() - x_bar * x_bar;
>> if (result < 0.0) {
>> // due to loss-of-precision errors, the variance might be negative
>> // by a small bit
>>
>> // guarantee(-0.1 < result && result < 0.0,
>> // "if variance is negative, it should be very small");
>> result = 0.0;
>> }
>> return result;
>> }
>>
>> How can this return a result -1.0 or less? If result is negative then
>> the return value should be 0.0 or a small magnitude negative; not
>> -1.0 or less. If the value of result is close enough to 0.0 to make
>> the "result < 0.0" evaluate to false, it shouldn't be able to result
>> in "variance() > -1.0" evaluate to false.
>>
>> Vladimir: I still think your change is good but I'm not sure it's a
>> true fix for the problem.
>>
>> Regards,
>>
>> JohnC
>>
>> On 2/21/2013 1:56 AM, Bengt Rutisson wrote:
>>>
>>> John,
>>>
>>> The bug report lists a guarantee in nuberSeq.cpp. You need a
>>> concurrent cycle for the issue to reproduce.
>>>
>>> I can repro it with this command line using SPECjbb2005:
>>>
>>> java -XX:+UseG1GC -XX:G1ConfidencePercent=300 -Xmx256m
>>> -XX:InitiatingHeapOccupancyPercent=1 -XX:+PrintGC -cp
>>> jbb.jar:check.jar spec.jbb.JBBmain
>>>
>>> Here is the stack trace from the hs_err file i get:
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> # Internal Error
>>> (/Users/brutisso/repos/hs-gc-test-deprecated/src/share/vm/utilities/numberSeq.cpp:166),
>>> pid=34072, tid=16131
>>> # guarantee(variance() > -1.0) failed: variance should be >= 0
>>> #
>>> # JRE version: Java(TM) SE Runtime Environment (8.0-b68) (build
>>> 1.8.0-ea-b68)
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b15-internal-jvmg
>>> mixed mode bsd-amd64 compressed oops)
>>> # Failed to write core dump. Core dumps have been disabled. To
>>> enable core dumping, try "ulimit -c unlimited" before starting Java
>>> again
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>>
>>> --------------- T H R E A D ---------------
>>>
>>> Current thread (0x00007feda9038000): ConcurrentGCThread [stack:
>>> 0x0000000129587000,0x0000000129687000] [id=16131]
>>>
>>> Stack: [0x0000000129587000,0x0000000129687000],
>>> sp=0x0000000129686190, free space=1020k
>>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code,
>>> C=native code)
>>> V [libjvm.dylib+0xd48e80] VMError::report(outputStream*)+0x1262
>>> V [libjvm.dylib+0xd4a57b] VMError::report_and_die()+0x88f
>>> V [libjvm.dylib+0x5328a7] report_vm_error(char const*, int, char
>>> const*, char const*)+0xc7
>>> V [libjvm.dylib+0xac6dd2] TruncatedSeq::add(double)+0x170
>>> V [libjvm.dylib+0x4d389a] CMTask::do_marking_step(double, bool,
>>> bool)+0x15ba
>>> V [libjvm.dylib+0x4d9d5a] CMConcurrentMarkingTask::work(unsigned
>>> int)+0x192
>>> V [libjvm.dylib+0xda8c8e] GangWorker::loop()+0x666
>>> V [libjvm.dylib+0xda7fc0] GangWorker::run()+0x3a
>>> V [libjvm.dylib+0xafef34] java_start(Thread*)+0x1d4
>>> C [libsystem_c.dylib+0x14742] _pthread_start+0x147
>>> C [libsystem_c.dylib+0x1181] thread_start+0xd
>>>
>>>
>>> Bengt
>>>
>>>
>>> On 2/20/13 9:13 PM, John Cuthbertson wrote:
>>>> Hi Vladimir,
>>>>
>>>> The change looks good to me. But I have a question: which counter
>>>> was tripping the guarantee? I just ran jvm98 with a confidence
>>>> percent value of 200 and 300 and the guarantee didn't fire for me.
>>>>
>>>> I kind of agree with Bengt that moving this kind of error checking
>>>> earlier would be better. But that can be done as a separate CR.
>>>>
>>>> Thanks,
>>>>
>>>> JohnC
>>>>
>>>> On 2/20/2013 7:27 AM, vladimir kempik wrote:
>>>>> Hi Bengt,
>>>>>
>>>>> Thanks for looking at this!
>>>>>
>>>>> Here is an updated webrev based on your feedback:
>>>>>
>>>>> http://cr.openjdk.java.net/~mcherkas/vladimir/8008546/webrev.01/
>>>>>
>>>>> I applied what you suggested.
>>>>>
>>>>> Thanks,
>>>>> Vladimir.
>>>>>
>>>>> On 20.02.2013 17:54, Bengt Rutisson wrote:
>>>>>>
>>>>>> Hi Vladimir,
>>>>>>
>>>>>> This looks very similar to how we treat G1ReservePercent, so I
>>>>>> think it looks good. An alternative would have been to check this
>>>>>> earlier in the initialization phase and update the flag
>>>>>> G1ConfidencePercent so that PrintFlagsFinal would have printed
>>>>>> the actual value. But for consistency I think it looks good this
>>>>>> way.
>>>>>>
>>>>>> I think you can change G1ConfidencePercent to be an uintx instead
>>>>>> of intx (in g1_globals.hpp). In that case you don't need the
>>>>>> second if statment since it can't be less than 0. It is also more
>>>>>> consistent with G1ReservePercent which is an uintx.
>>>>>>
>>>>>> Thanks,
>>>>>> Bengt
>>>>>>
>>>>>> On 2/20/13 2:31 PM, vladimir kempik wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Could I have a couple of reviews for this change?
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~mcherkas/vladimir/8008546/webrev.00/
>>>>>>>
>>>>>>> Input value for G1CONFIDENCEPERCENT wasn't checked before using.
>>>>>>> This results in crash sometimes if -XX:+UseG1GC
>>>>>>> -XX:G1ConfidencePercent=200 flags are used. Now checking the
>>>>>>> value same way as it was done for G1ReservePercent. Increase to
>>>>>>> 0 if negative, decrease to 100 if more than 100.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
More information about the hotspot-gc-dev
mailing list