RFR (M): 8136681: Factor out IHOP calculation from G1CollectorPolicy

Tue Nov 17 19:14:52 UTC 2015

On 11/17/2015 12:14 AM, Thomas Schatzl wrote:
> Hi,
>
> On Mon, 2015-11-16 at 11:02 -0800, Jon Masamitsu wrote:
>> On 11/16/2015 05:31 AM, Thomas Schatzl wrote:
>>> Hi Jon,
>>>
>>>     thanks a lot for all these reminders for better documentation. I have
>>> been working too long on this functionality so that "everything is
>>> clear" to me :)
>>>
>>> New webrevs with hopefully more complete explanations at:
>>> http://cr.openjdk.java.net/~tschatzl/8136681/webrev.1_to_2/
>>> (incremental)
>>> http://cr.openjdk.java.net/~tschatzl/8136681/webrev.2/ (changes)
>>>
>>>
>>> On Fri, 2015-11-13 at 07:58 -0800, Jon Masamitsu wrote:
>>>> Thomas,
>>>>
>>>> This is partial.  If you send out a second webrev based on Mikael's
>>>> review, I'll finish with that.
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~tschatzl/8136681/webrev/src/share/vm/gc/g1/g1CollectedHeap.hpp.frames.html
>>>>
>>>>> 1370 // Returns the number of regions the humongous object of the
>>>>> given word size
>>>>> 1371 // covers.
>>>> "covers" is not quite right since to me it says that the humongous
>>>> object completely uses the
>>>> region.  I'd use "requires".
>>> Fixed.
>>>
>>>> http://cr.openjdk.java.net/~tschatzl/8136681/webrev/src/share/vm/gc/g1/g1IHOPControl.hpp.html
>>>>
>>>>      49   // Update information about recent time during which allocations happened,
>>>>      50   // how many allocations happened and an additional safety buffer.
>>>>
>>>> // Update information about
>>>>
>>>> I DON'T KNOW WHICH OF THESE IS MORE PRECISE.
>>>> //   Time during which allocations occurred (sum of mutator execution time + GC pause times)
>>>> OR
>>>> //   Concurrent marking time (concurrent mark end - concurrent mark start)
>>>>
>>>> //   Allocations in bytes during that time
>>>> //   Safety buffer ???
>>>>
>>>> I couldn't figure out what the safety buffer is supposed to be.  It seems to
>>>> be the young gen size but don't know why.
>>> I tried to explain it in the text. In short, in G1 the IHOP value is
>>> based on old gen occupancy only. The problem is that the young gen also
>>> needs to be allocated somewhere too.
>>>
>>> Now you could just say, use the maximum young gen size. However this is
>>> 60% of the heap by default... so the adaptive IHOP algorithm uses a
>>> measure of the young gen that is not bounded by G1ReservePercent.
>>>
>>> The reason to use the unbounded value is because if the code used the
>>> bounded one, it would cancel out with G1ReservePercent, because the
>>> closer we get to G1ReservePercent, the smaller that bounded value would
>>> get, which would make the current IHOP value rise etc, which would delay
>>> the initiation of the marking.
>>>
>>> That would end up losing throughput as then the young gen gets smaller
>>> and smaller (and GC frequency increases), it can take a long time until
>>> G1 gets close enough to G1ReservePercent so that the other factors
>>> (allocation rate, marking time) are used.
>>>
>>> Basically initial mark will be delayed until young gen reaches its
>>> minimum size, at which time G1 will continue to use that young gen size
>>> until marking occurs. Which means that typically G1 will eat into
>>> G1ReservePercent, which we also do not want.
>>>
>>> Additionally it would get G1 more in trouble in regards to pause time,
>>> giving it less room during that time.
>> I think I understand the issue of using an unbounded young gen but
>> what precisely is meant by "measure of the young gen"?  By measure
>> do you mean you used the size of young gen from recent young-only
>> collections?
> As in "measurement". Yes, this uses the size of the young gen from a
> recent young-only collection when the decision whether to start marking
> or not occurs.
> Which is supposed to be on the large side compared to the ones following
> during marking.
>
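
For readers following along, the threshold calculation described above can
be sketched roughly as follows. This is a minimal illustration, not the code
from the webrev: the function name, the parameter names and the use of bytes
are all assumptions. It only combines the three inputs named in this thread
(predicted marking time, predicted allocation rate, and the young gen size
measured without the G1ReservePercent bound).

```cpp
#include <cstddef>

// Hypothetical sketch: all names here are illustrative, not HotSpot identifiers.
// target_occupancy_bytes: old gen occupancy that must not be exceeded
//   (assumed to already account for G1ReservePercent).
// young_buffer_bytes: young gen size from a recent young-only collection,
//   measured without bounding it by G1ReservePercent.
std::size_t predicted_ihop_threshold(std::size_t target_occupancy_bytes,
                                     double predicted_marking_time_s,
                                     double predicted_alloc_rate_bytes_per_s,
                                     std::size_t young_buffer_bytes) {
  // Old gen allocation expected while concurrent marking is running.
  std::size_t allocated_during_marking =
      static_cast<std::size_t>(predicted_marking_time_s *
                               predicted_alloc_rate_bytes_per_s);
  // Space that must still be free when marking starts.
  std::size_t needed = allocated_during_marking + young_buffer_bytes;
  // Start marking immediately (threshold 0) if the need exceeds the target.
  return needed < target_occupancy_bytes ? target_occupancy_bytes - needed : 0;
}
```

With a bounded young gen size as the buffer, `needed` would shrink as the
heap fills up, so the threshold would keep rising; that is the cancelling
effect described above.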
>>> Unfortunately G1 is in two minds about this, i.e. used() for humongous
>>> objects does not contain the "waste" at the end of the last region, but
>>> used() for regular regions does.
>> So occupancy is not useful and you use free regions + free in the current
>> old allocation regions?
> I was referring to the fact that G1's used() for humongous regions seems to
> return an incorrect value.
>
> That's why when adding the allocation information for humongous regions
> (around line 1018 in g1CollectedHeap.cpp) the change first calculates
> the size in regions and multiplies it by full region sizes, instead of
> using something like used().
>
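
The region-based accounting for humongous objects mentioned above amounts to
rounding the object size up to whole regions. A small sketch, assuming sizes
in bytes and hypothetical helper names (the actual change works on word
sizes inside g1CollectedHeap.cpp and is not reproduced here):

```cpp
#include <cstddef>

// Hypothetical helpers, not HotSpot functions.
// Number of regions a humongous object of the given size requires.
constexpr std::size_t regions_required(std::size_t object_bytes,
                                       std::size_t region_bytes) {
  // Round up: a partially used last region is still fully taken.
  return (object_bytes + region_bytes - 1) / region_bytes;
}

// Allocation charged for the object: full regions, including the unused
// "waste" at the end of the last region.
constexpr std::size_t humongous_footprint_bytes(std::size_t object_bytes,
                                                std::size_t region_bytes) {
  return regions_required(object_bytes, region_bytes) * region_bytes;
}
```

For example, with 1 MB regions a 1.5 MB object would be charged as 2 MB
rather than 1.5 MB.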
> [...]
>>>> Have you given much thought to what effect a to-space exhaustion should
>>>> have on
>>>> IHOP?  I understand that it is not the design center for this but I
>>>> think that to-space
>>>> exhaustion can confuse the statistics.   Maybe a reset of the statistics
>>>> and dropping
>>>> the IHOP to a small value (current heap occupancy or even 0) until you
>>>> get 3 successful
>>>> marking cycles.  Or think about it later.
>>> I already thought a little about how this should interact with regards
>>> to the calculation: the idea is that basically the algorithm will notice
>>> that there is a significant amount of additional allocation, and will
>>> lower the IHOP threshold automatically. (Looking through the code I
>>> think I saw some problems here; I will look into fixing that.)
>> The increased allocation would be the allocation from compacting
>> all the live data into regions during the full GC.  That certainly should
> No, from converting survivor/eden regions into old regions. That will
> result in a huge bump in allocation rate (if the evac failure has been
> serious, i.e. a lot of failed evacuations). Which means that in the
> future, the IHOP will be lower.

Ah, yes.  Evacuation failure does not necessarily mean a full GC.

>
> If the evacuation failure has not been so serious, the impact is
> certainly smaller.
>
> The impact of evacuation failures is already much smaller than before,
> and there are a few more things that could be done to make it even
> smaller.
>
> The only drawback is that potentially this decrease in IHOP is too small
> to avoid the next evac failure/full gc. None of our predictions can
> handle long-term cyclic occurrences (like once a day there is a
> significant brief 30s spike in promotion rate), so I do not see that as
> a particular issue.

I've mentioned before that I would consider a policy that started
the next marking cycle immediately.   It's a simple policy (no
confusion about whether the decrease in IHOP was enough) and
provides the best effort to avoid an undesirable situation.   I won't
bug you with that again. :-)

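To make the simple policy concrete, here is a rough sketch, assuming a
hypothetical wrapper class around whatever threshold the adaptive algorithm
computes. None of these names exist in the patch; the count of required
successful marking cycles is a constructor parameter (3 was the number
floated earlier in this thread).

```cpp
#include <cstddef>

// Hypothetical policy wrapper, not part of the reviewed change.
class FailureAwareThreshold {
 public:
  explicit FailureAwareThreshold(unsigned required_successes)
      : _required(required_successes), _successes(required_successes) {}

  // An evacuation failure (to-space exhaustion) discards accumulated trust.
  void record_evacuation_failure() { _successes = 0; }

  // Called when a marking cycle completes without an evacuation failure.
  void record_successful_marking_cycle() {
    if (_successes < _required) {
      ++_successes;
    }
  }

  // Returns 0 (start marking immediately) until enough marking cycles have
  // succeeded; otherwise passes the adaptive threshold through unchanged.
  std::size_t threshold(std::size_t adaptive_threshold_bytes) const {
    return _successes < _required ? 0 : adaptive_threshold_bytes;
  }

 private:
  unsigned _required;
  unsigned _successes;
};
```

The appeal is that there is nothing to tune: after a failure the next
marking cycle starts at once, regardless of what the statistics say.
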
>
> That's something that the user needs to tune out at this time (CMS would
> not be able to handle this either).

Agree.

Patch looks good.  Reviewed.

Jon

>
>> make the IHOP drop but that seems rather weakly related to the
>> actual IHOP needed to avoid promotion failure.   It's hard for me
>> to see how that is going to scale (i.e., it seems complicated to
>> use something like the live data size as input to the modeling
>> of IHOP). I'd start with something really simple but if you're
>> comfortable with it, that's fine.
>>
>>> If evacuation failure happens in a gc during marking, there are a few
>>> options, I have not decided on what's best:
>>>
>>> - do nothing because the user asked to run an impossible to manage
>>> workload (combination of live set, allocation rate, and other options).
>>>
>>>     - there is already some log output from which that information can
>>> be derived.
>>>
>>> - allow the user to set a maximum IHOP threshold. He could base this
>>> value on the log messages he gets.
>>>     - the user can already do that by increasing G1ReservePercent btw
>>>
>>> - make sure that marking completes in time
>>>
>>>     - let the mutator threads (or during young gcs while marking is
>>> running) do some marking if we notice that we do not have enough time.
>>> Not sure if it is worth the effort.
>>>
>>>     - fix some bugs in marking :) that prevent that in extraordinary
>>> conditions.
> - another option would be to just start doing mixed gcs as G1 is able to
> do that without any completed marking (even during marking), hoping that
> this will yield enough space.
>
>>>     - make sure that we always start marking early enough by making sure
>>> that mixed gc reclaims enough memory. Planning some work here as part of
>>> generic work on improving gc policy.
>> I like this one above.
>>
>> Thanks for the extra explanations.
> Thanks for the discussion.
>
> Thomas
>