Strange G1 behavior
Kirk Pepperdine
kirk.pepperdine at gmail.com
Tue Oct 17 09:52:15 UTC 2017
> On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>
> Hi Kirk,
>
> On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
>> Hi Thomas,
>>
>> Again, thank you for the detailed response.
>>
>>
>>> On Oct 16, 2017, at 1:32 PM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>>>
>>> For the allocation rate, please compare the slopes of heap usage
>>> after (young) gcs during these spikes (particularly in that full gc
>>> case) and normal operation.
>>
>> Censum estimates allocation rates as this is a metric that I
>> routinely evaluate.
>>
>> This log shows a spike at 10:07 which correlates with the Full GC
>> event. However the allocation rates, while high, are well within
>> values I’ve seen with many other applications that are well behaved.
>> Censum also estimates rates of promotion and those seem exceedingly
>> high at 10:07. That said, there are spikes just after 10:10 and
>> around 10:30 which don’t trigger a Full. In both cases the estimates
>> for allocation rates are high, though the estimates for rates
>> of promotion, while high, are not as high as those seen at 10:07.
>>
>> All in all, nothing here seems out of the ordinary and while I want
>> you to be right about the waste and PLAB behaviors, these spikes feel
>> artificial, i.e. I still want to blame the collector for not being
>> able to cope with some aspect of application behavior that it should
>> be able to cope with... that is, something other than a high allocation
>> rate with low recovery due to data simply being referenced and therefore
>> not eligible for collection...
>
> I always meant "promotion rate" whenever I wrote allocation rate here.
> For this discussion (and in general), in a generational collector the
> application's real allocation rate is usually not very interesting.
>
> Sorry for being imprecise.
No worries. I get that promotion rates are very important, but when deciding what needs to be tuned, allocation rates are also very important, so I try to be very precise here, maybe too precise at times.
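Just so we’re talking about the same numbers, here’s roughly how I think of the two when reading a log; the GcSample type and its fields below are invented for illustration, not anything Censum (or any log parser) actually exposes.

    // Illustrative only: the GcSample type and its field names are invented for this sketch.
    final class GcSample {
        final double timeSec;       // timestamp of the young pause
        final long youngBeforeKB;   // young gen occupancy just before the pause
        final long heapAfterKB;     // total heap occupancy just after the pause

        GcSample(double timeSec, long youngBeforeKB, long heapAfterKB) {
            this.timeSec = timeSec;
            this.youngBeforeKB = youngBeforeKB;
            this.heapAfterKB = heapAfterKB;
        }

        // Allocation rate: how quickly the application refills the young gen
        // between two consecutive young pauses.
        static double allocationRateKBps(GcSample prev, GcSample cur) {
            return cur.youngBeforeKB / (cur.timeSec - prev.timeSec);
        }

        // Promotion rate: the growth of heap-used-after-GC over the same interval.
        // (Only an approximation for young-only pauses; mixed pauses reclaim old gen too.)
        static double promotionRateKBps(GcSample prev, GcSample cur) {
            return (cur.heapAfterKB - prev.heapAfterKB) / (cur.timeSec - prev.timeSec);
        }
    }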
>
>>> In this application, given the information I have, roughly every
>>> 1500s, there seems to be some component in the application that
>>> allocates a lot of memory in a short time, and holds onto most of
>>> it for its duration.
>>
>> Sorry, but I’m not seeing this pattern in either the occupancy-after-GC or
>> allocation rate views. What I do see is a systematic loss of free
>> heap over time (slow memory leak??? Effects of caching???).
>
> Let's have a look at the heap usage after gc over time for a few
> collection cycles before that full gc event. Please look at
> http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-promo.png
> which just shows a few of these.
Sorry but I can’t see these charts.
The issue with promotion rates is that you have to ask: is this the dog wagging the tail, or the tail wagging the dog? In this case I think it’s the tail wagging the dog. Could a significant portion of the data being promoted be a direct result of to-space being resized to 0?
Sorry for being so slow, but I think I’m starting to understand the underlying pathology here. I’ll hypothesize, and you can blow the whole idea out of the water with more insightful comments.
Since tenured can grow into space that to-space is likely to also want, the preference towards tenured deflates survivor, which in turn puts more pressure on tenured, which deflates survivor even further; wash, rinse, repeat, Full GC to clean up. Since this didn’t happen with the (older) generational collectors, it seems possible that everything would be fine if a minimum to-space size could be maintained.
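To make that concrete, here’s a toy model of the feedback loop; the constants and the sizing rule are made up, so this is not G1’s actual heuristic, just the shape of the spiral I’m describing.

    // Toy model only: constants and the sizing rule are invented, not G1's heuristics.
    public class ToSpaceSqueeze {
        public static void main(String[] args) {
            double heap = 1000;          // total heap, arbitrary units
            double tenured = 400;        // data already promoted
            double survivorsPerGc = 80;  // live data that would like to age in survivor space

            for (int gc = 1; gc <= 50 && tenured < 0.9 * heap; gc++) {
                // Pretend the policy hands survivor whatever is left after tenured takes its share.
                double toSpace = Math.max(0, (heap - tenured) * 0.10);
                // Whatever does not fit in to-space is promoted instead of aging in survivor.
                double promoted = Math.max(0, survivorsPerGc - toSpace);
                tenured += promoted;
                System.out.printf("gc=%2d toSpace=%5.1f promoted=%5.1f tenured=%6.1f%n",
                                  gc, toSpace, promoted, tenured);
            }
            // tenured creeps up, to-space keeps shrinking, more is promoted each cycle:
            // wash, rinse, repeat, until only a full GC can clean up.
        }
    }

In these toy numbers the loop only diverges once tenured is past a break-even point; below it things settle down, which would fit the feeling that this only bites after a burst.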
>
> I added rough linear interpolations of the heap usage after gc (so that
> the slope of these lines corresponds to the promotion rates). I can see
> a large, significant difference in the slopes between the collection
> cycles before the full gc event (black lines) and the full gc event
> (red line), while all black ones are roughly the same. :)
>
> Note that my graphs were painted off-hand without any actual
> calculation, and in particular the red one looks like an underestimation
> of the slope.
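I still can’t see the charts, but once I have the log I can do the same interpolation numerically rather than by eye; a quick least-squares sketch of the slope fit you describe (the sample numbers are invented):

    // Fit a line to (time, heap-used-after-GC) points from one collection cycle;
    // the slope approximates the promotion rate. Sample data below is invented.
    public class PromotionSlope {
        static double slopeKBps(double[] timeSec, double[] heapAfterKB) {
            int n = timeSec.length;
            double meanT = 0, meanH = 0;
            for (int i = 0; i < n; i++) { meanT += timeSec[i]; meanH += heapAfterKB[i]; }
            meanT /= n;
            meanH /= n;
            double num = 0, den = 0;
            for (int i = 0; i < n; i++) {
                num += (timeSec[i] - meanT) * (heapAfterKB[i] - meanH);
                den += (timeSec[i] - meanT) * (timeSec[i] - meanT);
            }
            return num / den;   // KB promoted per second, roughly
        }

        public static void main(String[] args) {
            double[] t = { 0, 30, 60, 90, 120 };                           // seconds, invented
            double[] h = { 400_000, 412_000, 425_000, 436_000, 450_000 };  // KB after GC, invented
            System.out.printf("slope ~ %.0f KB/s promoted%n", slopeKBps(t, h));
        }
    }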
>
>> As I look at all of the views in Censum I see nothing outstanding
>> that leads me to believe that this Full is a by-product of some
>> interaction between the collector and the application (some form of
>> zombies????). Also, one certainly cannot rule out your speculation
>
> It does not look like there is an issue e.g. with j.l.ref.References of
> any kind.
LinkedList used to be susceptible to nepotism, and all generational collectors suffer from zombies.
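By nepotism I mean something like this contrived sketch (emphatically not the real java.util.LinkedList code): a removed node that has already been promoted, and is itself garbage, keeps equally dead younger nodes reachable until the old gen is finally collected.

    // Contrived illustration of nepotism; not the real java.util.LinkedList implementation.
    final class Node {
        Object value;
        Node next;   // deliberately never nulled out on removal in this sketch
        Node(Object value) { this.value = value; }
    }

    final class SloppyQueue {
        private Node head, tail;

        void add(Object value) {
            Node n = new Node(value);
            if (tail == null) { head = n; } else { tail.next = n; }
            tail = n;
        }

        Object poll() {
            if (head == null) return null;
            Object value = head.value;
            head = head.next;             // the unlinked node still references its successor,
            if (head == null) tail = null;
            return value;                 // so a dead-but-promoted node can keep dead young
                                          // nodes reachable across young GCs until the old
                                          // gen is finally collected: nepotism.
        }
    }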
>
>> for heap fragmentation in PLABs. I simply don’t have the data to say
>> anything about that though I know it can be a huge issue. What I can
>> say is that even if there is 20% waste, it can’t account for the
>> amount of memory being recovered. I qualify that with, unless there
>> is a blast of barely humongous allocations taking place. I’d like to
>> think this is a waste issue, but I’m suspicious. I’m also suspicious
>> that it’s simply the application allocating in a burst then
>> releasing. If that were the case I’d expect a much more gradual
>> reduction in the live set size.
>>
>> I think the answer right now is: we need more data.
>
> Agree.
It’s on its way!! Yay for more data.
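One thing I’ll check once it arrives is whether those "barely humongous" allocations I mentioned above are actually in play; as a reminder to myself, the threshold is half a region (the region size below is just a placeholder, the real value comes from the flags/log).

    // Quick reference with a placeholder region size: G1 treats an allocation as
    // humongous once it is at least half a heap region (-XX:G1HeapRegionSize),
    // and such objects go straight into humongous regions in the old gen.
    public class HumongousThreshold {
        public static void main(String[] args) {
            long regionSizeBytes = 4L * 1024 * 1024;  // placeholder; read the real value from the log
            long thresholdBytes = regionSizeBytes / 2;
            System.out.println("Allocations of >= " + thresholdBytes + " bytes bypass the young gen entirely");
        }
    }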
You might be interested in this for visualization: https://github.com/kcpeppe/regions
It’s a wee bit crude at the moment, but I’ll accept pull requests for any refinements.
Kind regards,
Kirk