Strange G1 behavior

Thomas Schatzl thomas.schatzl at oracle.com
Tue Oct 17 08:22:54 UTC 2017


Hi Kirk,

On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
> Hi Thomas,
> 
> Again, thank you for the detailed response.
> 
> 
> > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
> > <thomas.schatzl at oracle.com> wrote:
> > 
> > For the allocation rate, please compare the slopes of heap usage
> > after (young) gcs during these spikes (particularly in that full gc
> > case) and normal operation.
> 
> Censum estimates allocation rates as this is a metric that I
> routinely evaluate.
> 
> This log shows a spike at 10:07 which correlates with the Full GC
> event. However, the allocation rates, while high, are well within
> values I’ve seen with many other applications that are well behaved.
> Censum also estimates rates of promotion, and those seem exceedingly
> high at 10:07. That said, there are spikes just after 10:10 and
> around 10:30 which don’t trigger a Full. In both cases the estimates
> for allocation rates are high, though the estimates for rates of
> promotion, while high, are not as high as those seen at 10:07.
>
> All in all, nothing here seems out of the ordinary, and while I want
> you to be right about the waste and PLAB behaviors, these spikes feel
> artificial, i.e. I still want to blame the collector for not being
> able to cope with some aspect of application behavior that it should
> be able to cope with... that is, something other than a high
> allocation rate with low recovery due to data simply being referenced
> and therefore not eligible for collection...

When I wrote "allocation rate" here I always meant the promotion rate.
For this discussion (and in general), in a generational collector the
application's real allocation rate is usually not very interesting.

Sorry for being imprecise.

> > In this application, given the information I have, roughly every
> > 1500s there seems to be some component in the application that
> > allocates a lot of memory in a short time, and holds onto most of
> > it for its duration.
> 
> Sorry, but I’m not seeing this pattern in either the occupancy-after-
> GC or the allocation rate views. What I do see is a systematic loss
> of free heap over time (slow memory leak??? effects of caching???).

Let's have a look at the heap usage after gc over time for a few
collection cycles before that full gc event. Please look at
http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-promo.png
which just shows a few of these.

I added rough linear interpolations of the heap usage after gc (so that
the slope of these lines corresponds to the promotion rates). I can see
a large, significant difference in the slopes between the collection
cycles before the full gc event (black lines) and the cycle leading up
to the full gc event (red line), while all the black ones are roughly
the same. :)

Note that my lines were drawn off-hand without any actual calculation,
and in particular the red one looks like an underestimation of the
slope.
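
As an aside (my own sketch, not something from the log or from Censum):
given the (timestamp, heap usage after young gc) pairs of one collection
cycle, the promotion rate is just the slope of a straight-line fit
through them. A minimal sketch in Java, assuming the samples have
already been parsed out of the log:

// Minimal sketch, not from the original log: estimate the promotion
// rate as the slope of "heap used after young gc" over time, using a
// least-squares fit over one collection cycle.
class PromotionRateEstimator {

    // timeSec[i]: timestamp of gc i in seconds,
    // usedAfterGcBytes[i]: heap usage after that gc in bytes.
    static double promotionRateBytesPerSec(double[] timeSec,
                                            double[] usedAfterGcBytes) {
        int n = timeSec.length;
        double sumT = 0, sumU = 0, sumTT = 0, sumTU = 0;
        for (int i = 0; i < n; i++) {
            sumT  += timeSec[i];
            sumU  += usedAfterGcBytes[i];
            sumTT += timeSec[i] * timeSec[i];
            sumTU += timeSec[i] * usedAfterGcBytes[i];
        }
        // least-squares slope:
        // (n*sum(t*u) - sum(t)*sum(u)) / (n*sum(t^2) - sum(t)^2)
        return (n * sumTU - sumT * sumU) / (n * sumTT - sumT * sumT);
    }
}

Doing that per cycle would give exact numbers instead of my eyeballed
lines.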

> As I look at all of the views in Censum I see nothing outstanding
> that leads me to believe that this Full is a by-product of some
> interaction between the collector and the application (some form of
> zombies????). Also, one certainly cannot rule out your speculation 

It does not look like there is an issue e.g. with j.l.ref.References of
any kind.

> for heap fragmentation in PLABs. I simply don’t have the data to say
> anything about that, though I know it can be a huge issue. What I can
> say is that even if there is 20% waste, it can’t account for the
> amount of memory being recovered. I qualify that with: unless there
> is a blast of barely humongous allocations taking place. I’d like to
> think this is a waste issue but I’m suspicious. I’m also suspicious
> that it’s simply the application allocating in a burst and then
> releasing. If that were the case I’d expect a much more gradual
> reduction in the live set size.
> 
> I think the answer right now is; we need more data.

Agree.

> I’ll try to get the “client” to turn on the extra flags and see what
> that yields. I won’t play with PLAB sizing this go ‘round if you
> don’t mind. If you’re right and it is a problem with waste, then the
> beer is on me the next time we meet.
> 
> The “don’t allocate arrays with sizes that are powers of 2” comment
> is an interesting one. While there are clear advantages to allocating
> arrays with power-of-2 sizes, I believe those cases are specialized,
> and I don’t generally see people dogmatically allocating this way.

There are cases where you probably want 2^n sized buffers, but in many,
many cases, like some serialization for data transfer, it does not
matter a bit whether the output buffer can hold exactly 2^n bytes or is
just a bit smaller.

Of course this is something G1 should handle better by itself, but for
now that is what you can do about this.
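
To make the 2^n point concrete (my own example with assumed numbers,
not taken from this application's settings): G1 treats objects of at
least half a region size as humongous and gives them their own
region(s). A byte[] of exactly 2^n elements is 2^n bytes plus an object
header of typically 16 bytes, so with e.g. 2 MB regions a
byte[1024 * 1024] just crosses the 1 MB threshold, becomes humongous
and occupies a whole 2 MB region, while a buffer that is just a bit
smaller stays a regular allocation:

// Illustration only, with assumed numbers (2 MB regions, 16 byte array
// header): why a byte[] of exactly 2^n elements can become a humongous
// object in G1 while a slightly smaller one does not.
public class PowerOfTwoBuffers {
    static final long REGION_SIZE = 2L * 1024 * 1024;         // e.g. -XX:G1HeapRegionSize=2m
    static final long HUMONGOUS_THRESHOLD = REGION_SIZE / 2;  // >= half a region is humongous
    static final long ARRAY_HEADER_BYTES = 16;                // typical byte[] header, may vary

    static boolean isHumongous(long arrayLength) {
        return arrayLength + ARRAY_HEADER_BYTES >= HUMONGOUS_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println("byte[1 << 20]        humongous: " + isHumongous(1 << 20));        // true
        System.out.println("byte[(1 << 20) - 64] humongous: " + isHumongous((1 << 20) - 64)); // false
    }
}

Nearly half of that region is then wasted for the lifetime of the
buffer, which is the kind of overhead the "just a bit smaller" sizing
avoids.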

Thanks,
  Thomas



