Strange G1 behavior
Thomas Schatzl
thomas.schatzl at oracle.com
Tue Oct 17 11:31:31 UTC 2017
Hi,
On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:
> > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> >
> > Hi Kirk,
> >
> > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
> > > Hi Thomas,
> > >
> > > Again, thank you for the detailed response.
> > >
> > >
> > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
> > > > <thomas.schatzl at oracle.com> wrote:
> > > >
> > > > For the allocation rate, please compare the slopes of heap
> > > > usage after (young) gcs during these spikes (particularly in
> > > > that full gc case) and normal operation.
> > >
> > > Censum estimates allocation rates as this is a metric that I
> > > routinely evaluate.
> > >
> > > This log shows a spike at 10:07 which correlates with the Full GC
> > > event. However the allocation rates while high, are well within
> > > values I’ve seen with many other applications that are well
> > > behaved. Censum also estimates rates of promotion and those seem
> > > exceedingly high at 10:07. That said, there are spikes just after
> > > 10:10 and around 10:30 which don’t trigger a Full. In both cases
> > > the estimates for allocation rates are high though the estimates
> > > for rates of promotion while high, are not as high as those seen
> > > at 10:07.
> > >
> > > All in all, nothing here seems out of the ordinary and while I
> > > want you to be right about the waste and PLAB behaviors, these
> > > spikes feel artificial, i.e. I still want to blame the collector
> > > for not being able to cope with some aspect of application
> > > behavior that it should be able to cope with... that is, something
> > > other than a high allocation rate with low recovery due to data
> > > simply still being referenced and therefore not eligible for
> > > collection...
> >
> > By "allocation rate" I always meant "promotion rate" here. For this
> > discussion (and in general, with a generational collector) the
> > application's real allocation rate is usually not very interesting.
> >
> > Sorry for being imprecise.
>
> No worries, I get that promotion rates are very important, but when
> deciding what needs to be tuned, allocation rates are also very
> important, so I try to be very precise here, maybe too precise at
> times.
Let's qualify this further: for the purpose of investigating this full
gc :)
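To make the distinction concrete, here is a small sketch (illustrative only, not Censum's or G1's actual code) of how the two rates can be estimated from (timestamp, heap-before-gc, heap-after-gc) samples: the allocation rate comes from heap growth between collections, the promotion rate from the slope of the occupancy left over *after* collections.

```java
import java.util.List;

// Illustrative sketch: estimating allocation vs. promotion rate from
// (timestamp, heapBefore, heapAfter) samples taken at each young GC.
public class GcRates {
    public record Sample(double timeSec, long beforeBytes, long afterBytes) {}

    // Allocation rate: bytes the application allocated between the end
    // of one GC and the start of the next, divided by wall-clock time.
    public static double allocationRate(List<Sample> s) {
        long allocated = 0;
        double elapsed = s.get(s.size() - 1).timeSec() - s.get(0).timeSec();
        for (int i = 1; i < s.size(); i++) {
            allocated += s.get(i).beforeBytes() - s.get(i - 1).afterBytes();
        }
        return allocated / elapsed;
    }

    // Promotion rate: slope of heap occupancy after each GC, i.e. data
    // surviving collections and accumulating in the old generation.
    public static double promotionRate(List<Sample> s) {
        Sample first = s.get(0), last = s.get(s.size() - 1);
        return (last.afterBytes() - first.afterBytes())
                / (last.timeSec() - first.timeSec());
    }
}
```

An application can have a high allocation rate and a near-zero promotion rate (everything dies young); it is the promotion slope that matters for this full gc.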
> > > > In this application, given the information I have, roughly every
> > > > 1500s there seems to be some component in the application that
> > > > allocates a lot of memory in a short time, and holds onto most of
> > > > it for its duration.
> > >
> > > Sorry but I’m not seeing this pattern either in occupancy after
> > > or
> > > allocation rate views. What I do see is a systematic loss of free
> > > heap over time (slow memory leak??? Effects of caching???).
> >
> > Let's have a look at the heap usage after gc over time for a few
> > collection cycle before that full gc event. Please look at
> > http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-
> > promo.png which just shows a few of these.
>
> Sorry but I can’t see these charts.
My email client added some line breaks in the URL because it was too
long, sorry. One option is to manually stitch it together; all graphs
so far are collected in
http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
strange-g1-promo.png image.
(and cr.openjdk.java.net is live and kicking)
> The issue with promotion rates is that you have to ask, is this the
> dog wagging the tail, or the tail wagging the dog. In this case I
> think it’s the tail wagging the dog. Would it be possible that a
> significant portion of the data being promoted was directly due to
> to-space being resized to 0?
No. At that point the full gc was already inevitable, from my
understanding of the log. If you look at the log, the second to-space
exhaustion before the full gc does not, as assumed in the policy, give
back any free regions (typically they do), so there is no point in
restarting the application. Finally, some time later, it performs the
full gc.
Overall, in my understanding there seems to be a basic out-of-memory
situation here: the application's live set (plus any overheads caused
by memory management) during that application phase is too much for the
given available memory for G1.
Full GC can free up some of that extra space (and potentially the
application already let go of some data at that time), allowing the
application to continue until everything returns to normal.
That a GC algorithm needs extra space to work well is not a secret, and
is something the user needs to account for. Iirc some (rather old now)
literature (on smaller heaps, with different applications) indicates
that once the live set exceeds ~33% of the heap, performance may start
to degrade significantly. After that full gc we are already at ~66%
live data...
So the behavior we are seeing now seems like expected (but of course
undesirable) behavior of the GC algorithm to me.
Now there is of course the question where that overhead comes from, and
how it can be reduced as the memory used after full gc is quite a bit
below the maximum heap. :)
All that will probably be a bit clearer with more data.
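As a back-of-the-envelope sketch of that headroom argument (the ~33% threshold is only the rule of thumb from the literature mentioned above, and the numbers are hypothetical, not measurements from this log):

```java
// Illustrative headroom check: compares the live-set fraction of the
// heap after a full gc against a rule-of-thumb threshold. The inputs
// are hypothetical, not taken from the log under discussion.
public class Headroom {
    public static double liveFraction(long liveBytes, long heapBytes) {
        return (double) liveBytes / heapBytes;
    }

    // True if the collector is likely to have to work noticeably harder
    // because too little of the heap is free for it to operate in.
    public static boolean likelyDegraded(long liveBytes, long heapBytes,
                                         double threshold) {
        return liveFraction(liveBytes, heapBytes) > threshold;
    }
}
```

With ~66 of 100 units live, the check trips well past a 33% threshold, which matches the "basic out-of-memory situation" reading above.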
> Sorry for being so slow but I think I’m starting to understand the
> underlying pathology here. I’ll hypothesize here and you can blow the
> whole idea out of the water with more insightful comments..
>
> Since tenured can grow into the space that the to-space is likely to
> also want, preference towards tenured deflates survivor which in turn
> puts more pressure on tenured which in turn deflates survivor even
> further; wash, rinse, repeat; Full GC to clean up. Since this didn’t
> happen with the (older) generational collectors, it seems as if it’s
> possible that everything could be ok if a minimum to-space size could
> be maintained.
Not sure what you mean by "preference towards tenured": of course the
generations share the same total memory, and if one of them gets
unusually large, the other needs to adapt. G1 already has a minimum
young gen size (which is, as you can see, not a hard minimum), but in
this degenerate situation it simply thinks: "okay, let's try limping
along with less than that because we can. Maybe we are lucky, the
application gets back to normal within the next few seconds and so can
avoid full gc".
The alternative would basically be to start a fully compacting gc
earlier (the other collectors would do that btw) as soon as the space
statically allocated to old gen runs out.
Do you think this is preferable behavior?
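For reference, the knobs involved here are real G1 options; the values below are a hypothetical experiment to give G1 more breathing room, not a recommendation for this particular application:

```shell
# G1ReservePercent (default 10) keeps a slice of the heap free as
# to-space headroom; G1NewSizePercent (default 5, an experimental flag)
# is the soft minimum young generation size G1 tries to honor.
java -XX:+UseG1GC \
     -XX:G1ReservePercent=15 \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1NewSizePercent=10 \
     -jar app.jar
```

Note that, as discussed above, the young-gen minimum is soft: in a to-space-exhaustion situation G1 may still shrink below it rather than trigger an immediate full gc.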
> > I added rough linear interpolations of the heap usage after gc (so
> > that the slope of these lines corresponds to the promotion rates).
> > I can see a large, significant difference in the slopes between the
> > collection cycles before the full gc event (black lines) and the
> > full gc event (red line), while all black ones are roughly the
> > same. :)
> >
> > Note that my graphs were painted off-hand without any actual
> > calculation, and in particular the red one looks like an
> > underestimation of the slope.
> >
> > > As I look at all of the views in Censum I see nothing outstanding
> > > that leads me to believe that this Full is a by-product of some
> > > interaction between the collector and the application (some form
> > > of zombies????). Also, one certainly cannot rule out your
> > > speculation
> >
> > It does not look like there is an issue e.g. with
> > j.l.ref.References of
> > any kind.
>
> LinkedList used to be susceptible to nepotism, and all generational
> collectors suffer from zombies.
Maybe. I did not preclude more issues here :)
[...]
> > >
> > > I think the answer right now is; we need more data.
> >
> > Agree.
>
> It’s on its way!! Yay for more data.
> You might be interested in this for visualization:
> https://github.com/kcpeppe/regions
> It’s a wee bit crude at the moment but I’ll accept pull requests for
> any refinements.
>
Thanks,
Thomas