Strange G1 behavior
Guillaume Lederrey
guillaume.lederrey at gmail.com
Tue Oct 17 12:36:37 UTC 2017
It looks like a 20MB attachment isn't the way to go... So here is the log:
https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0
On 17 October 2017 at 14:21, Guillaume Lederrey <
guillaume.lederrey at gmail.com> wrote:
> Hello!
>
> Thanks a lot for your analysis! I have to admit that I barely follow your
> train of thought.
>
> Here is another GC log after adding "-XX:+G1PrintRegionLivenessInfo".
> This is only a few hours' worth of data; I'll send additional logs as they
> are generated.
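>
> For reference, a minimal JDK 8-style command line that would produce this
> kind of log. The flag set, paths and heap sizes below are illustrative
> placeholders, not our actual WDQS settings; G1PrintRegionLivenessInfo is a
> diagnostic option and so also needs UnlockDiagnosticVMOptions:
>
>   java -XX:+UseG1GC -Xms16g -Xmx16g \
>        -Xloggc:/var/log/wdqs-blazegraph_jvm_gc.%p.log \
>        -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>        -XX:+PrintAdaptiveSizePolicy \
>        -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \
>        -XX:GCLogFileSize=20M \
>        -XX:+UnlockDiagnosticVMOptions -XX:+G1PrintRegionLivenessInfo \
>        -jar <application>.jar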
>
> From what I understand of your discussions, we also need to have a deeper
> look at why our application is allocating that much memory and what we can
> do to prevent this. My very high-level guess: we have some clients sending
> more expensive queries. Since our SPARQL endpoint is publicly accessible,
> that means we need to find a way to proactively estimate the memory cost of
> a query and drop the ones that are too expensive. Not an easy problem
> either...
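>
> To make that idea concrete, here is a minimal sketch of what such an
> admission check could look like. Everything in it is hypothetical: the
> class, the budget values and the estimated-bytes input do not exist in
> Blazegraph or WDQS, and the hard part (deriving a sane estimate from the
> query plan) is left out entirely.
>
>   import java.util.concurrent.atomic.AtomicLong;
>
>   public class QueryAdmission {
>       // Assumed budgets, purely illustrative.
>       private static final long PER_QUERY_BUDGET_BYTES = 512L << 20; // 512 MB
>       private static final long GLOBAL_BUDGET_BYTES    = 8L << 30;   // 8 GB
>       private final AtomicLong inFlightBytes = new AtomicLong();
>
>       /** Admit the query only if its estimated footprint fits both budgets. */
>       public boolean tryAdmit(long estimatedBytes) {
>           if (estimatedBytes > PER_QUERY_BUDGET_BYTES) {
>               return false; // reject obviously expensive queries up front
>           }
>           long total = inFlightBytes.addAndGet(estimatedBytes);
>           if (total > GLOBAL_BUDGET_BYTES) {
>               inFlightBytes.addAndGet(-estimatedBytes);
>               return false; // too much estimated memory already in flight
>           }
>           return true;
>       }
>
>       /** Must be called when the query finishes, whether it succeeded or not. */
>       public void release(long estimatedBytes) {
>           inFlightBytes.addAndGet(-estimatedBytes);
>       }
>   }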
>
> Thanks again!
>
> Guillaume
>
> On 17 October 2017 at 13:31, Thomas Schatzl <thomas.schatzl at oracle.com>
> wrote:
>
>> Hi,
>>
>> On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:
>> > > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl at oracle
>> > > .com> wrote:
>> > >
>> > > Hi Kirk,
>> > >
>> > > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
>> > > > Hi Thomas,
>> > > >
>> > > > Again, thank you for the detailed response.
>> > > >
>> > > >
>> > > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
>> > > > > <thomas.schatzl at oracle.com> wrote:
>> > > > >
>> > > > > For the allocation rate, please compare the slopes of heap
>> > > > > usage after (young) gcs during these spikes (particularly in
>> > > > > that full gc case) and normal operation.
>> > > >
>> > > > Censum estimates allocation rates as this is a metric that I
>> > > > routinely evaluate.
>> > > >
>> > > > This log shows a spike at 10:07 which correlates with the Full GC
>> > > > event. However, the allocation rates, while high, are well within
>> > > > values I’ve seen with many other well-behaved applications. Censum
>> > > > also estimates rates of promotion, and those seem exceedingly high
>> > > > at 10:07. That said, there are spikes just after 10:10 and around
>> > > > 10:30 which don’t trigger a Full GC. In both cases the estimates
>> > > > for allocation rates are high, though the estimates for rates of
>> > > > promotion, while high, are not as high as those seen at 10:07.
>> > > >
>> > > > All in all, nothing here seems out of the ordinary, and while I
>> > > > want you to be right about the waste and PLAB behaviors, these
>> > > > spikes feel artificial, i.e. I still want to blame the collector
>> > > > for not being able to cope with some aspect of application
>> > > > behavior that it should be able to cope with... that is, something
>> > > > other than a high allocation rate with low recovery due to data
>> > > > simply being referenced and therefore not eligible for
>> > > > collection...
>> > >
>> > > When I wrote "allocation rate" here, I always meant promotion rate.
>> > > For this discussion (and in general), in a generational collector the
>> > > application's real allocation rate is usually not very interesting.
>> > >
>> > > Sorry for being imprecise.
>> >
>> > No worries, I get that promotion rates are very important, but when
>> > deciding what needs to be tuned, allocation rates are also very
>> > important, so I try to be very precise here, maybe too precise at
>> > times.
>>
>> Let's qualify this further: for the purpose of investigating this full
>> gc :)
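>>
>> To make the metric concrete: promotion rate can be approximated straight
>> from the log as the slope of "heap used after young GC" over time. A
>> sketch (the inputs would come from parsing the occupancy reported after
>> each young collection; this is not a Censum or JVM API):
>>
>>   // Heap that is still occupied after consecutive young GCs grew mostly
>>   // because objects were promoted (plus humongous allocations), so the
>>   // slope of that curve approximates the promotion rate.
>>   static double promotionRateBytesPerSec(double t0Sec, long usedAfterGc0Bytes,
>>                                          double t1Sec, long usedAfterGc1Bytes) {
>>       return (usedAfterGc1Bytes - usedAfterGc0Bytes) / (t1Sec - t0Sec);
>>   }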
>>
>> > > > > In this application, given the information I have, roughly every
>> > > > > 1500s there seems to be some component in the application that
>> > > > > allocates a lot of memory in a short time, and holds onto most of
>> > > > > it for its duration.
>> > > >
>> > > > Sorry, but I’m not seeing this pattern in either the occupancy-after-GC
>> > > > or allocation rate views. What I do see is a systematic loss of free
>> > > > heap over time (slow memory leak??? Effects of caching???).
>> > >
>> > > Let's have a look at the heap usage after gc over time for a few
>> > > collection cycles before that full gc event. Please look at
>> > > http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-
>> > > promo.png which just shows a few of these.
>> >
>> > Sorry but I can’t see these charts.
>>
>> My email client added some line breaks in the URL because it was too
>> long, sorry. One option is to manually stitch it together; all graphs
>> so far are collected in
>>
>> http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
>> strange-g1-promo.png image.
>>
>> (and cr.openjdk.java.net is live and kicking)
>>
>> > The issue with promotion rates is that you have to ask: is this the
>> > dog wagging the tail, or the tail wagging the dog? In this case I
>> > think it’s the tail wagging the dog. Would it be possible that a
>> > significant portion of the data being promoted was directly due to
>> > to-space being resized to 0?
>>
>> No. At that point the full gc had already become inevitable, from my
>> understanding of the log. If you look at the log, the second to-space
>> exhaustion before the full gc does not, as the policy assumes, give
>> back any free regions (typically they do), so there is no point in
>> restarting the application. Finally, some time later, it performs the
>> full gc.
>>
>> Overall, in my understanding there seems to be a basic out-of-memory
>> situation here: the application's live set (plus any overhead caused
>> by memory management) during that application phase is simply too large
>> for the memory available to G1.
>>
>> Full GC can free up some of that extra space (and potentially the
>> application already let go of some data at that time), allowing the
>> application to continue until everything returns to normal.
>>
>> That a GC algorithm needs extra space to work well is not a secret, and
>> is something the user needs to account for. Iirc some (rather old now)
>> literature (on smaller heaps, with different applications) indicates
>> that performance may start to degrade significantly once the live set
>> exceeds ~33% of the heap. After that full gc we are already at ~66%
>> live data...
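>>
>> To put numbers on that rule of thumb, with a placeholder heap size rather
>> than the actual WDQS configuration: on a 16 GB heap, ~66% live data is
>> about 10.6 GB that every collection has to keep copying around, leaving
>> only ~5.4 GB for survivors, evacuation destinations and new allocation,
>> whereas the ~33% guideline would want the live set nearer 5.3 GB.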
>>
>> So the behavior we are seeing now seems like expected (but of course
>> undesirable) behavior of the GC algorithm to me.
>>
>> Now there is of course the question of where that overhead comes from,
>> and how it can be reduced, since the memory used after the full gc is
>> quite a bit below the maximum heap. :)
>>
>> All that will probably be a bit clearer with more data.
>>
>> > Sorry for being so slow, but I think I’m starting to understand the
>> > underlying pathology here. I’ll hypothesize here and you can blow the
>> > whole idea out of the water with more insightful comments...
>> >
>> > Since tenured can grow into the space that the to-space is likely to
>> > also want, preference towards tenured deflates survivor, which in turn
>> > puts more pressure on tenured, which in turn deflates survivor even
>> > further; wash, rinse, repeat, Full GC to clean up. Since this didn’t
>> > happen with the (older) generational collectors, it seems possible
>> > that everything could be OK if a minimum to-space size could be
>> > maintained.
>>
>> Not sure what you mean by "preference towards tenured": of course the
>> generations share the same total memory, and if one of them gets
>> unusually large, the other needs to adapt. G1 already has a minimum
>> young gen size (which is, as you can see, not a hard minimum), but in
>> this degenerate situation it simply thinks: "okay, let's try limping
>> along with less than that because we can. Maybe we are lucky, the
>> application gets back to normal within the next few seconds, and we can
>> avoid a full gc".
>>
>> The alternative would basically be to start a fully compacting gc
>> earlier (the other collectors would do that btw) as soon as the space
>> statically allocated to old gen runs out.
>>
>> Do you think this is preferable behavior?
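>>
>> For reference only, not something discussed so far in this thread: the
>> existing knobs in this area are -XX:G1ReservePercent, which keeps a slice
>> of the heap free to lower the risk of to-space exhaustion, and the
>> experimental -XX:G1NewSizePercent / -XX:G1MaxNewSizePercent bounds on the
>> young generation. The values below are purely illustrative:
>>
>>   java -XX:+UseG1GC -XX:G1ReservePercent=20 \
>>        -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=10 ...
>>
>> Whether raising the reserve would actually help here depends on where the
>> overhead above the live set comes from.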
>>
>> > > I added rough linear interpolations of the heap usage after gc (so
>> > > that the slope of these lines corresponds to the promotion rates).
>> > > I can see a large, significant difference in the slopes between the
>> > > collection cycles before the full gc event (black lines) and the
>> > > full gc event (red line), while all black ones are roughly the
>> > > same. :)
>> > >
>> > > Note that my graphs were painted off-hand without any actual
>> > > calculation, and in particular the red one looks like an
>> > > underestimation of the slope.
>> > >
>> > > > As I look at all of the views in Censum I see nothing outstanding
>> > > > that leads me to believe that this Full is a by-product of some
>> > > > interaction between the collector and the application (some form
>> > > > of zombies????). Also, one certainly cannot rule out your
>> > > > speculation
>> > >
>> > > It does not look like there is an issue e.g. with
>> > > j.l.ref.References of
>> > > any kind.
>> >
>> > LinkedList used to be susceptible to nepotism, and all generational
>> > collectors suffer from zombies.
>>
>> Maybe. I did not preclude more issues here :)
>>
>> [...]
>> > > >
>> > > > I think the answer right now is; we need more data.
>> > >
>> > > Agree.
>> >
>> > It’s on its way!! Yay for more data.
>> > You might be interested in this for visualization:
>> > https://github.com/kcpeppe/regions
>> > It’s a wee bit crude at the moment but I’ll accept pull requests for
>> > any refinements.
>> >
>>
>> Thanks,
>> Thomas
>>
>>
>
>
> --
> mobile : +41 76 573 32 40
> skype : Guillaume.Lederrey
> Freenode: gehel
>
> projects :
> * https://signs-web.herokuapp.com/
>
--
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel
projects :
* https://signs-web.herokuapp.com/