<div dir="ltr">It looks like a 20MB attachement isn't the way to go... So here is the log: <a href="https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0">https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0</a><br><div class="gmail_extra"><br><div class="gmail_quote">On 17 October 2017 at 14:21, Guillaume Lederrey <span dir="ltr"><<a href="mailto:guillaume.lederrey@gmail.com" target="_blank">guillaume.lederrey@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello!<div><br></div><div>Thanks a lot for your analysis! I have to admit that I barely follow your train of thoughts.</div><div><br></div><div>Here is another GC log after adding "-XX:+<wbr>G1PrintRegionLivenessInfo". This is only a few hours worth of data, I'll send additional logs as they are generated.</div><div><br></div><div>From what I understand of your discussions, we also need to have a deeper look at why our application is allocating that much memory and what we can do to prevent this. My very high level guess: we have some clients doing some more expensive query. Since our SPARQL endpoint is publicly accessible, that would mean we need to find a way to proactively estimate the memory cost of a query and drop the too expensive ones. Not an easy problem either...</div><div><br></div><div>Thanks again!</div><span class="gmail-HOEnZb"><font color="#888888"><div><br></div><div> Guillaume</div></font></span></div><div class="gmail_extra"><div><div class="gmail-h5"><br><div class="gmail_quote">On 17 October 2017 at 13:31, Thomas Schatzl <span dir="ltr"><<a href="mailto:thomas.schatzl@oracle.com" target="_blank">thomas.schatzl@oracle.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

On 17 October 2017 at 13:31, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:

Hi,

On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:
> > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
> >
> > Hi Kirk,
> >
> > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
> > > Hi Thomas,
> > >
> > > Again, thank you for the detailed response.
> > >
> > >
> > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
> > > > <thomas.schatzl@oracle.com> wrote:
> > > >
> > > > For the allocation rate, please compare the slopes of heap
> > > > usage after (young) gcs during these spikes (particularly in
> > > > that full gc case) and normal operation.
> > >
> > > Censum estimates allocation rates, as this is a metric that I
> > > routinely evaluate.
> > >
> > > This log shows a spike at 10:07 which correlates with the Full GC
> > > event. However, the allocation rates, while high, are well within
> > > values I’ve seen with many other applications that are well
> > > behaved. Censum also estimates rates of promotion, and those seem
> > > exceedingly high at 10:07. That said, there are spikes just after
> > > 10:10 and around 10:30 which don’t trigger a Full. In both cases
> > > the estimates for allocation rates are high, though the estimates
> > > for rates of promotion, while high, are not as high as those seen
> > > at 10:07.
> > >
> > > All in all, nothing here seems out of the ordinary, and while I
> > > want you to be right about the waste and PLAB behaviors, these
> > > spikes feel artificial, i.e. I still want to blame the collector
> > > for not being able to cope with some aspect of application
> > > behavior that it should be able to cope with... that is something
> > > other than a high allocation rate with low recovery due to data
> > > simply being referenced and therefore not eligible for
> > > collection...
> >
> > I always meant "promotion rate" here when I said allocation rate. For
> > this discussion (and in general) in a generational collector the
> > application's real allocation rate is usually not very interesting.
> >
> > Sorry for being imprecise.
>
> No worries, I get that promotion rates are very important, but when
> deciding what needs to be tuned, allocation rates are also very
> important, so I try to be very precise here, maybe too precise at
> times.

Let's qualify this further: for the purpose of investigating this full
gc :)

> > > > In this application, given the information I have, roughly every
> > > > 1500s there seems to be some component in the application that
> > > > allocates a lot of memory in a short time, and holds onto most
> > > > of it for its duration.
> > >
> > > Sorry, but I'm not seeing this pattern in either the
> > > occupancy-after-GC view or the allocation rate view. What I do see
> > > is a systematic loss of free heap over time (slow memory leak???
> > > Effects of caching???).
> >
> > Let's have a look at the heap usage after gc over time for a few
> > collection cycles before that full gc event. Please look at
> > http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-promo.png
> > which just shows a few of these.
>
> Sorry but I can’t see these charts.

My email client added some line breaks in the URL because it was too
long, sorry. One option is to manually stitch it together; all graphs
so far are collected in

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
strange-g1-promo.png image.

(and cr.openjdk.java.net is live and kicking)
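In case it helps to reproduce the graphs: the slopes in those pictures are just deltas of heap occupancy over time. A minimal sketch of the arithmetic, with made-up numbers standing in for values parsed out of the "Heap: used(capacity)->used(capacity)" part of the PrintGCDetails output:

    // Rough promotion/allocation rate estimates from two consecutive young GCs.
    // All numbers below are made up for illustration; in practice they come
    // from parsing the timestamps and heap occupancies out of the GC log.
    public class GcRates {
        public static void main(String[] args) {
            // previous young GC: at t=100s, heap went 9000 MB -> 6200 MB
            double t0 = 100.0, after0 = 6200.0;
            // current young GC: at t=130s, heap went 9100 MB -> 6800 MB
            double t1 = 130.0, before1 = 9100.0, after1 = 6800.0;

            double dt = t1 - t0;
            // slope of "used after GC" ~= promotion rate (what ends up in old gen,
            // plus some drift from survivors and humongous allocations)
            double promotionRate = (after1 - after0) / dt;
            // how quickly the application filled the heap between the two pauses
            double allocationRate = (before1 - after0) / dt;

            System.out.printf("allocation ~ %.1f MB/s, promotion ~ %.1f MB/s%n",
                    allocationRate, promotionRate);
        }
    }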

> The issue with promotion rates is that you have to ask: is this the
> dog wagging the tail, or the tail wagging the dog? In this case I
> think it’s the tail wagging the dog. Would it be possible that a
> significant portion of the data being promoted was directly due to
> to-space being resized to 0?

No. At that point the full gc was already inevitable, from my
understanding of the log. If you look at the log, the second to-space
exhaustion before the full gc does not, as the policy assumes, give
back any free regions (typically they do), so there is no point in
restarting the application. Some time later it finally performs the
full gc.

Overall, in my understanding there seems to be a basic out-of-memory
situation here: the application's live set (plus any overheads caused
by memory management) during that application phase is too large for
the memory available to G1.

Full GC can free up some of that extra space (and potentially the
application had already let go of some data at that time), allowing the
application to continue until everything returns to normal.

That a GC algorithm needs extra space to work well is not a secret, and
is something the user needs to account for. IIRC some (rather old now)
literature (on smaller heaps, with different applications) indicates
that performance may start to degrade significantly once the live set
exceeds ~33% of the heap. After that full gc we are already at ~66%
live data...

So the behavior we are seeing now seems like expected (but of course
undesirable) behavior of the GC algorithm to me.

Now there is of course the question of where that overhead comes from,
and how it can be reduced, given that the memory used after full gc is
quite a bit below the maximum heap. :)

All that will probably be a bit clearer with more data.
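To make that concrete with made-up numbers (the real figures are in the full gc entry of the log, in the usual "before->after(capacity)" form), a back-of-the-envelope check:

    // Back-of-the-envelope live-set / headroom check after a full GC.
    // The 12 GB capacity and 8 GB post-full-GC occupancy are hypothetical,
    // chosen only to land near the ~66% mentioned above.
    public class HeadroomCheck {
        public static void main(String[] args) {
            double capacityGb  = 12.0; // total heap (-Xmx)
            double afterFullGb = 8.0;  // used right after the full GC ~= live set

            double liveFraction = afterFullGb / capacityGb;
            // everything else has to fit in here: eden, survivors, the copy
            // reserve for evacuation, and general GC working room
            double headroomGb = capacityGb - afterFullGb;

            System.out.printf("live set ~ %.0f%% of heap, headroom ~ %.1f GB%n",
                    liveFraction * 100, headroomGb);
        }
    }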

> Sorry for being so slow, but I think I’m starting to understand the
> underlying pathology here. I’ll hypothesize here and you can blow the
> whole idea out of the water with more insightful comments...
>
> Since tenured can grow into the space that the to-space is likely to
> also want, preference towards tenured deflates survivor, which in turn
> puts more pressure on tenured, which in turn deflates survivor even
> further; wash, rinse, repeat, Full GC to clean up. Since this didn’t
> happen with the (older) generational collectors, it seems as if it’s
> possible that everything could be ok if a minimum to-space size could
> be maintained.

Not sure what you mean by "preference towards tenured": of course the
generations share the same total memory, and if one of them gets
unusually large, the other needs to adapt. G1 already has a minimum
young gen size (which is, as you can see, not a hard minimum), but in
this degenerate situation it simply thinks: "okay, let's try limping
along with less than that because we can. Maybe we are lucky, the
application gets back to normal within the next few seconds, and so we
can avoid the full gc".

The alternative would basically be to start a fully compacting gc
earlier (the other collectors would do that, btw), as soon as the space
statically allocated to the old gen runs out.

Do you think this is preferable behavior?
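For reference, assuming JDK 8, the knobs involved here are roughly the following; the values shown are the defaults as far as I remember, and the first two are experimental flags, hence the unlock:

    -XX:+UnlockExperimentalVMOptions
    -XX:G1NewSizePercent=5       (soft lower bound on young gen size, as a percent of the heap)
    -XX:G1MaxNewSizePercent=60   (upper bound on young gen size, as a percent of the heap)
    -XX:G1ReservePercent=10      (share of the heap kept back to reduce the risk of to-space exhaustion)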

> > I added rough linear interpolations of the heap usage after gc (so
> > that the slope of these lines corresponds to the promotion rates).
> > I can see a large, significant difference in the slopes between the
> > collection cycles before the full gc event (black lines) and the
> > full gc event (red line), while all black ones are roughly the
> > same. :)
> >
> > Note that my graphs were painted off-hand without any actual
> > calculation, and in particular the red one looks like an
> > underestimation of the slope.
> >
> > > As I look at all of the views in Censum I see nothing outstanding
> > > that leads me to believe that this Full is a by-product of some
> > > interaction between the collector and the application (some form
> > > of zombies????). Also, one certainly cannot rule out your
> > > speculation
> >
> > It does not look like there is an issue, e.g. with
> > j.l.ref.References of any kind.
>
> LinkedList used to be susceptible to nepotism, and all generational
> collectors suffer from zombies.

Maybe. I did not preclude more issues here :)

[...]
> > >
> > > I think the answer right now is: we need more data.
> >
> > Agree.
>
> It’s on its way!! Yay for more data.
> You might be interested in this for visualization:
> https://github.com/kcpeppe/regions
> It’s a wee bit crude at the moment, but I’ll accept pull requests for
> any refinements.
>

Thanks,
Thomas

-- 
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel

projects :
* https://signs-web.herokuapp.com/