<div dir="ltr">It looks like a 20MB attachement isn't the way to go... So here is the log: <a href="https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0">https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0</a><br><div class="gmail_extra"><br><div class="gmail_quote">On 17 October 2017 at 14:21, Guillaume Lederrey <span dir="ltr"><<a href="mailto:guillaume.lederrey@gmail.com" target="_blank">guillaume.lederrey@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello!<div><br></div><div>Thanks a lot for your analysis! I have to admit that I barely follow your train of thoughts.</div><div><br></div><div>Here is another GC log after adding "-XX:+<wbr>G1PrintRegionLivenessInfo". This is only a few hours worth of data, I'll send additional logs as they are generated.</div><div><br></div><div>From what I understand of your discussions, we also need to have a deeper look at why our application is allocating that much memory and what we can do to prevent this. My very high level guess: we have some clients doing some more expensive query. Since our SPARQL endpoint is publicly accessible, that would mean we need to find a way to proactively estimate the memory cost of a query and drop the too expensive ones. Not an easy problem either...</div><div><br></div><div>Thanks again!</div><span class="gmail-HOEnZb"><font color="#888888"><div><br></div><div> Guillaume</div></font></span></div><div class="gmail_extra"><div><div class="gmail-h5"><br><div class="gmail_quote">On 17 October 2017 at 13:31, Thomas Schatzl <span dir="ltr"><<a href="mailto:thomas.schatzl@oracle.com" target="_blank">thomas.schatzl@oracle.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

On 17 October 2017 at 13:31, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:

Hi,

On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:
> > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
> >
> > Hi Kirk,
> >
> > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
> > > Hi Thomas,
> > >
> > > Again, thank you for the detailed response.
> > >
> > >
> > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
> > > > <thomas.schatzl@oracle.com> wrote:
> > > >
> > > > For the allocation rate, please compare the slopes of heap
> > > > usage after (young) gcs during these spikes (particularly in
> > > > that full gc case) and normal operation.
> > >
> > > Censum estimates allocation rates, as this is a metric that I
> > > routinely evaluate.
> > >
> > > This log shows a spike at 10:07 which correlates with the Full GC
> > > event. However, the allocation rates, while high, are well within
> > > values I’ve seen with many other applications that are well
> > > behaved. Censum also estimates rates of promotion, and those seem
> > > exceedingly high at 10:07. That said, there are spikes just after
> > > 10:10 and around 10:30 which don’t trigger a Full. In both cases
> > > the estimates for allocation rates are high, though the estimates
> > > for rates of promotion, while high, are not as high as those seen
> > > at 10:07.
> > >
> > > All in all, nothing here seems out of the ordinary, and while I
> > > want you to be right about the waste and PLAB behaviors, these
> > > spikes feel artificial, i.e. I still want to blame the collector
> > > for not being able to cope with some aspect of application
> > > behavior that it should be able to cope with... that is something
> > > other than a high allocation rate with low recovery due to data
> > > simply being referenced and therefore not eligible for
> > > collection...
> >
> > I always meant "promotion rate" here when I said allocation rate. For
> > this discussion (and in general) in a generational collector the
> > application's real allocation rate is usually not very interesting.
> >
> > Sorry for being imprecise.
>
> No worries, I get that promotion rates are very important, but when
> deciding what needs to be tuned, allocation rates are also very
> important, so I try to be very precise here, maybe too precise at
> times.

Let's qualify this further: for the purpose of investigating this full
gc :)

> > > > In this application, given the information I have, roughly every
> > > > 1500s there seems to be some component in the application that
> > > > allocates a lot of memory in a short time, and holds onto most
> > > > of it for its duration.
> > >
> > > Sorry, but I'm not seeing this pattern in either the
> > > occupancy-after-GC view or the allocation rate view. What I do see
> > > is a systematic loss of free heap over time (slow memory leak???
> > > Effects of caching???).
> >
> > Let's have a look at the heap usage after gc over time for a few
> > collection cycles before that full gc event. Please look at
> > http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-promo.png
> > which just shows a few of these.
>
> Sorry but I can’t see these charts.

My email client added some line breaks in the URL because it was too
long, sorry. One option is to manually stitch it together; all graphs
so far are collected in

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
strange-g1-promo.png image.

(and cr.openjdk.java.net is live and kicking)
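In case it helps to reproduce the graphs: the slopes in those pictures are just deltas of heap occupancy over time. A minimal sketch of the arithmetic, with made-up numbers standing in for values parsed out of the "Heap: used(capacity)->used(capacity)" part of the PrintGCDetails output:

    // Rough promotion/allocation rate estimates from two consecutive young GCs.
    // All numbers below are made up for illustration; in practice they come
    // from parsing the timestamps and heap occupancies out of the GC log.
    public class GcRates {
        public static void main(String[] args) {
            // previous young GC: at t=100s, heap went 9000 MB -> 6200 MB
            double t0 = 100.0, after0 = 6200.0;
            // current young GC: at t=130s, heap went 9100 MB -> 6800 MB
            double t1 = 130.0, before1 = 9100.0, after1 = 6800.0;

            double dt = t1 - t0;
            // slope of "used after GC" ~= promotion rate (what ends up in old gen,
            // plus some drift from survivors and humongous allocations)
            double promotionRate = (after1 - after0) / dt;
            // how quickly the application filled the heap between the two pauses
            double allocationRate = (before1 - after0) / dt;

            System.out.printf("allocation ~ %.1f MB/s, promotion ~ %.1f MB/s%n",
                    allocationRate, promotionRate);
        }
    }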

> The issue with promotion rates is that you have to ask: is this the
> dog wagging the tail, or the tail wagging the dog? In this case I
> think it’s the tail wagging the dog. Would it be possible that a
> significant portion of the data being promoted was directly due to
> to-space being resized to 0?

No. At that point the full gc was already inevitable, from my
understanding of the log. If you look at the log, the second to-space
exhaustion before the full gc does not, as the policy assumes, give
back any free regions (typically they do), so there is no point in
restarting the application. Some time later it finally performs the
full gc.

Overall, in my understanding there seems to be a basic out-of-memory
situation here: the application's live set (plus any overheads caused
by memory management) during that application phase is too large for
the memory available to G1.

Full GC can free up some of that extra space (and potentially the
application had already let go of some data at that time), allowing the
application to continue until everything returns to normal.

That a GC algorithm needs extra space to work well is not a secret, and
is something the user needs to account for. IIRC some (rather old now)
literature (on smaller heaps, with different applications) indicates
that performance may start to degrade significantly once the live set
exceeds ~33% of the heap. After that full gc we are already at ~66%
live data...

So the behavior we are seeing now seems like expected (but of course
undesirable) behavior of the GC algorithm to me.

Now there is of course the question of where that overhead comes from,
and how it can be reduced, given that the memory used after full gc is
quite a bit below the maximum heap. :)

All that will probably be a bit clearer with more data.
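To make that concrete with made-up numbers (the real figures are in the full gc entry of the log, in the usual "before->after(capacity)" form), a back-of-the-envelope check:

    // Back-of-the-envelope live-set / headroom check after a full GC.
    // The 12 GB capacity and 8 GB post-full-GC occupancy are hypothetical,
    // chosen only to land near the ~66% mentioned above.
    public class HeadroomCheck {
        public static void main(String[] args) {
            double capacityGb  = 12.0; // total heap (-Xmx)
            double afterFullGb = 8.0;  // used right after the full GC ~= live set

            double liveFraction = afterFullGb / capacityGb;
            // everything else has to fit in here: eden, survivors, the copy
            // reserve for evacuation, and general GC working room
            double headroomGb = capacityGb - afterFullGb;

            System.out.printf("live set ~ %.0f%% of heap, headroom ~ %.1f GB%n",
                    liveFraction * 100, headroomGb);
        }
    }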

> Sorry for being so slow, but I think I’m starting to understand the
> underlying pathology here. I’ll hypothesize here and you can blow the
> whole idea out of the water with more insightful comments...
>
> Since tenured can grow into the space that the to-space is likely to
> also want, preference towards tenured deflates survivor, which in turn
> puts more pressure on tenured, which in turn deflates survivor even
> further; wash, rinse, repeat, Full GC to clean up. Since this didn’t
> happen with the (older) generational collectors, it seems as if it’s
> possible that everything could be ok if a minimum to-space size could
> be maintained.

Not sure what you mean by "preference towards tenured": of course the
generations share the same total memory, and if one of them gets
unusually large, the other needs to adapt. G1 already has a minimum
young gen size (which is, as you can see, not a hard minimum), but in
this degenerate situation it simply thinks: "okay, let's try limping
along with less than that because we can. Maybe we are lucky, the
application gets back to normal within the next few seconds, and so we
can avoid the full gc".

The alternative would basically be to start a fully compacting gc
earlier (the other collectors would do that, btw), as soon as the space
statically allocated to the old gen runs out.

Do you think this is preferable behavior?
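For reference, assuming JDK 8, the knobs involved here are roughly the following; the values shown are the defaults as far as I remember, and the first two are experimental flags, hence the unlock:

    -XX:+UnlockExperimentalVMOptions
    -XX:G1NewSizePercent=5       (soft lower bound on young gen size, as a percent of the heap)
    -XX:G1MaxNewSizePercent=60   (upper bound on young gen size, as a percent of the heap)
    -XX:G1ReservePercent=10      (share of the heap kept back to reduce the risk of to-space exhaustion)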

> > I added rough linear interpolations of the heap usage after gc (so
> > that the slope of these lines corresponds to the promotion rates).
> > I can see a large, significant difference in the slopes between the
> > collection cycles before the full gc event (black lines) and the
> > full gc event (red line), while all black ones are roughly the
> > same. :)
> >
> > Note that my graphs were painted off-hand without any actual
> > calculation, and in particular the red one looks like an
> > underestimation of the slope.
> >
> > > As I look at all of the views in Censum I see nothing outstanding
> > > that leads me to believe that this Full is a by-product of some
> > > interaction between the collector and the application (some form
> > > of zombies????). Also, one certainly cannot rule out your
> > > speculation
> >
> > It does not look like there is an issue, e.g. with
> > j.l.ref.References of any kind.
>
> LinkedList used to be susceptible to nepotism, and all generational
> collectors suffer from zombies.

Maybe. I did not preclude more issues here :)

[...]
> > >
> > > I think the answer right now is: we need more data.
> >
> > Agree.
>
> It’s on its way!! Yay for more data.
> You might be interested in this for visualization:
> https://github.com/kcpeppe/regions
> It’s a wee bit crude at the moment, but I’ll accept pull requests for
> any refinements.
>

Thanks,
Thomas

-- 
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel

projects :
* https://signs-web.herokuapp.com/