Hi all,

I’ve run the data through my crude visualization tool. I’ve made a screencast for those who don’t want to go through the pain of cloning it and sorting out how to use the tooling.

https://drive.google.com/a/jclarity.com/file/d/0B6IuyTuhCQTPUjNyejVGN1U5Vms/view?usp=sharing

This log looks different in that the mixed collections are actually recovering space. However, there seems to be an issue with RSet update times just as heap occupancy jumps, though I would view this as a normal response to increasing tenured occupancy. The spike in tenured occupancy does force young to shrink to a size where “to-space” should have no room to accept incoming survivors.

Specific recommendations: the app is churning through enough weak references that it would benefit from parallelized reference processing (off by default). I would also double the max heap and, to start with, limit the shrinking of young to 20% (the default is 5%).

Kind regards,
Kirk

On Oct 17, 2017, at 2:36 PM, Guillaume Lederrey <guillaume.lederrey@gmail.com> wrote:

It looks like a 20MB attachment isn’t the way to go... So here is the log: https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0

On 17 October 2017 at 14:21, Guillaume Lederrey <guillaume.lederrey@gmail.com> wrote:

Hello!

Thanks a lot for your analysis! I have to admit that I can barely follow your train of thought.

Here is another GC log after adding "-XX:+G1PrintRegionLivenessInfo". This is only a few hours’ worth of data; I’ll send additional logs as they are generated.

From what I understand of your discussion, we also need to take a deeper look at why our application is allocating that much memory and what we can do to prevent this. My very high-level guess: we have some clients running more expensive queries. Since our SPARQL endpoint is publicly accessible, that would mean we need to find a way to proactively estimate the memory cost of a query and drop the ones that are too expensive. Not an easy problem either...

Thanks again!

  Guillaume

On 17 October 2017 at 13:31, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:

Hi,
<span class=""><br class="">
On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:<br class="">
> > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl@oracle<br class="">
> > .com> wrote:<br class="">
> ><br class="">
> > Hi Kirk,<br class="">
> ><br class="">
> > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:<br class="">
> > > Hi Thomas,<br class="">
> > ><br class="">
> > > Again, thank you for the detailed response.<br class="">
> > ><br class="">
> > ><br class="">
> > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl<br class="">
</span><div class=""><div class="gmail-m_-2838348212345660517h5">> > > > <<a href="mailto:thomas.schatzl@oracle.com" target="_blank" class="">thomas.schatzl@oracle.com</a>> wrote:<br class="">
> > > >
> > > > For the allocation rate, please compare the slopes of heap
> > > > usage after (young) gcs during these spikes (particularly in
> > > > that full gc case) and normal operation.
> > >
> > > Censum estimates allocation rates, as this is a metric that I
> > > routinely evaluate.
> > >
> > > This log shows a spike at 10:07 which correlates with the Full GC
> > > event. However, the allocation rates, while high, are well within
> > > values I’ve seen with many other applications that are well
> > > behaved. Censum also estimates rates of promotion and those seem
> > > exceedingly high at 10:07. That said, there are spikes just after
> > > 10:10 and around 10:30 which don’t trigger a Full. In both cases
> > > the estimates for allocation rates are high, though the estimates
> > > for rates of promotion, while high, are not as high as those seen
> > > at 10:07.
> > >
> > > All in all, nothing here seems out of the ordinary, and while I
> > > want you to be right about the waste and PLAB behaviors, these
> > > spikes feel artificial, i.e. I still want to blame the collector
> > > for not being able to cope with some aspect of application
> > > behavior that it should be able to cope with.. that is, something
> > > other than a high allocation rate with low recovery due to data
> > > simply being referenced and therefore not eligible for
> > > collection...
> >
> > I always meant "promotion rate" here when saying allocation rate.
> > For this discussion (and in general), in a generational collector
> > the application's real allocation rate is usually not very
> > interesting.
> >
> > Sorry for being imprecise.
>
> No worries, I get that promotion rates are very important, but when
> deciding what needs to be tuned, allocation rates are also very
> important, so I try to be very precise here, maybe too precise at
> times.

Let's qualify this further: for the purpose of investigating this full
gc :)
<span class=""><br class="">
> > > > In this application, given the information I have, every like<br class="">
> > > > 1500s, there seems to be some component in the application that<br class="">
> > > > allocates a lot of memory in a short time, and holds onto most<br class="">
> > > > of<br class="">
> > > > it for its duration.<br class="">
> > ><br class="">
> > > Sorry but I’m not seeing this pattern either in occupancy after<br class="">
> > > or<br class="">
> > > allocation rate views. What I do see is a systematic loss of free<br class="">
> > > heap over time (slow memory leak??? Effects of caching???).<br class="">
> ><br class="">
> > Let's have a look at the heap usage after gc over time for a few<br class="">
> > collection cycle before that full gc event. Please look at<br class="">
> > <a href="http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-" rel="noreferrer" target="_blank" class="">http://cr.openjdk.java.net/~ts<wbr class="">chatzl/g1-strange-log/strange-<wbr class="">g1-</a><br class="">
</span>> > promo.png which just shows a few of these.<br class="">
<span class="">><br class="">
> Sorry but I can’t see these charts.<br class="">

My email client added some line breaks in the URL because it was too
long, sorry. One option is to manually stitch it together; all graphs
so far are collected in

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
strange-g1-promo.png image.

(and cr.openjdk.java.net is live and kicking)
<span class=""><br class="">
> The issue with promotion rates is that you have to ask, is this the<br class="">
> dog wagging the tail, or the tail wagging the dog. In this case I<br class="">
> think it’s the tail wagging the dog. Would it be possible that a<br class="">
> significant portion of the data being promoted was directly due to<br class="">
> to-space being resized to 0?<br class="">
<br class="">
</span>No. At that point the full gc has already been inevitable from my<br class="">
understanding of the log. If you look at the log, the second to-space<br class="">
exhaustion before the full gc does not, as assumed in the policy, give<br class="">
back any free regions (typically they do), so there is no point in<br class="">
restarting the application. Finally some time later it performs the<br class="">
full gc.<br class="">
<br class="">
Overall, in my understanding there seems to be a basic out-of-memory<br class="">
situation here: the application's live set (plus any overheads caused<br class="">
by memory management) during that application phase is too much for the<br class="">
given available memory for G1.<br class="">
<br class="">
Full GC can free up some of that extra space (and potentially the<br class="">
application already let go of some data at that time), allowing the<br class="">
application to continue until everything returns to normal.<br class="">
<br class="">
That a GC algorithm needs extra space to work well is not a secret, and<br class="">
is something the user needs to account for. Iirc some (rather old now)<br class="">
literature (on smaller heaps, with different applications) indicate<br class="">
that from ~33% of live set performance may start to degrade<br class="">
significantly. After that full gc we are already at ~66% of live<br class="">
data...<br class="">
<br class="">
So the behavior we are seeing now seems like expected (but of course<br class="">
undesirable) behavior of the GC algorithm to me.<br class="">
<br class="">
Now there is of course the question where that overhead comes from, and<br class="">
how it can be reduced as the memory used after full gc is quite a bit<br class="">
below the maximum heap. :)<br class="">
<br class="">
All that will probably be a bit clearer with more data.<br class="">
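
(Tangent, in case it helps to make the slope comparison concrete: a
minimal sketch of how the promotion rate could be estimated from
(timestamp, heap-occupancy-after-gc) pairs pulled out of the log. The
sample values are made up and the log parsing is not shown; comparing the
result for a normal collection cycle against the cycle leading into the
full gc should show the same difference as the black vs. red lines.)

  // Rough sketch: estimate the promotion rate as the least-squares slope
  // of heap occupancy after young gcs over time. All values below are
  // hypothetical; extracting them from the GC log is not shown.
  public class PromotionSlope {

      static double slopeMBPerSec(double[] seconds, double[] occupiedMB) {
          int n = seconds.length;
          double meanT = 0.0, meanO = 0.0;
          for (int i = 0; i < n; i++) {
              meanT += seconds[i];
              meanO += occupiedMB[i];
          }
          meanT /= n;
          meanO /= n;
          double num = 0.0, den = 0.0;
          for (int i = 0; i < n; i++) {
              num += (seconds[i] - meanT) * (occupiedMB[i] - meanO);
              den += (seconds[i] - meanT) * (seconds[i] - meanT);
          }
          return num / den; // MB promoted per second, roughly
      }

      public static void main(String[] args) {
          // Hypothetical heap-occupancy-after-gc samples over one cycle.
          double[] t        = {    0,   30,   60,   90 };  // seconds
          double[] occupied = { 4200, 4350, 4510, 4660 };  // MB after gc
          System.out.printf("~%.1f MB/s promoted%n", slopeMBPerSec(t, occupied));
      }
  }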
<span class=""><br class="">
> Sorry for being so slow but I think I’m starting to understand the<br class="">
> underlying pathology here. I’ll hypothesis here and you can blow the<br class="">
> whole idea out of the water with more insightful comments..<br class="">
><br class="">
> Since tenured can grow into the space that the to-space is likely to<br class="">
> also want, preference towards tenured deflates survivor which in turn<br class="">
> puts more pressure on tenured which in turn deflates survivor even<br class="">
> further, wash rinse, repeat, Full GC to clean up. Since this didn’t<br class="">
> happen with the (older) generational collectors, it seems as if it’s<br class="">
> possible that everything could be ok if a minimum to-space size could<br class="">
> be maintained.<br class="">
<br class="">
</span>Not sure what you mean with "preference towards tenured": of course the<br class="">
generations share the same total memory, and if one of them gets<br class="">
unusually large, the other needs to adapt. G1 already has a minimum<br class="">
young gen size (which is, as you can see, not a hard minimum), but in<br class="">
this degenerate situation it simply thinks: "okay, let's try limping<br class="">
along with less than that because we can. Maybe we are lucky, the<br class="">
application gets back to normal within the next few seconds and so can<br class="">
avoid full gc".<br class="">
<br class="">
The alternative would basically be to start a fully compacting gc<br class="">
earlier (the other collectors would do that btw) as soon as the space<br class="">
statically allocated to old gen runs out.<br class="">
<br class="">
Do you think this is preferable behavior?<br class="">
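
(On the practical side, the knobs touched on here and in Kirk's earlier
recommendation would look roughly as below on a JDK 8-era HotSpot. This
is only a sketch: the heap size and jar name are placeholders, not values
derived from this log, and G1NewSizePercent is an experimental flag on
JDK 8, hence the unlock option.)

  # sketch only: -Xmx16g and app.jar are placeholders
  java -XX:+UseG1GC \
       -Xmx16g \
       -XX:+ParallelRefProcEnabled \
       -XX:+UnlockExperimentalVMOptions \
       -XX:G1NewSizePercent=20 \
       -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -jar app.jar

Whether 20% is the right floor for the young gen is exactly the kind of
thing the additional logs should help answer.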
<span class=""><br class="">
> > I added rough linear interpolations of the heap usage after gc (so<br class="">
> > that the slope of these lines corresponds to the promotion rates).<br class="">
> > I can see a large, significant difference in the slopes between the<br class="">
> > collection cycles before the full gc event (black lines) and the<br class="">
> > full gc event (red line), while all black ones are roughly the<br class="">
> > same. :)<br class="">
> ><br class="">
> > Note that my graphs were painted off-hand without any actual<br class="">
> > calculation, and particular the red one looks like an<br class="">
> > underestimation of the slope.<br class="">
> ><br class="">
> > > As I look at all of the views in Censum I see nothing outstanding<br class="">
> > > that leads me to believe that this Full is a by-product of some<br class="">
> > > interaction between the collector and the application (some form<br class="">
> > > of zombies????). Also, one certainly cannot rule out your<br class="">
> > > speculation <br class="">
> ><br class="">
> > It does not look like there is an issue e.g. with<br class="">
> > j.l.ref.References of<br class="">
> > any kind.<br class="">
><br class="">
> LinkList used to be susceptible to nepotism and all generations<br class="">
> collectors suffer from zombies.<br class="">
<br class="">
</span>Maybe. I did not preclude more issues here :)<br class="">
<br class="">
[...]<br class="">
<span class="">> > ><br class="">
> > > I think the answer right now is; we need more data.<br class="">
> ><br class="">
> > Agree.<br class="">
><br class="">
> It’s on it’s way!! Yeah for more data.<br class="">
> You might be interested in this for visualization. <a href="https://github.co/" rel="noreferrer" target="_blank" class="">https://github<wbr class="">.co</a><br class="">
> m/kcpeppe/regions<br class="">
> It’s a wee bit crude at the moment but I’ll accept pull requests for<br class="">
> any refinements.<br class="">
><br class="">
<br class="">
</span>Thanks,<br class="">
Thomas<br class="">
<br class="">

-- 
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel

projects :
* https://signs-web.herokuapp.com/