Thanks for your continuing interest in our issue!

I have been firefighting another issue with a user sending a bit too much traffic our way. Good news: this allowed us to tune our throttling and will probably result in a slightly smoother load in the future, which can only help...

I have prepared a change with your suggested improvements [1], but I will wait until next Monday to deploy it. I'll send the new logs as soon as I have them.

Kirk suggested earlier doubling the size of the heap (from 16GB to 32GB). I have not yet implemented that suggestion. Do you think it makes sense to bundle that change with the changes suggested by Thomas? Or should I keep it for later?

Thanks again for your help!

[1] https://gerrit.wikimedia.org/r/#/c/385364/

On 20 October 2017 at 14:45, Kirk Pepperdine <kirk@kodewerk.com> wrote:
> On Oct 20, 2017, at 1:41 PM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
>
> Hi all,
>
> On Tue, 2017-10-17 at 23:48 +0200, Guillaume Lederrey wrote:
>> Quick note before going to bed...
>>
>> On 17 October 2017 at 23:28, Kirk Pepperdine <kirk@kodewerk.com>
>> wrote:
>>> Hi all,
>>> [...]
>>> This log looks different in that the mixed collections are actually
>>> recovering space. However, there seems to be an issue with RSet
>>> update times just as heap occupancy jumps, though I would view this
>>> as a normal response to increasing tenured occupancies. The spike
>>> in tenured occupancy does force young to shrink to a size that
>>> should see “to-space” with no room to accept incoming survivors.
>>>
>>> Specific recommendations: the app is churning through enough weak
>>> references that it would benefit from parallelizing reference
>>> processing (off by default); I would double max heap and limit the
>>> shrinking of young to 20% to start with (default is 5%).
>>>
>>
>> I'll double max heap tomorrow. Parallel ref processing is already
>> enabled (-XX:+ParallelRefProcEnabled), and young is already limited
>> to max 25% (-XX:G1MaxNewSizePercent=25). I'll add
>> -XX:G1NewSizePercent=20 (if that's the correct option).
>
> Did that help?
>
> I am not convinced that increasing the min young gen helps, as it will
> only lengthen the time between mixed gcs, which potentially means that
> more data could accumulate to be promoted, but the time goal within the
> collection (the amount of memory reclaimed) will stay the same.
> Of course, if increasing the eden gives the objects in there enough
> time to die, then it's a win.

In my experience promotion rates are exacerbated by an overly small young gen (which translates into an overly small to-space). In these cases I believe it only adds to the overall pressure on tenured and is part of the reason why the full GC recovers as much as it does. Not promoting has the benefit of not requiring a mixed collection to clean things up. Thus larger survivors can still play a positive role, as they do in generational collectors. Mileage will vary with each application.
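
For what it's worth, a minimal sketch of the young gen sizing options being discussed (the 20%/25% values are simply the ones mentioned in this thread, not something I have validated against your workload):

  # keep young gen between 20% and 25% of the heap
  -XX:+UnlockExperimentalVMOptions
  -XX:G1NewSizePercent=20
  -XX:G1MaxNewSizePercent=25

If I remember correctly, both G1NewSizePercent and G1MaxNewSizePercent are still experimental flags on JDK 8, so they only take effect when -XX:+UnlockExperimentalVMOptions is also set.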
>
> The problem with that is that during the time from start of marking to
> the end of the mixed gc, more data is promoted than reclaimed ;)

Absolutely… and this is a case of the tail wagging the dog. An overly small young gen results in premature promotion, which results in more pressure on tenured, which in turn results in more GC activity in tenured. GC activity in tenured is still to be avoided unless it shouldn’t be avoided.
>
> One problem is the marking algorithm G1 uses in JDK8 which can overflow
> easily, causing it to restart marking ("concurrent-mark-reset-
> for-overflow" message). [That has been fixed in JDK9]
>
> To fix that, set -XX:MarkStackSize to the same value as
> -XX:MarkStackSizeMax (i.e. -XX:MarkStackSize=512M
> -XX:MarkStackSizeMax=512M - probably a bit lower is fine too, and since
> you set the initial mark stack size to the same as max I think you can
> leave MarkStackSizeMax off from the command line).

This is great information. Unfortunately there isn’t any data to help anyone understand what a reasonable setting should be. Would it also be reasonable to double the mark stack size when you see these failures? Also, is the max size of the stack bigger if you configure a larger heap?
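
One way to at least see what the ergonomics pick for a given configuration (this only inspects the resulting defaults, it does not answer the sizing question) would be something like:

  java -Xmx32g -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep MarkStackSize

which prints the final values of MarkStackSize and MarkStackSizeMax after the JVM has applied its own adjustments.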
>
> I do not think region liveness information is interesting any more
> (-XX:+G1PrintRegionLivenessInfo), so you could remove it again.

+1, sorry I forgot to mention this… although having a clean run (one without failures) with the data would be intellectually interesting.
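
Just to put the whole thread in one place, my understanding is that the option set under discussion now looks roughly like this (assuming the heap doubling is bundled in, which is still an open question, and with -XX:+G1PrintRegionLivenessInfo dropped):

  -Xmx32g
  -XX:+UseG1GC
  -XX:+ParallelRefProcEnabled
  -XX:+UnlockExperimentalVMOptions
  -XX:G1NewSizePercent=20
  -XX:G1MaxNewSizePercent=25
  -XX:MarkStackSize=512M
  -XX:MarkStackSizeMax=512M

Treat this as a sketch of what has been discussed here, not as a validated configuration.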
Kind regards,
Kirk

--
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel

projects :
* https://signs-web.herokuapp.com/