Strange G1 behavior
Thomas Schatzl
thomas.schatzl at oracle.com
Fri Oct 20 13:22:17 UTC 2017
Hi,
On Fri, 2017-10-20 at 14:45 +0200, Kirk Pepperdine wrote:
> > On Oct 20, 2017, at 1:41 PM, Thomas Schatzl <thomas.schatzl at oracle.
> > com> wrote:
> >
> > Hi all,
> >
> > On Tue, 2017-10-17 at 23:48 +0200, Guillaume Lederrey wrote:
> > > Quick note before going to bed...
> > >
> > > On 17 October 2017 at 23:28, Kirk Pepperdine <kirk at kodewerk.com>
> > > wrote:
> > > > Hi all,
> > > > [...]
> > > > This log looks different in that the mixed collections are
> > > > actually
> > > > recovering space. However there seems to be an issue with RSet
> > > > update times just as heap occupancy jumps though I would view
> > > > this
> > > > as a normal response to increasing tenured occupancies. The
> > > > spike
> > > > in tenured occupancy does force young to shrink to a size that
> > > > should see “to-space” with no room to accept in-coming
> > > > survivors.
> > > >
> > > > Specific recommendations; the app is churning using enough weak
> > > > references that your app would benefit from parallelizing
> > > > reference
> > > > processing (off by default), I would double max heap and limit
> > > > the
> > > > shrinking of young to 20% to start with (default is 5%).
> > > >
> > >
> > > I'll double max heap tomorrow. Parallel ref processing is already
> > > enabled (-XX:+ParallelRefProcEnabled), and young is already
> > > limited
> > > to max 25% (-XX:G1MaxNewSizePercent=25), I'll add
> > > -XX:G1NewSizePercent=20 (if that's the correct option).
> >
> > Did that help?
> >
> > I am not convinced that increasing the min young gen helps, as it
> > will only lengthen the time between mixed gcs, which potentially
> > means that more data could accumulate to be promoted, but the time
> > goal within the collection (the amount of memory reclaimed) will
> > stay the same. Of course, if increasing the eden gives the objects
> > in there enough time to die, then it's a win.
>
> In my experience promotion rates are exacerbated by an overly small
> young gen (which translates into an overly small to-space). In these
> cases I believe it only adds to the overall pressure on tenured and
> is part of the reason why the full GC recovers as much as it does.
> Not promoting has the benefit of not requiring a mixed collection to
> clean things up. Thus larger survivors can still play a positive
> role, as they do in generational collectors. Mileage will vary with
> each application.
Yes. However, as mentioned, in this case the death rate is already
quite low (1:4), so decreasing the young gen by >1/4th would even be a
win if everything needed to be promoted (I suggested decreasing the
young gen during mixed gc to 1/5th ;)).
In practice you probably won't get a 1:1 rate even with a very small
young gen.
> > The problem with that is that during the time from start of marking
> > to the end of the mixed gc, more data is promoted than reclaimed ;)
>
> Absolutely… and this is a case of the tail wagging the dog. An overly
> small young gen results in premature promotion, which results in more
> pressure on tenured, which results in more GC activity in tenured. GC
> activity in tenured is still to be avoided unless it shouldn’t be
> avoided.
The alternative is to force G1 to reclaim more per mixed gc using
G1MixedGCCountTarget - however that will impact the pause times (and
you can again counterbalance that with
G1HeapWastePercent/G1MixedGCLiveThresholdPercent). I did not suggest
that.
G1 heuristics could change these values dynamically according to heap
pressure of course.
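For concreteness, a command-line sketch of those knobs. The values and the app.jar name are made-up illustrations, not recommendations; note that G1MixedGCLiveThresholdPercent is an experimental flag and needs unlocking:

```shell
# Illustrative values only - tune against your own gc logs.
# A lower G1MixedGCCountTarget means fewer mixed gcs per cycle, so more
# reclaimed per pause (and longer pauses); G1HeapWastePercent and
# G1MixedGCLiveThresholdPercent counterbalance that.
java -XX:+UseG1GC \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1MixedGCCountTarget=4 \
     -XX:G1HeapWastePercent=10 \
     -XX:G1MixedGCLiveThresholdPercent=85 \
     -jar app.jar
```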
Actually, in this particular case I think it could be sufficient to
just make marking complete faster - I am not sure why it is that slow.
But since I wanted to decrease the overhead for Guillaume, I proposed
settings that I think are "safe".
If that is successful and there is interest, we can always iterate on
that.
Of course it does not fix the issue with the server accepting requests
that may generate an unbounded amount of live data, which is always bad
if you have a heap limit.
> > One problem is the marking algorithm G1 uses in JDK8, which can
> > overflow easily, causing it to restart marking ("concurrent-
> > mark-reset-for-overflow" message). [That has been fixed in JDK9]
> >
> > To fix that, set -XX:MarkStackSize to the same value as
> > -XX:MarkStackSizeMax (i.e. -XX:MarkStackSize=512M
> > -XX:MarkStackSizeMax=512M - probably a bit lower is fine too, and
> > since you set the initial mark stack size to the same as max I
> > think you can leave MarkStackSizeMax off from the command line).
>
> This is great information. Unfortunately there isn’t any data to help
> anyone understand what a reasonable setting should be. Would it also
For JDK8 the worst case is basically the number of references in your
largest j.l.O. (java.lang.Object) array times the max number of j.l.O.
arrays, iirc.
In JDK9, it is 1024 * the max number of j.l.O. arrays; also, in JDK9
work distribution between mark threads is much better, i.e. the mark
stack will be processed much faster.
(I did not think hard about both JDK's worst cases, so I may be
completely off).
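A back-of-the-envelope sketch of those two bounds. Both input numbers below are invented assumptions, purely for illustration:

```shell
# Made-up example inputs, not measurements.
largest_obj_array_len=1000000   # references in the largest Object[] (assumed)
obj_array_count=1000            # number of Object[] instances (assumed)

# JDK 8: every reference of every array may sit on the mark stack at once.
jdk8_entries=$((largest_obj_array_len * obj_array_count))

# JDK 9: arrays are pushed in fixed-size chunks, so roughly 1024 per array.
jdk9_entries=$((1024 * obj_array_count))

echo "JDK8 worst case: $jdk8_entries entries"
echo "JDK9 worst case: $jdk9_entries entries"
```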
As for when to increase the mark stack size: you do get a message
when/if the mark stack overflows... Also, G1 could simply increase the
mark stack size in that case and continue without restarting the
marking, provided growing the stack succeeds. There's a CR for that
somewhere; contributions welcome as usual (G1 already does that, iirc,
when there is a mark stack overflow during a GC pause).
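Following the suggestion quoted earlier, a command-line sketch (512M mirrors the value above; whether you need that much depends on your workload, and app.jar is a placeholder):

```shell
# Pre-size the global mark stack so JDK 8 marking does not have to grow
# it; a smaller value may well be enough for your application.
java -XX:+UseG1GC \
     -XX:MarkStackSize=512M \
     -XX:MarkStackSizeMax=512M \
     -jar app.jar
# Watch the gc log for "concurrent-mark-reset-for-overflow" to see
# whether the stack still overflows.
```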
> be reasonable to double the mark stack size when you see these
> failures? Also, is the max size of the stack bigger if you configure
> a larger heap?
Use JDK9 - there, except in rather unlikely situations, you should not
need to increase it.
Also, in JDK9 the transition from marking to mixed gcs is faster,
automatically decreasing the pressure.
Thanks,
Thomas
More information about the hotspot-gc-dev
mailing list