Strange G1 behavior
Thomas Schatzl
thomas.schatzl at oracle.com
Fri Oct 20 11:41:51 UTC 2017
Hi all,
On Tue, 2017-10-17 at 23:48 +0200, Guillaume Lederrey wrote:
> Quick note before going to bed...
>
> On 17 October 2017 at 23:28, Kirk Pepperdine <kirk at kodewerk.com>
> wrote:
> > Hi all,
> > [...]
> > This log looks different in that the mixed collections are actually
> > recovering space. However, there seems to be an issue with RSet
> > update times just as heap occupancy jumps, though I would view this
> > as a normal response to increasing tenured occupancies. The spike
> > in tenured occupancy does force young to shrink to a size that
> > should see "to-space" with no room to accept incoming survivors.
> >
> > Specific recommendations: the app is churning through enough weak
> > references that it would benefit from parallelized reference
> > processing (off by default). I would double max heap and limit the
> > shrinking of young to 20% to start with (default is 5%).
> >
>
> I'll double max heap tomorrow. Parallel ref processing is already
> enabled (-XX:+ParallelRefProcEnabled), and young is already limited
> to max 25% (-XX:G1MaxNewSizePercent=25). I'll add
> -XX:G1NewSizePercent=20 (if that's the correct option).
Did that help?
I am not convinced that increasing the min young gen helps, as it will
only lengthen the time between mixed gcs, which potentially means that
more data can accumulate to be promoted, while the time goal within
each collection (and hence the amount of memory reclaimed) stays the
same.
Of course, if increasing the eden gives the objects in there enough
time to die, then it's a win.
From what I can see in the log, the promotion rate seems to be around
25% during the spike that leads to the full gc in the latest log (800M
young, +200M heap size increase per gc).
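As a rough sanity check on that estimate, using just the two figures
above:

    promoted per gc    ~200M  (heap / old gen growth per collection)
    young gen per gc   ~800M
    promotion rate     ~200M / 800M = ~25%

so roughly a quarter of each young gen ends up promoted during the
spike.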
The problem with that is that during the time from start of marking to
the end of the mixed gc, more data is promoted than reclaimed ;)
So I would recommend some measures to decrease marking time, and
others to decrease the interval between mixed gcs (at the cost of an
even worse promotion rate, but since I will also propose decreasing
the young gen during mixed gc, that should be a net win).
As for marking times, I already noticed that in the last log the
duration of marking goes up significantly during these spikes. This is
natural in that there are just more live objects and references to go
through, but in this case the difference seems to be way too large:
outside of these spikes marking takes <500ms, but during them it
quickly rises to 8000-9000ms.
One problem is the marking algorithm G1 uses in JDK8, whose mark stack
can overflow easily, causing it to restart marking (the
"concurrent-mark-reset-for-overflow" message). [That has been fixed in
JDK9.]
To fix that, set -XX:MarkStackSize to the same value as
-XX:MarkStackSizeMax (e.g. -XX:MarkStackSize=512M
-XX:MarkStackSizeMax=512M; probably a bit lower is fine too, and since
you would set the initial mark stack size to the same value as the
max, I think you can leave MarkStackSizeMax off the command line).
As for the general length of marking, it may be that during these
times of high activity the marking threads don't get enough CPU cycles
to complete their work quickly, as they compete with the application
threads. The default number for your machine is, I think, five. Try
setting -XX:ConcGCThreads=10, for example.
The other problem is that mixed gcs do not reclaim enough space from
old gen while the application is still copying into it. I would try to
significantly reduce the young gen during that time, e.g. by setting
G1NewSizePercent to something like 1, 2 or 3 and seeing what happens.
(I would probably start with -XX:G1NewSizePercent=1 just to see.)
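Putting these suggestions together, the extra options for the next run
would look roughly like this (the values are the starting points from
the paragraphs above, meant for experimentation, not tuned
recommendations):

    java ... \
      -XX:MarkStackSize=512M -XX:MarkStackSizeMax=512M \
      -XX:ConcGCThreads=10 \
      -XX:G1NewSizePercent=1 \
      ...

in addition to what you already have (-XX:+ParallelRefProcEnabled,
-XX:G1MaxNewSizePercent=25, etc.).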
Other findings:
- I do not think humongous objects are an issue here, although they
are exactly 2^n+16 bytes in size, as I suspected. The waste is around
150MB in total, which, given the heap size, does not seem too much;
see the illustration after this list. (Of course that memory could be
spent in better ways.)
- There does not seem to be a huge difference between what the marking
thinks is live and what is live after the full gc, so I do not think
the liveness analysis is incorrect here (of course, these are
snapshots of the heap at different times).
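To illustrate where that humongous waste comes from (the 4M region
size below is only an assumption for the example, not taken from your
logs): humongous objects are allocated in a series of whole regions,
so for instance

    object size:   2^22 + 16 bytes = 4M + 16 bytes
    regions used:  2 whole 4M regions = 8M
    waste:         ~4M per such object

i.e. the second region holds only those 16 bytes, and a relatively
small number of such objects can add up to something like the 150MB
above.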
If you would be so kind as to do another run with these suggestions
(in case Kirk's suggestions already fixed the issue), I would be happy
if you also enabled PLAB statistics to show whether fragmentation is
an issue, as we talked about before (-XX:+PrintPLAB). You can turn
that off again later.
I do not think region liveness information is interesting any more
(-XX:+G1PrintRegionLivenessInfo), so you could remove it again.
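So, just summarizing the two paragraphs above, the logging options for
the next run would change roughly like this:

    add:     -XX:+PrintPLAB
    remove:  -XX:+G1PrintRegionLivenessInfo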
Thanks,
Thomas