Strange G1 behavior

Thomas Schatzl thomas.schatzl at oracle.com
Fri Oct 13 12:12:20 UTC 2017


Hi Kirk,

On Wed, 2017-10-11 at 19:05 +0200, Kirk Pepperdine wrote:
> Hi,
> 
> I should add that Charlie Hunt added some invaluable information in
> our Friend of jclarity list that I should add here. He looked at the
> adaptive sizing records that are in the log and came to the
> conclusion that the regions in tenured were not collected because
> they are over the 85% live mark. According to the AS records this is
> true... but then one has to ask, how does a full GC recover so much
> memory if there are so many tenured regions that are more than
> 85% full? I have other logs from older JVMs where heap occupancy
> after each mixed collection is greater than the previous collection
> occupancy... and then you get the full that collects everything. I

I suspect... the application is the main cause for the full gc here -
and bad timing.

> didn’t spam the list with those logs because they didn’t have print
> adaptive sizing turned on and they were from older JVMs, meaning the
> problem may have already been solved. This JVM is a 141 build, so
> while it behaves somewhat differently, it’s the same spike up until
> only a full can help, run the full, and life goes on as if nothing
> happened.

If you look at the previous cycles of allocation and reclamation, you
can see that most of them peak at 6G total heap.

There are a few spikes in the heap occupancy graph where, during a
short interval, more than 6G of memory is in use temporarily. That
means memory usage spikes a few times in succession, with concurrent
marking running all the time.

Here are the locations of the peaks within the log:

- ~227938.857: peaking at 6.9G; after that young gc a single concurrent
mark cleans out 1G of completely empty memory.
- ~228085.205: peaking at 7G; again a following concurrent mark cleans
out 3G of completely empty memory.
- ~240985.106: the 12G peak with the full gc.
- ~242363.281: an 8.6G peak.
- ~246666.233: a 6.5G peak.
- ~249968.146: a 6.5G peak.
- ~251480.608: a 7.5G peak.

In all or most of these cases, either the Cleanup pause or a mixed GC
can reclaim lots of memory in between, but memory usage spikes again
and again, repeating this behavior a few times until things settle
down again.

All of this is sprinkled with initial marks caused by humongous
allocations.

This leads to my hypothesis that humongous objects are potentially an
issue here, in combination with the application simply doing lots of
large allocations and holding on to them for a long time.

Let me explain a few (messy) heap usage graphs that show my reasoning:

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-overview.png
shows an overview of heap usage after GC over the whole log file,
split by GC/GC reason.

You can easily spot the live set spikes in that graph, including the
one that led to the full gc.

Another interesting line here is the brownish one with the triangles,
showing heap usage for initial mark gcs caused by humongous
allocations. As you can see there, the application uses a considerable
number of humongous objects, which may be the cause of the large memory
usage - i.e. if they are not ideally sized.

Btw, these spikes seem to be somewhat regular, at least from around the
middle of the log - could it be some "result flushing" occurring every
1500 seconds or so?

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-example.png
shows the heap usage graphs for the spike next to the full gc.

It shows that
- there are occasionally phases of the application where a lot of
memory is in use for a minute or so (e.g. from second 242300 to
242380).
- during this phase, mixed gc can clean up quite a bit every time (pink
triangles), but apparently the application is also often allocating
faster than G1 can reclaim (which causes the full gc in the other
case).
- a lot of the marking is initiated by humongous allocation (brown
triangles).
- the cleanup pause after marking can often decrease memory usage quite
a bit by itself (green squares), which indicates humongous objects
dying fast (cleanup reclaims all completely empty regions, and
typically these are humongous objects).

I can't see any unusual issue with G1, apart from the fact that it
issues two useless young gcs with an eden size of zero - compared to
the following full gc they are probably negligible ;) (this is
https://bugs.openjdk.java.net/browse/JDK-8165150)

So my suggestion is to look at whether something can be done about the
humongous objects here, i.e. check whether the waste caused by them is
large, and if so, try to decrease it.

There does not seem to be a significant, if any, fragmentation issue
with humongous objects here, and whether marking or eager reclaim
catches them has the same effect, although eager reclaim is cheaper.
(I think I saw some occurrences of eager reclaim.)

Before suggesting potential fixes and optimizations, I recommend
verifying this hypothesis ;)

The simplest and least intrusive way (to me) to look at humongous
object sizes is to enable -XX:+G1TraceEagerReclaimHumongousObjects (it
is an experimental option) to get information about humongous objects
at every GC; if you prefer somewhat less frequent output, use
-XX:+G1PrintRegionLivenessInfo, which prints at the end of every
concurrent marking. Look at the occupancy of the last region of a set
of regions and see if there is lots of emptiness in there.
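
For example, the more verbose variant could look roughly like this
(just a sketch of the syntax - your existing options stay as they are,
and since the flag is experimental it has to be unlocked first):

  java -XX:+UnlockExperimentalVMOptions \
       -XX:+G1TraceEagerReclaimHumongousObjects \
       <your existing GC and logging options> <application>

For the per-marking summary, add -XX:+G1PrintRegionLivenessInfo
instead (depending on the build this one may additionally need
-XX:+UnlockDiagnosticVMOptions).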

You can also do heap dumps with jmap and analyze the arrays with e.g.
MAT, but that is very intrusive.
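
For reference, a typical invocation for that route would be something
like the following (the file name and <pid> are placeholders; note
that the "live" option forces a full GC on its own, which is part of
why this is intrusive):

  jmap -dump:live,format=b,file=heap.hprof <pid>

The resulting dump can then be opened in MAT and sorted by retained
size to find the large arrays.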

_If_ there is lots of waste at the end of humongous regions, you can do
one of the following:

- fix the application to not use such awkward array sizes ;) I.e. the
typical double-the-array-size algorithm can create lots of waste for
G1, as it creates objects of 2^x + 16 bytes which just don't fit into
regions (see the sketch after this list).

Maybe, by sizing these arrays better in advance (e.g. if you have an
approximate idea of their final size), you can actually avoid lots of
memory allocations and copying of array contents, which is not free
either (independently of the chosen collector), and get a significant
boost for your application.

- increase the heap region size using -XX:G1HeapRegionSize to have
fewer humongous objects, so that there are not that many huge arrays
wasting space. This one is probably the best short-term fix to get rid
of a lot of humongous object allocations.

- increase the heap

- contributions for https://bugs.openjdk.java.net/browse/JDK-8172713
are welcome.
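
To make the first point a bit more concrete, here is a minimal,
hypothetical Java sketch of grow-by-doubling versus pre-sizing a
buffer; the element count and the 32MB region size in the comments are
made-up illustration values, not taken from your logs:

import java.util.Arrays;

// Hypothetical sketch: grow-by-doubling vs. pre-sizing a buffer.
public class BufferSizing {

    // Typical double-the-size growth: capacities are powers of two, so
    // the final backing array is 2^x elements plus ~16 bytes of array
    // header. With e.g. 32MB regions, a byte[1 << 25] (32MB payload
    // plus header) no longer fits a single region and mostly wastes a
    // second one.
    static byte[] appendDoubling(byte[] buf, int used, byte b) {
        if (used == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2); // reallocate and copy
        }
        buf[used] = b;
        return buf;
    }

    public static void main(String[] args) {
        int expected = 20_000_000; // made-up, approximately known final size

        // Doubling from a small initial capacity ends at 2^25 = 33554432
        // elements for 20M bytes of payload: ~13MB of the last array is
        // never used, plus all the intermediate copies.
        byte[] doubled = new byte[16];
        int used = 0;
        for (int i = 0; i < expected; i++) {
            doubled = appendDoubling(doubled, used++, (byte) i);
        }

        // Pre-sizing allocates once: no copying, no power-of-two tail.
        byte[] preSized = new byte[expected];
        for (int i = 0; i < expected; i++) {
            preSized[i] = (byte) i;
        }

        System.out.println(doubled.length + " vs " + preSized.length);
    }
}

If pre-sizing is not an option, the region size bullet above
corresponds to something like -XX:G1HeapRegionSize=32m on the command
line (G1 region sizes are powers of two between 1m and 32m), at the
cost of coarser-grained region management.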

Other issues:

- there are marking failures, i.e. G1 can't complete marking and needs
to restart it, particularly during the ramp-up of heap usage that leads
to the full gc.
I do not think it makes a difference for that particular situation, but
it might be awkward in others.
Increase -XX:MarkStackSize/-XX:MarkStackSizeMax to something larger to
avoid the "concurrent-mark-reset-for-overflow" messages (an example
command line follows below).
Or switch to jdk9 where this issue most likely will not occur.

- the time from end of marking to reclaiming memory is a bit long and
might be problematic sometimes. The only fix is to upgrade to jdk9,
which I highly recommend. (Also because the logging in jdk8 is a mess
;))

- the (by default) selected IHOP value seems too low; the application
wastes 6G of heap. Again, either update to jdk9 and profit from the
adaptive IHOP (and other improvements), or increase the IHOP manually
(-XX:InitiatingHeapOccupancyPercent) for increased throughput.
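
As a purely illustrative sketch of how the mark stack and IHOP settings
from the two items above would appear on the command line - the
concrete numbers are placeholders to show the syntax, not
recommendations, and would need to be derived from your own logs:

  java -XX:MarkStackSize=16M \
       -XX:InitiatingHeapOccupancyPercent=60 \
       <your existing options> <application>

(The IHOP default is 45, i.e. a concurrent cycle is started once heap
occupancy reaches 45%.)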

Note that increasing the IHOP makes no real difference in the frequency
of the full gcs as it does not change how much and how long the
application holds on to memory.
In your application, except in the cases when these spikes occur,
marking takes like 300ms max (see
http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-concmark.png).

In case of the spikes, particularly with the repeated markings, it
takes up to around 9s, and the long marking times coincide with the
spikes. I assume that during these times a single very large array of
java.lang.Object serializes marking progress.

These spikes should be much more muted in jdk9 too :)

Thanks,
  Thomas

P.S.: Did I already mention to try out jdk9? ;)



