Strange G1 behavior
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Oct 16 11:32:18 UTC 2017
Hi Kirk,
On Fri, 2017-10-13 at 15:23 +0200, Kirk Pepperdine wrote:
> Hi Thomas,
>
> Thanks for this very detailed analysis. Where to start…
>
> There is no disputing your analysis of this GC log as it pretty much
> matches my thoughts on what is happening. My question is
> centered around my experiences running into this spiking behavior in
> many different applications across many different domains. A common
> artifact prior to the Full is seeing survivor resized to 0B. In fact
> if you see survivor being resized to 0B, there doesn’t seem to be a way
> to avoid the Full. (Survivor X->0.0B appears to be a side effect of
> the root of the problem).
>
> I guess it is possible to have the same conditions appearing in many
> applications running across many domains that result in this
> identical behavior. That said, I guess I’m very lucky that I am
> running into the accidental timing issue as frequently as I am. It
> feels like the question I want to ask is: is there a common coding
> pattern that all the applications share that is aggravating the
> garbage collector in a nasty way?
Given the current information I have, the application's live set and
allocation rate temporarily seem simply too high for the current GC
settings.
For the allocation rate, please compare the slopes of heap usage after
(young) gcs during these spikes (particularly in that full gc case)
with the slope during normal operation.
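(Purely illustrative numbers, not taken from your log: if heap
occupancy after consecutive young gcs climbs from, say, 4 GB to 10 GB
over 60s during a spike, that is an old gen growth rate of roughly
100 MB/s; the same slope measured during normal operation gives you
the baseline to compare against.)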
In this application, given the information I have, roughly every 1500s
there seems to be some component in the application that allocates a
lot of memory in a short time, and holds onto most of it for its
duration.
Now, there can be other reasons for the high allocation rate in the old
gen (that the heap usage graphs show); one is fragmentation in the
PLABs (the TLABs used during GC) / to-space during copying.
I.e. with high PLAB waste/fragmentation you potentially need more space
than necessary in the old gen. In addition to that, these mostly empty
regions are not going to be evacuated again in the current mixed gc
cycle, as G1 does not have any liveness information about them other
than "completely full". So you potentially waste tons of memory (and I
know of at least one internal application that suffered a lot from
that).
That could explain why the full gc can free so much memory.
However, there is no exact way in JDK 8 to get information about that -
some of that information can be inferred from -XX:+PrintPLAB; for
testing purposes you could also try simply disabling PLAB resizing
(-XX:-ResizePLAB), set "reasonable" PLAB sizes manually, and see if the
old gen allocation rate decreases during these spikes.
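For example, a test run along those lines could look roughly like this
(a sketch only: -XX:YoungPLABSize and -XX:OldPLABSize should be the
relevant flags for the manual sizing, and the values below are just
placeholders to experiment with, not recommendations):

  java -XX:+UseG1GC \
       -XX:+PrintPLAB \
       -XX:-ResizePLAB \
       -XX:YoungPLABSize=4096 -XX:OldPLABSize=1024 \
       <your usual options and main class>

Comparing the old gen allocation rate during the spikes with and
without these settings should show whether PLAB resizing/fragmentation
is a factor.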
Again, in JDK 9 we significantly improved this area, both in PLAB
resizing and in reporting.
We have seen significant improvements on memory usage during gc and
decreases in gc frequency (and actually significant improvements on
throughput/latency because of that) on some internal applications.
Please look at the task JDK-8030849, and in particular the resolved
changes referenced there.
If you ask about common coding patterns that can avoid issues with
humongous object fragmentation, there are (to me at least ;) "easy"
ones: initialize your arrays with non-2^n element counts (unlike what
the Java collections do), or use a growth factor other than doubling
when increasing the size of your arrays (or think a bit about the
expected sizes of your containers to spend fewer cpu cycles on doubling
their memory backing, potentially improving overall application
throughput).
I.e. an initial array size of e.g. 16 ultimately results in an object
size of 2^n bytes plus the object header size, which is enough to cause
lots of wasted space as soon as your array gets humongous. If you used
e.g. 15 as an initial value, everything would be fine from that POV.
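To make that concrete, here is a minimal sketch of the arithmetic
(assuming, purely for illustration, a 16 byte array header as on
typical 64-bit HotSpot with compressed oops, 8 byte object alignment,
1 MB G1 regions, and that a humongous object occupies whole regions;
the class and numbers are mine, not from your application):

public class HumongousWasteDemo {

    // Size in bytes of a long[] with the given element count, assuming
    // a 16 byte array header and 8 byte object alignment.
    static long arraySizeBytes(long elementCount) {
        long size = 16 + elementCount * 8;
        return (size + 7) & ~7L;
    }

    // Print how many whole regions a humongous array of that size would
    // occupy and how many bytes of those regions are wasted.
    static void report(long elements, long regionSize) {
        long objSize = arraySizeBytes(elements);
        long regions = (objSize + regionSize - 1) / regionSize;
        long waste = regions * regionSize - objSize;
        System.out.printf("%d elements: %d bytes -> %d region(s), %d bytes wasted%n",
                          elements, objSize, regions, waste);
    }

    public static void main(String[] args) {
        long regionSize = 1024 * 1024;  // assume 1 MB G1 regions

        // Doubling from an initial size of 16 reaches 16 * 2^13 elements:
        // the payload is exactly 1 MB, so the header pushes the object
        // just over a region boundary and a second region is needed.
        report(16 * 8192, regionSize);  // 2 regions, ~1 MB wasted

        // Doubling from 15 reaches 15 * 2^13 elements instead: the object
        // stays below 1 MB and fits a single region with little waste.
        report(15 * 8192, regionSize);  // 1 region, ~64 KB wasted
    }
}

The exact numbers shift with the region size, but an array whose total
size lands just above a power of two is the worst case for this kind of
waste.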
Providing a patch for the previously mentioned JDK-8172713 would also
fix that - for future JDK versions only though.
> And if so, is there something the collector could do to prevent the
> Full? That it is reproducible in this particular case suggests that
> the answer is yes.
Depends on further information. If the application holds onto too much
memory and allocates too quickly, there is ultimately nothing that GC
can do.
If it is e.g. the issue with PLAB fragmentation, then yes, and that
issue has already been fixed in JDK 9 (to a very large degree; as you
can see from JDK-8030849, there are some ideas that were left out
because they seemed unnecessary).
JDK 18.3 will get a parallel full gc to improve the worst-case
situation a bit; in JDK 9 the young gcs with to-space exhaustion should
already be very fast (I just saw that in that log with JDK 8 they also
take on the order of seconds).
> Have I been able to eliminate this behavior by “adjusting” the
> application’s use of humongous objects? The answer to that is, I think
> it has made it better but hasn’t eliminated it.
That seems logical: improving humongous object space usage only delays
the inevitable, as it merely frees some more memory that the application
can then hold onto a bit longer.
> Quite typically this problem takes several days to show up which
> makes testing tricky.
At least in this case, the issue seems to be quite reproducible: there
is something happening roughly every 1500s that causes these spikes.
> So I have to ask the question, is it possible that humongous regions
> are being ignored in such a way that they build up to create a
> condition where a Full GC is necessary to recover from it?
I kind of fail to understand the goal of this request: humongous
objects are allocated by the Java application, and as such are part of
the Java heap and obviously counted against heap usage, wherever they
are allocated. They will also always cause at least some kind of
fragmentation waste wherever they are located.
Note that without a quantification of the impact of humongous object
fragmentation waste in this particular instance, I do not know whether
more clever humongous object sizing would improve the situation
significantly enough. It does seem to exacerbate the situation, though.
> I have in my GitHub repository a project that displays the data from
> -XX:+G1PrintRegionLivenessInfo. It was useful for visualizing
> humongous allocation behavior.
>
> As for Java 9, not a real possibility for the moment. I can hope but
> I’m not so confident after this find that it’s going to offer relief
> for this particular issue.
Thanks,
Thomas