Strange G1 behavior
Kirk Pepperdine
kirk.pepperdine at gmail.com
Mon Oct 16 18:07:15 UTC 2017
Hi Thomas,
Again, thank you for the detailed response.
> On Oct 16, 2017, at 1:32 PM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>
> Hi Kirk,
>
> On Fri, 2017-10-13 at 15:23 +0200, Kirk Pepperdine wrote:
>> Hi Thomas,
>>
>> Thanks for this very detailed analysis. Where to start…
>>
>> There is no disputing your analysis of this GC log as it pretty
>> much matches my thoughts on what is happening. My question is
>> centered around my experiences running into this spiking behavior in
>> many different applications across many different domains. A common
>> artifact prior to the Full is seeing survivor resized to 0B. In fact
>> if you see survivor be resized to 0B there doesn’t seem to be a way
>> to avoid the full. (Survivor X->0.0B appears to be a side effect of
>> the root of the problem).
>>
>> I guess it is possible to have the same conditions appearing in many
>> applications running across many domains that result in this
>> identical behavior. That said, I guess I’m very lucky that I am
>> running into the accidental timing issue as frequently as I am. It
>> feels like the question I want to ask is: is there a common coding
>> pattern that all the applications share that is aggravating the
>> garbage collector in a nasty way?
>
> Given the current information I have, the application's live set and
> allocation rate simply seems too high to support current GC settings
> temporarily.
>
> For the allocation rate, please compare the slopes of heap usage after
> (young) gcs during these spikes (particularly in that full gc case) and
> normal operation.
Censum estimates allocation rates, and this is a metric I routinely evaluate.
This log shows a spike at 10:07 which correlates with the Full GC event. However, the allocation rates, while high, are well within values I’ve seen in many other applications that are well behaved. Censum also estimates rates of promotion, and those look exceedingly high at 10:07. That said, there are spikes just after 10:10 and around 10:30 which don’t trigger a Full. In both cases the estimated allocation rates are high, though the estimated promotion rates, while high, are not as high as those seen at 10:07.
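For what it’s worth, the estimate amounts to the slope comparison you describe: everything allocated between two young collections shows up as the rise from one GC’s end occupancy to the next GC’s start occupancy. A minimal sketch of that arithmetic (the GcEvent shape here is purely illustrative, not Censum’s actual model):

    final class AllocRate {
        // One young-GC log entry; field names are illustrative.
        static final class GcEvent {
            final double timestampSec;    // JVM uptime at GC start
            final double heapBeforeGcMB;  // occupancy when the GC began
            final double heapAfterGcMB;   // occupancy when the GC ended
            GcEvent(double t, double before, double after) {
                timestampSec = t; heapBeforeGcMB = before; heapAfterGcMB = after;
            }
        }

        // Allocation rate between two consecutive young GCs: the heap rose
        // from prev's end occupancy to curr's start occupancy, all of it
        // application allocation, over the elapsed wall-clock interval.
        static double mbPerSec(GcEvent prev, GcEvent curr) {
            double allocatedMB = curr.heapBeforeGcMB - prev.heapAfterGcMB;
            double elapsedSec  = curr.timestampSec - prev.timestampSec;
            return allocatedMB / elapsedSec;
        }
    }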
All in all, nothing here seems out of the ordinary, and while I want you to be right about the waste and PLAB behaviors, these spikes feel artificial; i.e., I still want to blame the collector for not being able to cope with some aspect of application behavior that it should be able to cope with, something other than a high allocation rate with low recovery due to data simply being referenced and therefore not eligible for collection.
>
> In this application, given the information I have, roughly every
> 1500s there seems to be some component in the application that
> allocates a lot of memory in a short time, and holds onto most of it
> for its duration.
Sorry, but I’m not seeing this pattern in either the occupancy-after-collection or allocation-rate views. What I do see is a systematic loss of free heap over time (a slow memory leak? the effects of caching?).
As I look at all of the views in Censum, I see nothing outstanding that leads me to believe this Full is a by-product of some interaction between the collector and the application (some form of zombies?). Also, one certainly cannot rule out your speculation about heap fragmentation in PLABs. I simply don’t have the data to say anything about that, though I know it can be a huge issue. What I can say is that even if there is 20% waste, it can’t account for the amount of memory being recovered; unless, that is, there is a blast of barely humongous allocations taking place. I’d like to think this is a waste issue, but I’m suspicious. I’m also suspicious that it’s simply the application allocating in a burst and then releasing. If that were the case, I’d expect a much more gradual reduction in live-set size.
I think the answer right now is: we need more data. I’ll try to get the “client” to turn on the extra flags and see what that yields. I won’t play with PLAB sizing this go ‘round, if you don’t mind. If you’re right and it is a problem with waste, then the beer is on me the next time we meet.
The advice not to allocate arrays with sizes that are powers of 2 is an interesting comment. While there are clear advantages to power-of-two array sizes, I believe those cases are specialized, and I don’t generally see people dogmatically allocating this way.
Thanks for the invaluable thoughts.
Kind regards,
Kirk
>
> Now, there can be other reasons for the high allocation rate in the old
> gen (that the heap usage graphs show), one is fragmentation in the
> PLABs (TLABs used during GC)/to-space during copying.
>
> I.e. with high PLAB waste/fragmentation you potentially need more space
> than necessary in the old gen. In addition to that, these mostly empty
> regions are not going to be evacuated in the current mixed gc cycle
> again as G1 does not have any liveness information about them except
> "completely full". So you potentially waste tons (and I know of at
> least one internal application that suffered a lot from that) of
> memory.
>
> That could explain why the full gc can free so much memory.
>
> However there is no exact way in jdk8 to get information about that -
> some of that information can be inferred from -XX:+PrintPLAB; for
> testing purposes you could also try simply disabling PLAB resizing (-
> XX:-ResizePLAB), set "reasonable" PLAB sizes manually and see if the
> old gen allocation rate decreases during these spikes.
>
> Again, in JDK9 we significantly improved this area, both about PLAB
> resizing and reporting.
> We have seen significant improvements on memory usage during gc and
> decreases in gc frequency (and actually significant improvements on
> throughput/latency because of that) on some internal applications.
>
> Please look at the task JDK-8030849, and in particular the resolved
> changes referenced there.
>
> If you ask about common coding patterns that can avoid issues with
> humongous object fragmentation, there are (to me at least ;) "easy"
> ones: initialize your arrays with non-2^n element count values
> (unlike what the Java collections do), or use a different exponential
> function than doubling to increase the size of your arrays (or think
> a bit about the expected sizes of your containers, to spend fewer CPU
> cycles on doubling their memory backing, potentially improving
> overall application throughput).
>
> I.e. an initial array size of e.g. 16 ultimately results in an object
> size of 2^n bytes plus the object header size, which is enough to cause
> lots of wasted space as soon as your array gets humongous. If you used
> e.g. 15 as an initial value, everything would be fine from that POV.
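To make the arithmetic concrete (assuming a 1 MB region size and a 16-byte array header; both vary with the JVM configuration):

    final class HumongousWasteDemo {
        public static void main(String[] args) {
            // With 1 MB regions, any object of 512 KB or more is humongous.

            // Payload: 131072 * 8 = 1,048,576 bytes (exactly 1 MB). The
            // ~16-byte array header pushes the object just over 1 MB, so
            // it must span two full regions, wasting almost a whole one.
            long[] powerOfTwo = new long[131072];   // 2^17 elements

            // Payload: 131070 * 8 = 1,048,560 bytes. With the header it
            // fits exactly into a single 1 MB region; nothing is wasted.
            long[] justUnder = new long[131070];

            System.out.println(powerOfTwo.length - justUnder.length);
        }
    }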
>
> Providing a patch for the previously mentioned JDK-8172713 would also
> fix that - for future JDK versions only though.
>
>> And if so, is there something the collector could do to prevent the
>> Full? That it is reproducible in this particular case suggests that
>> the answer is yes.
>
> Depends on further information. If the application holds onto too much
> memory and allocates too quickly, there is ultimately nothing that GC
> can do.
>
> If it is e.g. the issue with PLAB fragmentation, then yes, and that
> issue has already been fixed in jdk9 (to a very large degree; as you
> can see from JDK-8030849, there are some ideas that were left out
> because they seemed unnecessary.)
>
> JDK 18.3 will get a parallel full gc to improve the worst-case
> situation a bit; in JDK9 the young gcs with to-space exhaustion should
> already be very fast (whereas I just saw in that log that with jdk8
> they still take on the order of seconds).
>
>> Have I been able to eliminate this behavior by “adjusting” the
>> application’s use of humongous objects? The answer to that is, I think
>> it has made things better but hasn’t eliminated it.
>
> That seems logical: improving humongous object space usage only delays
> the inevitable as it only frees some more memory the application can
> hold onto a bit longer.
>
>> Quite typically this problem takes several days to show up, which
>> makes testing tricky.
>
> At least in this case, the issue seems to be quite reproducible: there
> is something happening roughly every 1500s that causes these spikes.
>
>> So I have to ask the question: is it possible that humongous regions
>> are being ignored in such a way that they build up to create a
>> condition where a Full GC is necessary to recover from it?
>
> I kind of fail to understand the goal of this request: humongous
> objects are allocated by the Java application, and as such are part
> of the Java heap, and obviously counted against heap usage, wherever
> they are allocated. They will also always cause at least some kind of
> fragmentation waste wherever they are located.
>
> Note that without quantification of the impact of humongous object
> fragmentation waste in this particular instance I do not know whether a
> more clever humongous object sizing improves the situation
> significantly enough.
>
> It seems to exacerbate the situation though.
>
>> I have in my GitHub repository a project that displays the data from
>> -XX:+G1PrintRegionLivenessInfo. It was useful for visualizing
>> humongous allocation behavior.
>>
>> As for Java 9, not a real possibility for the moment. I can hope but
>> I’m not so confident after this find that it’s going to offer relief
>> for this particular issue.
>
> Thanks,
> Thomas
>