Strange G1 behavior
Kirk Pepperdine
kirk.pepperdine at gmail.com
Fri Oct 13 13:23:32 UTC 2017
Hi Thomas,
Thanks for this very detailed analysis. Where to start…
There is no disputing your analysis of this GC log as it pretty much matches my thoughts on what is happening. My question is centered around my experience of running into this spiking behavior in many different applications across many different domains. A common artifact prior to the Full is seeing survivor resized to 0B. In fact, if you see survivor resized to 0B, there doesn’t seem to be a way to avoid the full. (Survivor X->0.0B appears to be a side effect of the root of the problem.)
I guess it is possible to have the same conditions appearing in many applications running across many domains that result in this identical behavior. That said, I guess I’m very lucky to be running into this accidental timing issue as frequently as I am. It feels like the question I want to ask is: is there a common coding pattern that all the applications share that is aggravating the garbage collector in a nasty way? And if so, is there something the collector could do to prevent the Full? That it is reproducible in this particular case suggests that the answer is yes. Have I been able to eliminate this behavior by “adjusting” the application’s use of humongous objects? The answer to that is, I think it has made things better but hasn’t eliminated the problem. Quite typically this problem takes several days to show up, which makes testing tricky.
So I have to ask the question, is it possible that humongous regions are being ignored in such a way that they build up to create a condition where a Full GC is necessary to recover from it?
I have in my GitHub repository a project that displays the data from -XX:+G1PrintRegionLivenessInfo. It was useful for visualizing humongous allocation behavior.
As for Java 9, it’s not a real possibility for the moment. I can hope, but after this find I’m not so confident that it’s going to offer relief for this particular issue.
Kind regards,
Kirk
> On Oct 13, 2017, at 2:12 PM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>
> Hi Kirk,
>
> On Wed, 2017-10-11 at 19:05 +0200, Kirk Pepperdine wrote:
>> Hi,
>>
>> I should add that Charlie Hunt added some invaluable information in
>> our Friend of jclarity list that I should add here. He looked at the
>> adaptive sizing records that are in the log and came to the
>> conclusion that the regions in tenured were not collected because
>> they are over the 85% live mark. This according to the AS records is
>> true.. but then one has to ask, how does a full recover so much
>> memory if there are so many tenured regions that are more than
>> 85% full. I have other logs from older JVMs where heap occupancy
>> after each mixed collection is greater than the previous collection
>> occupancy.. and then you get the full that collects everything. I
>
> I suspect... the application is the main cause for the full gc here -
> and bad timing.
>
>> didn’t spam the list with those logs because they didn’t have print
>> adaptive sizing turned on and they were from older JVMs, meaning the
>> problem may have already been solved. This JVM is a 141 build, so
>> while it behaves somewhat differently, it’s the same spike up until
>> only a full can help, run the full, and life goes on as if nothing
>> happened.
>
> If you look at the previous cycles of allocation and reclamation, you
> can see that most of them peak at 6G total heap.
>
> There are a few spikes in the heap occupancy graph where during a short
> interval more than 6G of memory is in use temporarily. That means,
> memory usage spikes a few times in succession, with concurrent marking
> running all the time.
>
> Here are the locations of the peaks within the log:
>
> - ~227938.857: peaking at 6.9G, after that young gc a single concurrent
> mark cleans out 1G of completely empty memory.
> - ~228085.205: peaking at 7G, again a following concurrent mark cleans
> out 3G of completely empty memory.
> - ~240985.106: the 12G peak with the full gc.
> - ~242363.281: an 8.6G peak.
> - ~246666.233: a 6.5G peak.
> - ~249968.146: a 6.5G peak.
> - ~251480.608: a 7.5G peak.
>
> In all/most of these cases, either the Cleanup pause or mixed GC can
> reclaim lots of memory in between, but memory usage spikes again and
> again, repeating this behavior a few times until things settle down.
>
> All this is sprinkled with initial marks caused by humongous allocations.
>
> This leads to my hypothesis that humongous objects are potentially an
> issue here, in combination with the application simply doing lots of
> large allocations and holding on to them for a long time.
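>
> To put a number on what "humongous" means here, a quick sketch (the 4M
> region size is an assumption - the likely ergonomic choice for a heap of
> this size - so check the actual value in your log):
>
>     public class HumongousThreshold {
>         public static void main(String[] args) {
>             long regionSize = 4L << 20;           // assumed region size, verify in the log
>             long threshold = regionSize / 2;      // G1 allocates anything larger than half a region as humongous
>             long elements = (threshold - 16) / 8; // ~16 byte array header, 8 bytes per long element
>             System.out.println("a long[] becomes humongous at ~" + elements + " elements"); // ~262,000
>         }
>     }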
>
> Let me explain a few (messy) heap usage graphs that show my reasoning:
>
> http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-overview.png
> shows an overview of heap usage after gc of the whole log file, split by
> GC/GC reason.
>
> You can easily spot the live set spikes in that graph, including the
> one that led to full gc.
>
> Another interesting line here is the brownish one with the triangles,
> showing heap usage for initial mark gcs caused by humongous
> allocations. As you can see there, the application uses a considerable
> amount of humongous objects which may be the cause for the large memory
> usage - i.e. if they are not ideally sized.
>
> Btw, these spikes seem to be somewhat regular, at least from around the
> middle of the log, could it be some "result flushing" occurring every
> 1500 seconds or so?
>
> http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-example.png
> shows the heap usage graphs for the spike next to the full gc.
>
> It shows that
> - there are occasionally phases of the application where a lot of
> memory is in use for like a minute or so (e.g. from second 242300 to
> 242380).
> - during this phase, mixed gc can clean up quite a bit every time (pink
> triangles), but apparently the application is also often allocating
> faster than g1 can reclaim (which causes the full gc in the other
> case).
> - a lot of the marking is initiated by humongous allocation (brown
> triangles)
> - the cleanup pause after marking can often decrease memory usage quite
> a bit by itself (green squares), which indicates humongous objects
> dying fast (cleanup reclaims all completely empty regions, and
> typically these are humongous objects).
>
> I can't see any unusual issue with G1, apart from that it issues two
> useless young gcs with eden size of zero - compared to the following
> full gc they are probably negligible ;) (this is
> https://bugs.openjdk.java.net/browse/JDK-8165150)
>
> So my suggestion is to look at whether something can be done about the
> humongous objects here, i.e. look if the waste caused by them is large
> or not, and if so, try to decrease that.
>
> There does not seem to be a significant if any fragmentation issue with
> humongous objects here, and whether marking or eager reclaim catches
> them has the same effect, although eager reclaim is cheaper. (I think I
> saw some occurrences of eager reclaim).
>
> Before suggesting potential fixes and optimization, I recommend
> verifying this hypothesis ;)
>
> The simplest and least intrusive way (to me) to look at humongous
> object sizes is to enable -XX:+G1TraceEagerReclaimHumongousObjects (it
> is an experimental option) to get information about humongous objects
> at every GC; if you prefer somewhat less frequent output, use
> -XX:+G1PrintRegionLivenessInfo, which prints at the end of every
> concurrent marking. Look at the occupancy of the last region of a set
> of humongous regions and see if there is lots of emptiness in there.
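>
> As a rough way to read those numbers, the waste for a single humongous
> object is the unused tail of its last region; a small sketch (the region
> and object sizes are assumptions, for illustration only):
>
>     public class HumongousWaste {
>         public static void main(String[] args) {
>             long regionSize = 4L << 20;                    // assumed region size
>             long objectSize = (8L << 20) + 16;             // e.g. a doubled array: 8M payload plus header
>             long regions = (objectSize + regionSize - 1) / regionSize; // humongous objects take whole regions
>             long waste = regions * regionSize - objectSize;            // here: just under 4M of dead space
>             System.out.println("regions used: " + regions + ", wasted bytes: " + waste);
>         }
>     }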
>
> You can also do heap dumps with jmap and analyze the arrays with e.g.
> MAT, but that is very intrusive.
>
> _If_ there is lots of waste at the end of humongous regions, you can do
> one of the following:
>
> - fix the application to not use such awkward array sizes ;) I.e. the
> typical double-the-array-size algorithm can create lots of waste for G1,
> as it creates objects sized 2^x bytes plus a 16 byte header, which just
> don't fit nicely into power-of-two-sized regions.
>
> Maybe, by sizing these arrays better in advance (e.g. if you have an
> approximate idea of their final size), you can actually avoid lots of
> memory allocation and copying of array contents, which is not free
> either (independently of the chosen collector), and get a significant
> boost for your application. (There is a small sketch of this after the
> list of options below.)
>
> - increase the heap region size using -XX:G1HeapRegionSize to get fewer
> humongous objects, so that there are not that many huge arrays wasting
> space. This is probably the best short-term fix to get rid of a lot of
> humongous object allocations.
>
> - increase the heap
>
> - contributions for https://bugs.openjdk.java.net/browse/JDK-8172713
> are welcome.
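>
> Regarding the first option, a minimal sketch contrasting the usual
> grow-by-doubling pattern with pre-sizing (the element count and buffer
> type are made up for illustration):
>
>     import java.util.Arrays;
>
>     public class Presize {
>         public static void main(String[] args) {
>             int needed = 1 << 20;                 // made-up element count
>
>             // Typical grow-by-doubling: every resize allocates a fresh array of
>             // 2^x elements plus a ~16 byte header, each one a candidate humongous
>             // allocation that just overflows a power-of-two-sized region.
>             long[] grown = new long[16];
>             int size = 0;
>             for (int i = 0; i < needed; i++) {
>                 if (size == grown.length) {
>                     grown = Arrays.copyOf(grown, grown.length * 2); // new allocation + copy
>                 }
>                 grown[size++] = i;
>             }
>
>             // Pre-sizing: one allocation, no intermediate copies for G1 to clean up.
>             long[] presized = new long[needed];
>             for (int i = 0; i < needed; i++) {
>                 presized[i] = i;
>             }
>         }
>     }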
>
> Other issues:
>
> - there are marking failures, i.e. g1 can't complete marking and needs
> to restart it, particularly during the ramp-up of heap usage that leads
> to the full gc.
> I do not think it makes a difference for that particular situation, but
> it might be awkward for others.
> Increase -XX:MarkStackSize/-XX:MaxMarkStackSize (I think that's their
> name) to something larger to avoid the "concurrent-mark-reset-for-
> overflow" messages.
> Or switch to jdk9 where this issue most likely will not occur.
>
> - the time from end of marking to reclaiming memory is a bit long and
> might be problematic sometimes. The only fix is to upgrade to jdk9,
> which I highly recommend. (Also because the logging in jdk8 is a mess
> ;))
>
> - the (by default) selected IHOP value seems too low; the application
> wastes 6G of heap. Again, either update to jdk9 and profit from the
> adaptive IHOP (and other improvements), or increase the IHOP manually
> (-XX:InitiatingHeapOccupancyPercent) for increased throughput.
>
> Note that increasing the IHOP makes no real difference in the frequency
> of the full gcs as it does not change how much and how long the
> application holds on to memory.
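>
> For reference, a back-of-the-envelope calculation of where marking
> starts with the default IHOP (the 12G heap is taken from this log, the
> 45% default is the jdk8 value):
>
>     public class IhopThreshold {
>         public static void main(String[] args) {
>             long heapBytes = 12L << 30;                         // max heap from this log
>             int ihopPercent = 45;                               // jdk8 default InitiatingHeapOccupancyPercent
>             long markingStart = heapBytes * ihopPercent / 100;  // ~5.4G: concurrent cycles start here
>             System.out.printf("marking starts at ~%.1fG%n", markingStart / (double) (1L << 30));
>         }
>     }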
> In your application, except in the cases when these spikes occur,
> marking takes like 300ms max (see http://cr.openjdk.java.net/~tschatzl/
> g1-strange-log/strange-g1-concmark.png).
>
> In case of spikes, particularly even with the repeated marking, it
> takes like 9s where the long marking times coincide with the spikes. I
> assume that during these times a single very large array of j.l.O
> serializes marking progress.
>
> These spikes should be much more muted in jdk9 too :)
>
> Thanks,
> Thomas
>
> P.S.: Did I already mention to try out jdk9? ;)
>