Strange G1 behavior

Guillaume Lederrey guillaume.lederrey at gmail.com
Tue Oct 17 21:48:06 UTC 2017


Quick note before going to bed...

On 17 October 2017 at 23:28, Kirk Pepperdine <kirk at kodewerk.com> wrote:

> Hi all,
>
> I’ve run the data through my crude visualization tool. I’ve screencast it
> for those who don’t want to go through the pain of cloning and sorting
> out how to use the tooling.
>
> https://drive.google.com/a/jclarity.com/file/d/0B6IuyTuhCQTPUjNyejVGN1U5Vms/view?usp=sharing
>

I don't seem to have access; I've sent an access request...


>
> This log looks different in that the mixed collections are actually
> recovering space. However, there seems to be an issue with RSet update
> times just as heap occupancy jumps, though I would view this as a normal
> response to increasing tenured occupancy. The spike in tenured occupancy
> does force young to shrink to a size where “to-space” has no room to
> accept incoming survivors.
>
> Specific recommendations: the app is churning through enough weak
> references that it would benefit from parallelized reference processing
> (off by default). I would also double the max heap and limit the shrinking
> of young to 20% to start with (the default is 5%).
>

I'll double max heap tomorrow. Parallel ref processing is already enabled
(-XX:+ParallelRefProcEnabled), and young is already limited to a max of 25%
(-XX:G1MaxNewSizePercent=25). I'll add -XX:G1NewSizePercent=20 (if that's
the correct option).
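
For concreteness, roughly what I expect the relevant part of our command
line to look like after those changes (the -Xmx value below is just a
placeholder for double our current max heap, and if we're still on JDK 8 I
believe the G1*NewSizePercent flags also need
-XX:+UnlockExperimentalVMOptions):

    java -Xmx<double the current max heap> \
         -XX:+UseG1GC \
         -XX:+ParallelRefProcEnabled \
         -XX:+UnlockExperimentalVMOptions \
         -XX:G1NewSizePercent=20 \
         -XX:G1MaxNewSizePercent=25 \
         ... (rest of our existing options unchanged)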

Thanks for the analysis!


>
> Kind regards,
> Kirk
>
> On Oct 17, 2017, at 2:36 PM, Guillaume Lederrey <
> guillaume.lederrey at gmail.com> wrote:
>
> It looks like a 20MB attachment isn't the way to go... So here is the
> log: https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0
>
> On 17 October 2017 at 14:21, Guillaume Lederrey <
> guillaume.lederrey at gmail.com> wrote:
>
>> Hello!
>>
>> Thanks a lot for your analysis! I have to admit that I barely follow your
>> train of thought.
>>
>> Here is another GC log after adding "-XX:+G1PrintRegionLivenessInfo".
>> This is only a few hours worth of data, I'll send additional logs as they
>> are generated.
>>
>> From what I understand of your discussions, we also need to take a deeper
>> look at why our application is allocating that much memory and what we can
>> do to prevent it. My very high level guess: we have some clients issuing
>> more expensive queries. Since our SPARQL endpoint is publicly accessible,
>> that would mean we need to find a way to proactively estimate the memory
>> cost of a query and drop the ones that are too expensive. Not an easy
>> problem either...
>>
>> Thanks again!
>>
>>   Guillaume
>>
>> On 17 October 2017 at 13:31, Thomas Schatzl <thomas.schatzl at oracle.com>
>> wrote:
>>
>>> Hi,
>>>
>>> On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:
>>> > > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl at oracle
>>> > > .com> wrote:
>>> > >
>>> > > Hi Kirk,
>>> > >
>>> > > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
>>> > > > Hi Thomas,
>>> > > >
>>> > > > Again, thank you for the detailed response.
>>> > > >
>>> > > >
>>> > > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
>>> > > > > <thomas.schatzl at oracle.com> wrote:
>>> > > > >
>>> > > > > For the allocation rate, please compare the slopes of heap
>>> > > > > usage after (young) gcs during these spikes (particularly in
>>> > > > > that full gc case) and normal operation.
>>> > > >
>>> > > > Censum estimates allocation rates, as this is a metric that I
>>> > > > routinely evaluate.
>>> > > >
>>> > > > This log shows a spike at 10:07 which correlates with the Full GC
>>> > > > event. However, the allocation rates, while high, are well within
>>> > > > values I’ve seen with many other applications that are well
>>> > > > behaved. Censum also estimates rates of promotion, and those seem
>>> > > > exceedingly high at 10:07. That said, there are spikes just after
>>> > > > 10:10 and around 10:30 which don’t trigger a Full GC. In both cases
>>> > > > the estimates for allocation rates are high, though the estimates
>>> > > > for rates of promotion, while high, are not as high as those seen
>>> > > > at 10:07.
>>> > > >
>>> > > > All in all, nothing here seems out of the ordinary, and while I
>>> > > > want you to be right about the waste and PLAB behaviors, these
>>> > > > spikes feel artificial, i.e. I still want to blame the collector
>>> > > > for not being able to cope with some aspect of application
>>> > > > behavior that it should be able to cope with... that is, something
>>> > > > other than a high allocation rate with low recovery due to data
>>> > > > simply being referenced and therefore not eligible for
>>> > > > collection...
>>> > >
>>> > > When I wrote "allocation rate" here, I always meant "promotion
>>> > > rate". For this discussion (and in general), in a generational
>>> > > collector the application's real allocation rate is usually not very
>>> > > interesting.
>>> > >
>>> > > Sorry for being imprecise.
>>> >
>>> > No worries, I get that promotion rates are very important, but when
>>> > deciding what needs to be tuned, allocation rates are also very
>>> > important, so I try to be very precise here, maybe too precise at
>>> > times.
>>>
>>> Let's qualify this further: for the purpose of investigating this full
>>> gc :)
>>>
>>> > > > > In this application, given the information I have, roughly every
>>> > > > > 1500s there seems to be some component in the application that
>>> > > > > allocates a lot of memory in a short time, and holds onto most of
>>> > > > > it for its duration.
>>> > > >
>>> > > > Sorry, but I’m not seeing this pattern in either the
>>> > > > occupancy-after-GC or allocation rate views. What I do see is a
>>> > > > systematic loss of free heap over time (slow memory leak???
>>> > > > Effects of caching???).
>>> > >
>>> > > Let's have a look at the heap usage after gc over time for a few
>>> > > collection cycles before that full gc event. Please look at
>>> > > http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-
>>> > > promo.png which just shows a few of these.
>>> >
>>> > Sorry but I can’t see these charts.
>>>
>>> My email client added some line breaks in the URL because it was too
>>> long, sorry. One option is to manually stitch it together; all graphs
>>> so far are collected in
>>>
>>> http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
>>> strange-g1-promo.png image.
>>>
>>> (and cr.openjdk.java.net is live and kicking)
>>>
>>> > The issue with promotion rates is that you have to ask: is this the
>>> > dog wagging the tail, or the tail wagging the dog? In this case I
>>> > think it’s the tail wagging the dog. Would it be possible that a
>>> > significant portion of the data being promoted was directly due to
>>> > to-space being resized to 0?
>>>
>>> No. At that point the full gc was already inevitable, from my
>>> understanding of the log. If you look at the log, the second to-space
>>> exhaustion before the full gc does not, as the policy assumes, give
>>> back any free regions (typically they do), so there is no point in
>>> restarting the application. Finally, some time later, it performs the
>>> full gc.
>>>
>>> Overall, in my understanding there seems to be a basic out-of-memory
>>> situation here: the application's live set (plus any overheads caused
>>> by memory management) during that application phase is too much for the
>>> given available memory for G1.
>>>
>>> Full GC can free up some of that extra space (and potentially the
>>> application already let go of some data at that time), allowing the
>>> application to continue until everything returns to normal.
>>>
>>> That a GC algorithm needs extra space to work well is not a secret, and
>>> is something the user needs to account for. Iirc some (rather old now)
>>> literature (on smaller heaps, with different applications) indicates
>>> that performance may start to degrade significantly once the live set
>>> exceeds ~33% of the heap. After that full gc we are already at ~66%
>>> live data...
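
(Just to put rough numbers on that rule of thumb, purely for illustration:
on a 12 GB heap, ~33% live set would be ~4 GB, while being at ~66% live
data after the full gc would mean ~8 GB live, leaving the collector only
~4 GB of headroom to work with. Substitute our actual heap size; the ratio
is what matters.)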
>>>
>>> So the behavior we are seeing now seems like expected (but of course
>>> undesirable) behavior of the GC algorithm to me.
>>>
>>> Now there is of course the question of where that overhead comes from,
>>> and how it can be reduced, as the memory used after full gc is quite a
>>> bit below the maximum heap. :)
>>>
>>> All that will probably be a bit clearer with more data.
>>>
>>> > Sorry for being so slow, but I think I’m starting to understand the
>>> > underlying pathology here. I’ll hypothesize here and you can blow the
>>> > whole idea out of the water with more insightful comments...
>>> >
>>> > Since tenured can grow into the space that the to-space is likely to
>>> > also want, preference towards tenured deflates survivor, which in turn
>>> > puts more pressure on tenured, which in turn deflates survivor even
>>> > further; wash, rinse, repeat, Full GC to clean up. Since this didn’t
>>> > happen with the (older) generational collectors, it seems as if
>>> > everything could be ok if a minimum to-space size could be maintained.
>>>
>>> Not sure what you mean by "preference towards tenured": of course the
>>> generations share the same total memory, and if one of them gets
>>> unusually large, the other needs to adapt. G1 already has a minimum
>>> young gen size (which is, as you can see, not a hard minimum), but in
>>> this degenerate situation it simply thinks: "okay, let's try limping
>>> along with less than that because we can. Maybe we are lucky, the
>>> application gets back to normal within the next few seconds, and so we
>>> can avoid a full gc".
>>>
>>> The alternative would basically be to start a fully compacting gc
>>> earlier (the other collectors would do that btw) as soon as the space
>>> statically allocated to old gen runs out.
>>>
>>> Do you think this is preferable behavior?
>>>
>>> > > I added rough linear interpolations of the heap usage after gc (so
>>> > > that the slope of these lines corresponds to the promotion rates).
>>> > > I can see a large, significant difference in the slopes between the
>>> > > collection cycles before the full gc event (black lines) and the
>>> > > full gc event (red line), while all black ones are roughly the
>>> > > same. :)
>>> > >
>>> > > Note that my graphs were painted off-hand without any actual
>>> > > calculation, and in particular the red one looks like an
>>> > > underestimation of the slope.
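
(To make the "slope = promotion rate" reading concrete: during young-only
collection cycles, the heap usage left after each young gc grows essentially
by the amount promoted, so between two consecutive young gcs

    promotion rate ≈ (used after gc N+1 - used after gc N) / (time N+1 - time N)

Mixed collections also reclaim old regions, so there the slope understates
the actual promotion. This is just my reading of how the interpolation lines
relate to the promotion rates Censum reports.)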
>>> > >
>>> > > > As I look at all of the views in Censum I see nothing outstanding
>>> > > > that leads me to believe that this Full GC is a by-product of some
>>> > > > interaction between the collector and the application (some form
>>> > > > of zombies????). Also, one certainly cannot rule out your
>>> > > > speculation
>>> > >
>>> > > It does not look like there is an issue e.g. with
>>> > > j.l.ref.References of
>>> > > any kind.
>>> >
>>> > LinkedList used to be susceptible to nepotism, and all generational
>>> > collectors suffer from zombies.
>>>
>>> Maybe. I did not preclude more issues here :)
>>>
>>> [...]
>>> > > >
>>> > > > I think the answer right now is; we need more data.
>>> > >
>>> > > Agree.
>>> >
>>> > It’s on its way!! Yay for more data.
>>> > You might be interested in this for visualization:
>>> > https://github.com/kcpeppe/regions
>>> > It’s a wee bit crude at the moment but I’ll accept pull requests for
>>> > any refinements.
>>> >
>>>
>>> Thanks,
>>>   Thomas
>>>
>>>
>>
>>
>> --
>> mobile : +41 76 573 32 40
>> skype : Guillaume.Lederrey
>> Freenode: gehel
>>
>> projects :
>> * https://signs-web.herokuapp.com/
>>
>
>
>
> --
> mobile : +41 76 573 32 40
> skype : Guillaume.Lederrey
> Freenode: gehel
>
> projects :
> * https://signs-web.herokuapp.com/
>
>
>


-- 
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel

projects :
* https://signs-web.herokuapp.com/