YGC time increasing suddenly

Srinivas Ramakrishna ysr1729 at gmail.com
Fri Dec 20 16:38:31 PST 2013


Hi Luciano --


On Fri, Dec 20, 2013 at 11:14 AM, Luciano Molinari <lucmolinari at gmail.com> wrote:

> ...
> Yes, as soon as I start to see longer pauses the promotion rate rapidly
> increases.
>
> As was pointed out in one of the previous e-mails, and as I checked today
> using JConsole, survivor space is almost empty while YGC time is OK, with
> just a few MB in use. But when YGC time increases, survivor also starts to
> fill up, which must explain the increasing promotion rate.
>
> Neither the young gc time nor the promotion rate drops after a full gc. In
> the log, the GC you commented on at 1992 seconds was the last "quick"
> minor GC. I've attached a gc log and jstat output to this e-mail with the
> complete information.
>
> I'll try to test -XX:+NeverPromote.
>

Thanks for the complete logs. I am pretty sure NeverPromote won't help at
all.

As you can see, there's almost a phase change in the survival behaviour of
your program at around 2290 - 2291 s, where the survival rate suddenly
sky-rockets. Since the behaviour appears to be very deterministic, I'd
venture that something happens in your program at that point which causes
many more objects to be retained. Perhaps this is a leak of some kind, or
something being inadvertently retained?

There wasn't any ParNew data after the first CMS gc, so I couldn't tell
whether that made any difference to survival (per my original theory that
this was a case of premature promotion causing nepotism). Eyeballing your
GC logs, though, it doesn't immediately look like that to me.

This is one place where the promotion census tool would have helped. Maybe
you could get close enough, though, by comparing class histogram dumps or
heap dumps taken before the onset of the high survival rates with ones
taken after the onset.
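For instance (assuming you can point the standard JDK jmap tool at the
process; <pid> below is just a placeholder for your process id), capture one
histogram while young gcs are still quick and another once the survival rate
has jumped, then compare the top entries:

  jmap -histo <pid> > histo-before.txt   # while young gcs are still quick
  jmap -histo <pid> > histo-after.txt    # after the ~2290 s jump

If the histograms aren't conclusive, a full heap dump taken after the onset
(jmap -dump:format=b,file=after.bin <pid>) can be browsed with jhat or a
similar heap analyzer to see what is holding on to the extra survivors.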

I am almost sure this is a program issue and not an artifact of GC. To test
that hypothesis you could, as I think Bernd suggested earlier, run with a
young gen and heap twice as large (or half as large) as you have now, and
see if the survival rate jumps up at around the same time as it does now. If
the "jump time" remains invariant under heap reshaping (to larger values),
then it's very likely your application itself and not an artifact of GC.
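Concretely (these are just your current sizes scaled, with all your other
flags left as they are), the halved run would be something like

  -Xms3584m -Xmx3584m -XX:NewSize=2304m -XX:MaxNewSize=2304m

and the doubled run

  -Xms14g -Xmx14g -XX:NewSize=9216m -XX:MaxNewSize=9216m

though on a 16 GB box the doubled heap leaves very little headroom, so the
halved run may be the more practical of the two experiments.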

Thanks for sharing, all the best, and let us know how this turned out!
-- ramki



> I really appreciate the support and time of all of you.
>
> Regards,
>
>
> On Fri, Dec 20, 2013 at 7:26 AM, Srinivas Ramakrishna <ysr1729 at gmail.com> wrote:
>
>> Hi Luciano --
>>
>> Look at the rate of promotion. It sounds as if the rate of promotion and
>> of survival suddenly jumps up where you start seeing longer pauses. If you
>> believe the promotion and survivor volumes should not increase like that,
>> one possibility is "nepotism", where objects that were prematurely
>> promoted and have since died keep the objects they refer to alive. Try
>> the following experiment.
>>
>> (1) In one case, let the system run as it has always run. Notice the
>>      increase in young gc time. Let the system continue to run until a
>>      full gc occurs, and see if the young gc time drops.
>>
>> (2) In a second case, let the system run until you see the increased
>>      young gc times. At this time induce an explicit full gc (via
>>      jconsole; see the sketch after this list) and see if the young gc
>>      times fall.
>>
>> (3) Repeat (2), but with -XX:+ExplicitGCInvokesConcurrent
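(An aside on (2): if jconsole isn't convenient, the same effect can be had
by invoking the "gc" operation on the java.lang:type=Memory MBean over JMX.
A minimal sketch, assuming the target JVM was started with the
com.sun.management.jmxremote.* properties and with HOST:PORT standing in for
your JMX endpoint:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceFullGC {
    public static void main(String[] args) throws Exception {
        // HOST:PORT is a placeholder for the remote JVM's JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://HOST:PORT/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // The "gc" operation on the Memory MXBean is the same one
            // jconsole exposes; it is equivalent to calling System.gc()
            // inside the target JVM.
            conn.invoke(new ObjectName("java.lang:type=Memory"), "gc",
                        null, null);
        } finally {
            connector.close();
        }
    }
}

Whether that explicit gc runs as a stop-the-world full gc or as a concurrent
cycle depends on -XX:+ExplicitGCInvokesConcurrent, which is exactly the
difference (3) is meant to expose.)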
>>
>> Finally, one question: do you see the increase in young gc times after
>> an explicit full gc? Or are you saying that the minor gc at 1992 seconds
>> that you show below is the last "quick" minor GC, and every one from the
>> next onwards is slow?
>>
>> If you share a complete GC log, we might be able to help. The age 15
>> cohort in the last quick minor GC probably holds the clue and, once
>> promoted, allows nepotism to set in. If that is the case, then
>> -XX:+NeverPromote (if the flag still exists) might be worth experimenting
>> with (although it carries its own dangers, and isn't really a fix, just a
>> means of figuring out whether this might be the issue).
>>
>> Basically, I think something like a member of a linked list is getting
>> into the old generation before it dies, and then proceeds to keep its
>> successor alive, and so on. A full gc may or may not fix that kind of
>> issue. You could try clearing the links of list members when you pull
>> them out of the list (a sketch of what I mean follows below), and that
>> might reduce the occurrence of such phenomena. I vaguely recall this
>> behaviour with LinkedBlockingQueue, because of which Doug Lea was forced
>> to hold his nose and clear the references when removing members from the
>> list to prevent this kind of false retention. (Ah, I found a reference:
>> http://thread.gmane.org/gmane.comp.java.jsr.166-concurrency/5758
>> and https://bugs.openjdk.java.net/browse/JDK-6805775 )
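(To make "clear the links" concrete, here is a minimal sketch with a
hypothetical hand-rolled queue; the names are made up and not from your
code. The point is the two null assignments in poll():

final class Node<T> {
    T item;
    Node<T> next;
    Node(T item) { this.item = item; }
}

final class SimpleQueue<T> {
    private Node<T> head, tail;

    void add(T item) {
        Node<T> n = new Node<T>(item);
        if (tail == null) { head = n; tail = n; }
        else { tail.next = n; tail = n; }
    }

    T poll() {
        Node<T> h = head;
        if (h == null) return null;
        head = h.next;
        if (head == null) tail = null;
        T item = h.item;
        h.next = null;  // without this, a tenured-but-dead node keeps its
                        // young successor reachable until the old gen is
                        // collected
        h.item = null;  // drop the payload reference as well
        return item;
    }
}

LinkedBlockingQueue was changed to do this sort of clearing on dequeue, per
the links above; if your hot path uses a home-grown structure, the analogous
clearing is worth trying.)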
>>
>> -- ramki
>>
>>
>> On Wed, Dec 18, 2013 at 10:58 AM, Luciano Molinari <lucmolinari at gmail.com> wrote:
>>
>>> Hi everybody,
>>>
>>> We have a standalone Java app that receives requests through RMI, and
>>> almost all the objects it creates are short-lived (< ~100 ms). The app is
>>> running on a 24-core server with 16 GB RAM (Red Hat Linux). During our
>>> performance tests (10k requests/second) we started to face a problem
>>> where the throughput decreased suddenly just a few minutes after the app
>>> was started.
>>> So I started to investigate GC behaviour and to make some adjustments
>>> (increase memory, use CMS...), and now we are able to run the app
>>> properly for about 35 minutes. At that point the time spent in young
>>> collections grows sharply, although no Full GC is executed (old gen is
>>> only ~4% full).
>>>
>>> I've done tests with many different parameters, but currently I'm using
>>> the following ones:
>>> java -server -verbose:gc -XX:+PrintGCDetails
>>> -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
>>> -XX:PrintFLSStatistics=1 -XX:SurvivorRatio=4
>>> -XX:ParallelGCThreads=8 -XX:PermSize=256m -XX:+UseParNewGC
>>> -XX:MaxPermSize=256m -Xms7g -Xmx7g -XX:NewSize=4608m -XX:MaxNewSize=4608m
>>> -XX:MaxTenuringThreshold=15 -Dsun.rmi.dgc.client.gcInterval=3600000
>>> -Dsun.rmi.dgc.server.gcInterval=3600000
>>>  -Djava.rmi.server.hostname=IP_ADDRESS
>>>
>>> If I use this same configuration (without CMS), the same problem occurs
>>> after 20 minutes, so it doesn't seem to be related to CMS. Actually, as I
>>> mentioned above, CMS (Full GC) isn't executed during the tests.
>>>
>>> Some logs I've collected:
>>>
>>> 1992.748: [ParNew
>>> Desired survivor size 402653184 bytes, new threshold 15 (max 15)
>>> - age   1:    9308728 bytes,    9308728 total
>>> - age   2:       3448 bytes,    9312176 total
>>> - age   3:       1080 bytes,    9313256 total
>>> - age   4:         32 bytes,    9313288 total
>>> - age   5:      34768 bytes,    9348056 total
>>> - age   6:         32 bytes,    9348088 total
>>> - age  15:       2712 bytes,    9350800 total
>>> : 3154710K->10313K(3932160K), 0.0273150 secs] 3215786K->71392K(6553600K)
>>>
>>> //14 YGC happened during this window
>>>
>>> 2021.165: [ParNew
>>> Desired survivor size 402653184 bytes, new threshold 15 (max 15)
>>> - age   1:    9459544 bytes,    9459544 total
>>> - age   2:    3648200 bytes,   13107744 total
>>> - age   3:    3837976 bytes,   16945720 total
>>> - age   4:    3472448 bytes,   20418168 total
>>> - age   5:    3586896 bytes,   24005064 total
>>> - age   6:    3475560 bytes,   27480624 total
>>> - age   7:    3520952 bytes,   31001576 total
>>> - age   8:    3612088 bytes,   34613664 total
>>> - age   9:    3355160 bytes,   37968824 total
>>> - age  10:    3823032 bytes,   41791856 total
>>> - age  11:    3304576 bytes,   45096432 total
>>> - age  12:    3671288 bytes,   48767720 total
>>> - age  13:    3558696 bytes,   52326416 total
>>> - age  14:    3805744 bytes,   56132160 total
>>> - age  15:    3429672 bytes,   59561832 total
>>> : 3230658K->77508K(3932160K), 0.1143860 secs] 3291757K->142447K(6553600K)
>>>
>>> Besides the longer collection times, I also noticed that all 15 ages
>>> started to hold larger volumes.
>>>
>>> I must say I'm a little confused by this scenario. Does anyone have any
>>> tips?
>>>
>>> Thanks in advance,
>>> --
>>> Luciano
>>>
>>>
>>>
>>
>
>
> --
> Luciano Davoglio Molinari
>