Repeated ParNews when Young Gen is Empty?

Andrew Colombi acolombi at palantir.com
Wed Jul 10 16:20:43 PDT 2013


Thanks for the suggestions.

> CR opened with Oracle
No CR with Oracle yet.  I'll see what we can do.

> turn on per-GC-worker stats per phase
Is this something I can turn on now?  A quick scan of the GC options I
know of doesn't turn up anything that does this.

> trying ParallelGC
Just to be clear, are you recommending we try the throughput collector,
e.g. -XX:+UseParallelOldGC?  I'm up for that, given how bad things are
with the current configuration.
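
For reference, here's roughly what I'd expect that experiment to look
like flag-wise (a sketch only -- keeping our heap sizing and dropping
the CMS-specific flags):

-Xms74752m -Xmx74752m -Xmn15000m
-XX:+UseParallelGC
-XX:+UseParallelOldGC
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log

i.e. removing -XX:+UseConcMarkSweepGC, -XX:+UseParNewGC,
-XX:+CMSParallelRemarkEnabled, and -XX:CMSInitiatingOccupancyFraction=60.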

> trying the latest JVM
Definitely a good idea.  We'll try the latest.  If we were to roll back,
is there any particular version you would recommend?

-Andrew

On 7/10/13 3:44 PM, "Srinivas Ramakrishna" <ysr1729 at gmail.com> wrote:

>[Just some off-the-cuff suggestions here, no solutions.]
>
>Yikes, this looks to me like a pretty serious bug. Has a CR been
>opened with Oracle for this problem?
>Could you downrev your version of the JVM to see if you can determine
>when the problem may have started?
>
>The growing ParNew times are definitely concerning. I'd suggest that,
>at the very least (if not already the case), we should be able to turn
>on per-GC-worker stats per phase for ParNew, in much the same way that
>Parallel and G1 provide them.
>
>You might also try ParallelGC to see if you can replicate the growing
>minor GC problem. Given how bad it is, my guess is that it stems from
>some linear single-threaded root-scanning issue, and so (at least with
>an instrumented JVM -- which you might be able to request from Oracle
>or build on your own) it could be tracked down quickly.
>
>Also (if possible) try the latest JVM, to see if the problem is a
>known one that has perhaps already been fixed.
>
>-- ramki
>
>
>On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi <acolombi at palantir.com> wrote:
>> Thanks for the response and help.  I've done some more investigation and
>> learning, and I have another _fascinating_ log from production.  First,
>> here are some things we've done.
>>
>> * We're going to enable -XX:+PrintTLAB as a way of learning more
>> about how the application is allocating memory in Eden.
>> * We're examining areas of the code base that might be allocating
>> large objects (we haven't found anything egregious, e.g. exceeding
>> 10~100MB allocations).  Nevertheless, we have a few changes that will
>> reduce the size of these objects, and we're deploying the change this
>> week.  (A throwaway probe I've been using is sketched below.)
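>>
>> As a sanity check on the large-object hypothesis, I've been poking at
>> it with a throwaway harness along these lines (hypothetical code, not
>> from our product -- run with our production flags and watch gc.log):
>>
>> // LargeAllocProbe.java -- throwaway probe; assumes a big heap like ours.
>> public class LargeAllocProbe {
>>     public static void main(String[] args) throws InterruptedException {
>>         byte[][] keep = new byte[64][];  // small rolling live set
>>         for (int i = 0; ; i++) {
>>             // steady 1MB young-gen churn
>>             keep[i % keep.length] = new byte[1 << 20];
>>             if (i % 10_000 == 0) {
>>                 // one giant allocation: 1 billion doubles, ~8GB
>>                 double[] giant = new double[1_000_000_000];
>>                 giant[0] = 1.0;  // touch it so it can't be elided
>>                 Thread.sleep(100);
>>             }
>>         }
>>     }
>> }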
>>
>> Now check out this event (more of the log is below, since it's pretty
>> damn interesting):
>>
>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs]
>> 30914.319: [GC 30914.320: [ParNew
>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs]
>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs]
>>
>> Notable things:
>>
>> * The first ParNew took 2811s of user time, but only 122s of
>> wall-clock.  My understanding is that the copy collector's performance
>> is primarily bound by the number of objects that end up getting copied
>> to survivor or tenured.  Looking at these numbers, approximately 100MB
>> survived the ParNew collection.  100MB surviving hardly seems cause
>> for a 122s pause.  (Some back-of-the-envelope arithmetic follows the
>> list.)
>> * Then it prints an empty ParNew line.  What's that about?  It feels
>> like the garbage collector is either: i) hitting a race (two threads
>> race to initiate the ParNew, so both print the ParNew line), or ii)
>> following an unusual sequence of branches, and printing the second
>> ParNew line is an accidental bug.  In either case, I'm going to try
>> to track it down in the hotspot source.
>> * Another ParNew hits, which takes even longer, but otherwise looks
>> similar to the first.
>> * Third ParNew, and most interesting: the GC reports that the young
>> gen *GROWS* during GC.  The collection begins at 247MB (why? did I
>> really allocate a 12GB object? it doesn't seem likely) and ends at
>> 312MB.  That's fascinating.
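>>
>> Back-of-the-envelope on that first pause (assuming the pause ought to
>> be dominated by copying survivors): user=2811.37s over real=122.59s is
>> roughly 23 threads' worth of work, yet only ~122MB (125395K) survived.
>> That is about 1MB/s of effective copy throughput for the whole pause,
>> orders of magnitude below what a copying collector should sustain, so
>> nearly all of that time must be going somewhere other than copying --
>> root scanning, card scanning, termination, or the like.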
>>
>> My next step is to learn what I can by examining the ParNew source.
>> If anyone has ever seen this, or understands why allocations grow
>> during garbage collection, I would be very grateful for your guidance.
>>
>> Thanks for your help,
>> -Andrew
>>
>> Here are more stats about my VM and additional log output:
>>
>> java version "1.7.0_21"
>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>
>> Here are all the GC-relevant parameters we are setting:
>>
>> -Dsun.rmi.dgc.client.gcInterval=3600000
>> -Dsun.rmi.dgc.server.gcInterval=3600000
>> -Xms74752m
>> -Xmx74752m
>> -Xmn15000m
>> -XX:PermSize=192m
>> -XX:MaxPermSize=1500m
>> -XX:CMSInitiatingOccupancyFraction=60
>> -XX:+UseConcMarkSweepGC
>> -XX:+UseParNewGC
>> -XX:+CMSParallelRemarkEnabled
>> -XX:+ExplicitGCInvokesConcurrent
>> -verbose:gc
>> -XX:+PrintGCDetails
>> -XX:+PrintGCTimeStamps
>> -XX:+PrintGCDateStamps // I removed this from the output above to
>> make it slightly more concise
>> -Xloggc:gc.log
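>>
>> (Sanity-checking the log against these flags: assuming the default
>> -XX:SurvivorRatio=8, -Xmn15000m = 15,360,000K splits into 8/10 eden
>> and 1/10 per survivor space, so the reported young capacity of eden
>> plus one survivor is 9/10 * 15,360,000K = 13,824,000K -- matching the
>> 13824000K in every ParNew line.)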
>>
>> And some GC logs; notice how the time to GC grows roughly
>> exponentially over the first five collections:
>>
>>
>> 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: user=3.82 sys=0.02, real=0.19 secs]
>> 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), 1.5997950 secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: user=35.92 sys=0.02, real=1.60 secs]
>> 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), 5.2605870 secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: user=119.75 sys=0.06, real=5.26 secs]
>> 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] [Times: user=243.13 sys=0.10, real=10.65 secs]
>> 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] [Times: user=740.98 sys=0.34, real=32.41 secs]
>> 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] [Times: user=1169.93 sys=0.38, real=51.05 secs]
>> 30365.735: [GC 30365.736: [ParNew
>> 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] [Times: user=1472.58 sys=0.43, real=64.27 secs]
>> 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] [Times: user=1799.07 sys=0.50, real=78.48 secs]
>> 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] [Times: user=2181.58 sys=0.71, real=95.17 secs]
>> 30655.141: [GC 30655.142: [ParNew
>> 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] [Times: user=2546.12 sys=0.73, real=111.00 secs]
>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs]
>> 30914.319: [GC 30914.320: [ParNew
>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs]
>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs]
>> 31202.704: [GC 31202.705: [ParNew
>> 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), 139.7664350 secs] 34065393K->23473563K(75010048K), 139.7675110 secs] [Times: user=3206.88 sys=0.86, real=139.75 secs]
>> 31353.548: [GC 31353.549: [ParNew: 13824000K->442901K(13824000K), 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: user=15.23 sys=0.12, real=0.86 secs]
>> 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: user=2.49 sys=0.02, real=0.14 secs]
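>>
>> (One more curiosity in that last stretch: the 31202.704 collection
>> ends with the young gen at exactly 1536000K, which -- assuming the
>> default -XX:SurvivorRatio=8 -- is precisely one full survivor space
>> (15,360,000K / 10).  That smells like survivor overflow during that
>> collection.)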
>>
>>
>>
>>
>> On 7/1/13 2:42 PM, "Bernd Eckenfels" <bernd.eckenfels at googlemail.com>
>> wrote:
>>
>>>On 01.07.2013 at 22:44, Andrew Colombi <acolombi at palantir.com> wrote:
>>>> My question is, why would it do three ParNews, only 300ms apart
>>>> from each other, when the young gen is essentially empty?  Here are
>>>> three hypotheses that I have:
>>>> * Is the application trying to allocate something giant, e.g. a
>>>> 1-billion-element double[]?  Is there a way I can test for this,
>>>> i.e. some JVM-level logging that would indicate very large objects
>>>> being allocated?
>>>
>>>That was my suspicion as well. (And I don't know a good tool for the
>>>Sun VM; with IBM's you can trace it.)
>>>
>>>> * Is there an explicit System.gc() in 3rd party code? (Our code is
>>>> clean.) We're going to disable explicit GC in our next maintenance
>>>> period.  But this theory doesn't explain concurrent mode failure.
>>>
>>>I think System.gc() will also not trigger three ParNews in a row.
>>>
>>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on
>>>> every collection?  I've read that CMS tenured gen can suffer from
>>>> fragmentation.
>>>
>>>ParNew is a copying collector, so it is automatically compacting. But
>>>the promoted objects might of course fragment the old gen, due to the
>>>PLABs. Your log is from ~13h of uptime; do you see this before/after
>>>as well?
>>>
>>>Because there was no follow-up, I will just mention some more
>>>candidates to look out for: the changes around CMSWaitDuration
>>>(RFE 7189971), which I think have been merged into 1.7.0.
>>>
>>>And maybe enabling more diagnostics can help:
>>>
>>>-XX:PrintFLSStatistics=2
>>>
>>>(This prints free-list space statistics for the CMS old gen -- total
>>>free space, max chunk size, and so on -- which is a direct way to
>>>watch fragmentation.)
>>>
>>>Greetings
>>>Bernd
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>
>

