Repeated ParNews when Young Gen is Empty?

Srinivas Ramakrishna ysr1729 at gmail.com
Wed Jul 10 15:44:39 PDT 2013


[Just some off-the-cuff suggestions here, no solutions.]

Yikes, this looks to me like a pretty serious bug. Has a CR been
opened with Oracle for this problem?
Could you downgrade your version of the JVM to see if you can
determine when the problem may have started?

The growing ParNew times are definitely concerning. At the very
least, if it isn't already the case, we should be able to turn on
per-GC-worker, per-phase stats for ParNew in much the same way that
Parallel and G1 provide.

You might also try ParallelGC to see if you can replicate the growing
minor GC problem. Given how bad it is, my guess is that it stems from
some linear, single-threaded root-scanning issue, and so (at least
with an instrumented JVM -- which you might be able to request from
Oracle or build on your own) it could be tracked down quickly.
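
For example, one possible flag set for such an experiment (keep the
heap-sizing flags as they are; this swaps out CMS/ParNew entirely, so
treat it as a diagnostic run rather than a tuning recommendation):

  -XX:+UseParallelGC -XX:+UseParallelOldGC
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc-parallel.log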

Also, if possible, try the latest JVM to see whether the problem is a
known one that has perhaps already been fixed.

-- ramki


On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi <acolombi at palantir.com> wrote:
> Thanks for the response and help.  I've done some more investigation and
> learning, and I have another _fascinating_ log from production.  First,
> here are some things we've done.
>
> * We're going to enable -XX:+PrintTLAB as a way of learning more about how
> the application is allocating memory in Eden.
> * We're examining areas of the code base that might be allocating large
> objects (we haven't found anything egregious, e.g. nothing exceeding a
> 10-100MB allocation).  Nevertheless, we have a few changes that will
> reduce the size of these objects, and we're deploying the change this
> week.  (A sketch of one way to smoke-test this theory follows the list.)
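>
> To smoke-test the large-object theory, a minimal, hypothetical
> harness like the one below (class name and sizes are invented) should
> show up clearly under -XX:+PrintTLAB: an 800MB array can never fit in
> a TLAB, so it is counted in the per-thread "slow allocs" statistics.
>
> // LargeAllocProbe.java -- hypothetical probe, not production code.
> // Run with:
> //   java -Xmx2g -verbose:gc -XX:+PrintGCDetails -XX:+PrintTLAB \
> //        LargeAllocProbe
> public class LargeAllocProbe {
>     public static void main(String[] args) {
>         // One ~800MB allocation, far larger than any TLAB, so it goes
>         // straight to the shared eden (or to the old gen if eden
>         // cannot hold it).
>         double[] big = new double[100_000_000];
>         System.out.println(big.length);  // keep the array reachable
>     }
> }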
>
> Now check out this event (more of the log is below, since it's pretty damn
> interesting):
>
> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K),
> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs]
> [Times: user=2811.37 sys=1.10, real=122.59 secs]
> 30914.319: [GC 30914.320: [ParNew
> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K),
> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs]
> [Times: user=3050.21 sys=0.74, real=132.86 secs]
> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K),
> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs]
> [Times: user=3398.88 sys=0.64, real=147.94 secs]
>
> Notable things:
>
> * The first ParNew took 2811s of user time and 122s of wall-clock.  My
> understanding is that the copy collector's cost is primarily bound by
> the number of objects that end up getting copied to survivor space or
> tenured.  Looking at these numbers, roughly 122MB survived the ParNew
> collection.  122MB surviving hardly seems cause for a 122s pause (a
> back-of-envelope check follows this list).
> * Then it prints an empty ParNew line.  What's that about?  It feels
> like the garbage collector is either i) hitting a race (two threads
> race to initiate the ParNew, so both print the ParNew line), or ii)
> following an unusual sequence of branches and printing the second
> ParNew line by mistake.  In either case, I'm going to try to track it
> down in the hotspot source.
> * Another ParNew hits, which takes even longer, but otherwise looks
> similar to the first.
> * Third ParNew, and most interesting: the GC reports that the young gen
> *GROWS* during GC.  The collection begins at only 247MB of occupancy
> (why would a GC trigger with eden nearly empty -- did I really try to
> allocate a ~12GB object? that doesn't seem likely) and ends at 312MB.
> That's fascinating.
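>
> To make the first point concrete, here is that arithmetic as a tiny
> snippet (the copy rate is an assumption, chosen very pessimistically;
> the survivor size comes from the log line above):
>
> public class ParNewCheck {
>     public static void main(String[] args) {
>         long survivedK = 125_395;           // young occupancy after GC
>         double survivedMB = survivedK / 1024.0;  // ~122 MB copied
>         double copyRateMBperSec = 100.0;    // assumed slow 1-thread rate
>         // Prints ~1.2s: two orders of magnitude below the observed
>         // 122.6s real / 2811s user.
>         System.out.printf("expected ~%.1f s to copy %.0f MB%n",
>                 survivedMB / copyRateMBperSec, survivedMB);
>     }
> }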
>
> My next step is to learn what I can by examining the ParNew source.  If
> anyone has ever seen this, or understands how occupancy could grow
> during garbage collection, I would be very grateful for your guidance.
>
> Thanks for your help,
> -Andrew
>
> Here are more stats about my VM and additional log output:
>
> java version "1.7.0_21"
> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>
> Here are all the GC relevant parameters we are setting:
>
> -Dsun.rmi.dgc.client.gcInterval=3600000
> -Dsun.rmi.dgc.server.gcInterval=3600000
> -Xms74752m
> -Xmx74752m
> -Xmn15000m
> -XX:PermSize=192m
> -XX:MaxPermSize=1500m
> -XX:CMSInitiatingOccupancyFraction=60
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC
> -XX:+CMSParallelRemarkEnabled
> -XX:+ExplicitGCInvokesConcurrent
> -verbose:gc
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps // I removed this from the output above to make it
> slightly more concise
> -Xloggc:gc.log
>
> And some GC logs; notice how the pause time grows roughly geometrically
> over the first several collections (0.20s, 1.60s, 5.26s, 10.65s,
> 32.42s, ...):
>
>
> 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K),
> 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times:
> user=3.82 sys=0.02, real=0.19 secs]
> 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), 1.5997950
> secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: user=35.92
> sys=0.02, real=1.60 secs]
> 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), 5.2605870
> secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: user=119.75
> sys=0.06, real=5.26 secs]
> 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K),
> 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] [Times:
> user=243.13 sys=0.10, real=10.65 secs]
> 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K),
> 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] [Times:
> user=740.98 sys=0.34, real=32.41 secs]
> 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K),
> 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] [Times:
> user=1169.93 sys=0.38, real=51.05 secs]
> 30365.735: [GC 30365.736: [ParNew
> 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K),
> 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] [Times:
> user=1472.58 sys=0.43, real=64.27 secs]
> 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K),
> 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] [Times:
> user=1799.07 sys=0.50, real=78.48 secs]
> 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K),
> 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] [Times:
> user=2181.58 sys=0.71, real=95.17 secs]
> 30655.141: [GC 30655.142: [ParNew
> 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K),
> 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs]
> [Times: user=2546.12 sys=0.73, real=111.00 secs]
> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K),
> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs]
> [Times: user=2811.37 sys=1.10, real=122.59 secs]
> 30914.319: [GC 30914.320: [ParNew
> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K),
> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs]
> [Times: user=3050.21 sys=0.74, real=132.86 secs]
> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K),
> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs]
> [Times: user=3398.88 sys=0.64, real=147.94 secs]
> 31202.704: [GC 31202.705: [ParNew
> 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K),
> 139.7664350 secs] 34065393K->23473563K(75010048K), 139.7675110 secs]
> [Times: user=3206.88 sys=0.86, real=139.75 secs]
> 31353.548: [GC 31353.549: [ParNew: 13824000K->442901K(13824000K),
> 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times:
> user=15.23 sys=0.12, real=0.86 secs]
> 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K),
> 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times:
> user=2.49 sys=0.02, real=0.14 secs]
>
>
>
>
> On 7/1/13 2:42 PM, "Bernd Eckenfels" <bernd.eckenfels at googlemail.com>
> wrote:
>
>>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi <acolombi at palantir.com>:
>>> My question is: why would it do three ParNews, only 300ms apart from
>>> each other, when the young gen is essentially empty?  Here are three
>>> hypotheses that I have:
>>> * Is the application trying to allocate something giant, e.g. a 1
>>> billion element double[]? Is there a way I can test for this, i.e.
>>> some JVM-level logging that would indicate very large objects being
>>> allocated?
>>
>>I had the same suspicion. (And I don't know of a good tool for the Sun
>>VM; with the IBM VM you can trace it.)
>>
>>> * Is there an explicit System.gc() in 3rd party code? (Our code is
>>> clean.) We're going to disable explicit GC in our next maintenance
>>> period.  But this theory doesn't explain concurrent mode failure.
>>
>>I also don't think System.gc() would trigger three ParNews in a row.
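>>
>>(For that experiment the usual switch is:
>>
>>  -XX:+DisableExplicitGC
>>
>>though note that the -XX:+ExplicitGCInvokesConcurrent you already set
>>turns explicit GCs into concurrent CMS cycles rather than
>>stop-the-world full collections.)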
>>
>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on
>>> every collection?  I've read that CMS tenured gen can suffer from
>>> fragmentation.
>>
>>ParNew is a copying collector, so it compacts automatically. But the
>>promoted objects may of course fragment the old generation, due to the
>>PLABs there. Your log is from about 13h of uptime; do you see this
>>before and after as well?
>>
>>Because there was no follow-up, I will just mention one more candidate
>>to look out for: the changes around CMSWaitDuration (RFE 7189971); I
>>think they have been merged into 1.7.0.
>>
>>And maybe enabling more diagnostics can help:
>>
>>-XX:PrintFLSStatistics=2
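>>
>>For example, a possible diagnostic set for the next run, combining the
>>suggestions in this thread (the log file name is just a placeholder):
>>
>>  -XX:PrintFLSStatistics=2 -XX:+PrintTLAB
>>  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc-diag.log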
>>
>>Greetings
>>Bernd
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>

