Repeated ParNews when Young Gen is Empty?

Bernd Eckenfels bernd.eckenfels at googlemail.com
Thu Jul 11 02:56:54 PDT 2013


Hello,

With user time at 10x the wall time, it seems rather unlikely that some single-threaded root-scanning section is the problem, and unless there is some active spinning it won't be related to safepointing/JNI either. Looks very strange.
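As a rough sanity check, that ratio can be pulled straight out of the `[Times: ...]` lines. A small awk sketch (the sample line is copied from the log quoted below; treat the exact format handling as an assumption about your log layout):

```shell
# Rough sketch: compute user/real per GC from [Times: ...] log lines.
# A ratio near the parallel GC thread count means every worker was busy
# for the whole pause (doing real work or spinning).
awk -F'[=, ]+' '/\[Times:/ {
  for (i = 1; i <= NF; i++) {
    if ($i == "user") u = $(i + 1)
    if ($i == "real") r = $(i + 1)
  }
  if (r > 0) printf "user=%ss real=%ss ratio=%.1f\n", u, r, u / r
}' <<'EOF'
[Times: user=2811.37 sys=1.10, real=122.59 secs]
EOF
```

For the first ParNew below the ratio comes out around 23, which by itself doesn't distinguish useful parallel work from spinning.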

Is that on real or virtual hardware?

Bernd

-- 
bernd.eckenfels.net

Am 11.07.2013 um 11:40 schrieb Thomas Schatzl <thomas.schatzl at oracle.com>:

> Hi,
> 
> On Wed, 2013-07-10 at 23:20 +0000, Andrew Colombi wrote:
>> Thanks for the suggestions.
>> 
>>> CR opened with Oracle
>> No CR with Oracle yet.  I'll see what we can do.
>> 
>>> turn on per-GC-worker stats per phase
>> Is this something I can turn on now?  A quick scan of the GC options
>> I know of doesn't turn up anything that looks like it would.
> 
>  there is nothing that is available in regular builds; you have to go
> to src/share/vm/utilities/taskqueue.hpp and define TASKQUEUE_STATS to 1.
> 
> Then use -XX:+PrintGCDetails and -XX:+ParallelGCVerbose to print some
> per-thread timing statistics that e.g. look like the following:
> 
>     elapsed  --strong roots-- -------termination-------
> thr     ms        ms       %       ms       %   attempts
> --- --------- --------- ------ --------- ------ --------
>  0      1.06      0.77  73.20      0.07   6.63        1
>  1      1.12      0.82  72.92      0.00   0.00        1
>  2      1.13      0.71  62.81      0.00   0.35        1
> 
> i.e. showing total elapsed time per thread, strong root processing time
> (e.g. evacuation) and task termination time.
> 
> Compared to e.g. G1 it is primitive, though.
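For the archive, the hack Thomas describes might look like this in a JDK 7-era HotSpot tree (sketch only; the exact define, its guard, and the build steps may differ in your checkout):

```shell
# Sketch only: flip TASKQUEUE_STATS on in a HotSpot source checkout,
# then rebuild the VM. Path and define are from a JDK 7-era tree.
sed -i 's/#define TASKQUEUE_STATS 0/#define TASKQUEUE_STATS 1/' \
    src/share/vm/utilities/taskqueue.hpp
# After rebuilding, run with the flags Thomas mentions:
#   java -XX:+PrintGCDetails -XX:+ParallelGCVerbose ...
```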
> 
>>> trying ParallelGC
>> Just to be clear, are you recommending we try the throughput collector,
>> e.g. -XX:+UseParallelOldGC.  I'm up for that, given how bad things are
>> with the current configuration.
>> 
>>> trying the latest JVM
>> Definitely a good idea.  We'll try the latest.  If we were to roll back,
>> is there any particular version you would recommend?
>> 
>> -Andrew
>> 
>> On 7/10/13 3:44 PM, "Srinivas Ramakrishna" <ysr1729 at gmail.com> wrote:
>> 
>>> [Just some off the cuff suggestions here, no solutions.]
>>> 
>>> Yikes, this looks to me like a pretty serious bug. Has a CR been
>>> opened with Oracle for this problem?
>>> Could you downrev your version of the JVM to see if you can determine
>>> when the problem may have started?
>>> 
>>> The growing par new times are definitely concerning. I'd suggest that
>>> at the very least, if not already the case,
>>> we should be able to turn on per-GC-worker stats per phase for ParNew
>>> in much the same way that Parallel and G1 provide.
> 
> Afaik the hack above is all there is.
> 
>>> 
>>> You might also try ParallelGC to see if you can replicate the growing
>>> minor gc problem. Given how bad it is, my guess is
>>> that it stems from some linear single-threaded root-scanning issue and
>>> so (at least with an instrumented JVM -- which
>>> you might be able to request from Oracle or build on your own) could be
>>> tracked down quickly.
> 
> The changes above should show that, imo.
> 
>>> 
>>> Also (if possible) try the latest JVM to see if the problem is a known
>>> one that has already been fixed perhaps.
>>> 
>>> 
>>> On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi <acolombi at palantir.com>
>>> wrote:
>>>> Thanks for the response and help.  I've done some more investigation and
>>>> learning, and I have another _fascinating_ log from production.  First,
>>>> here are some things we've done.
>>>> 
>>>> * We're going to enable -XX:+PrintTLAB as a way of learning more about
>>>> how the application is allocating memory in Eden.
>>>> * We're examining areas of the code base that might be allocating large
>>>> objects (we haven't found anything egregious, e.g. exceeding 10-100MB
>>>> allocation).  Nevertheless, we have a few changes that will reduce the
>>>> size of these objects, and we're deploying the change this week.
>>>> 
>>>> Now check out this event (more of the log is below, since it's pretty
>>>> damn interesting):
>>>> 
>>>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K),
>>>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs]
>>>> [Times: user=2811.37 sys=1.10, real=122.59 secs]
>>>> 30914.319: [GC 30914.320: [ParNew
>>>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K),
>>>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs]
>>>> [Times: user=3050.21 sys=0.74, real=132.86 secs]
>>>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K),
>>>> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs]
>>>> [Times: user=3398.88 sys=0.64, real=147.94 secs]
>>>> 
>>>> Notable things:
>>>> 
>>>> * The first ParNew took 2811s of user time, and 122s of wall-clock.  My
>>>> understanding is that the copy collector's performance is primarily
>>>> bound
>>>> by the number of objects that end up getting copied to survivor or
>>>> tenured.  Looking at these numbers, approximately 100MB survived the
>>>> ParNew collection.  100MB surviving hardly seems cause for a 122s pause.
> 
> Note that finding space in the tenured generation may be an issue here
> as it uses free lists. Using PrintFLSStatistics as suggested earlier may
> help in finding issues.
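A sketch of how that flag is passed (note it takes an integer value rather than +/-; the jar name is just a placeholder):

```shell
# Sketch: CMS free-list space statistics, printed around each GC.
# 1 = summary data, 2 = more detail. yourapp.jar is a placeholder.
java -XX:+UseConcMarkSweepGC -XX:PrintFLSStatistics=1 \
     -XX:+PrintGCDetails -jar yourapp.jar
```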
> 
>>>> * Then it prints an empty ParNew line.  What's that about?  Feels like
>>>> the garbage collector is either: i) hitting a race (two threads are racing
>>>> to initiate the ParNew, so they both print the ParNew line), ii)
> 
> The message is printed after serializing the collection requests so
> there should be no race. This is odd.
> 
>>>> following an
>>>> unusual sequence of branches, and it's a bug that it accidentally prints
>>>> the second ParNew.  In either case, I'm going to try to track it down in
>>>> the hotspot source.
>>>> * Another ParNew hits, which takes even longer, but otherwise looks
>>>> similar to the first.
>>>> * Third ParNew, and most interesting: the GC reports that Young Gen
>>>> *GROWS* during GC.  Garbage collection begins at 247MB (why? did I
>>>> really allocate a 12GB object? doesn't seem likely), and ends at
>>>> 312MB.  That's fascinating.
> 
> Maybe there is something odd with TLAB sizing. Can you add
> -XX:+PrintTLAB?
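For reference, a hedged sketch of a flag set for TLAB diagnostics (the jar name is a placeholder):

```shell
# Sketch: PrintTLAB dumps per-thread TLAB sizing/refill stats at each GC;
# alongside the existing GC detail flags it can expose odd TLAB growth.
java -XX:+PrintTLAB -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar yourapp.jar
```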
> 
>>>> On 7/1/13 2:42 PM, "Bernd Eckenfels" <bernd.eckenfels at googlemail.com>
>>>> wrote:
>>>> 
>>>>> Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi
>>>>> <acolombi at palantir.com>:
>>>>>> My question is, Why would it do three ParNews, only 300ms apart from
>>>>>> each other, when the young gen is essentially empty?  Here are three
>>>>>> hypotheses that I have:
>>>>>> * Is the application trying to allocate something giant, e.g. a 1
>>>>>> billion element double[]? Is there a way I can test for this, i.e.
>>>>>> some JVM-level logging that would indicate very large objects being
>>>>>> allocated.
>>>>> 
>>>>> That was a suspicion of mine as well. (And I don't know of a good
>>>>> tool for the Sun VM; with IBM you can trace it.)
> 
> There has been some discussion here recently; the best option seemed to
> be bytecode rewriting, but for some reason it did not work in some cases
> (on g1).
> 
> Thread starts here:
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2013-June/001555.html
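Short of bytecode rewriting, a live-heap class histogram can sometimes catch a giant array after the fact (sketch; `<pid>` is the target JVM's process id):

```shell
# Sketch: jmap's class histogram is sorted by bytes, so a ~8 GB double[]
# (1e9 elements * 8 bytes each) would sit right at the top of the list.
# Note that :live forces a full GC on the target VM first.
jmap -histo:live <pid> | head -n 20
```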
> 
> 
> Hth,
>  Thomas
> 
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

