From kinnari.darji at citi.com Tue Jan 3 13:36:18 2012 From: kinnari.darji at citi.com (Darji, Kinnari ) Date: Tue, 3 Jan 2012 16:36:18 -0500 Subject: ParNew garbage collection Message-ID: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> Hello GC team, I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. 2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew Desired survivor size 19628032 bytes, new threshold 4 (max 4) - age 1: 4466024 bytes, 4466024 total - age 2: 3568136 bytes, 8034160 total - age 3: 3559808 bytes, 11593968 total - age 4: 1737520 bytes, 13331488 total : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] Total time for which application threads were stopped: 1.7197830 seconds Application time: 8.4134780 seconds Thank you Kinnari -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120103/3e75a750/attachment.html From jon.masamitsu at oracle.com Wed Jan 4 09:43:34 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Wed, 04 Jan 2012 09:43:34 -0800 Subject: ParNew garbage collection In-Reply-To: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> References: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> Message-ID: <4F048FC6.30907@oracle.com> Try turning on TraceSafepointCleanupTime. I haven't used it myself. If that's not it, look in share/vm/runtime/globals.hpp for some other flag that traces safepoints. On 1/3/2012 1:36 PM, Darji, Kinnari wrote: > Hello GC team, > I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. > > 2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew > Desired survivor size 19628032 bytes, new threshold 4 (max 4) > - age 1: 4466024 bytes, 4466024 total > - age 2: 3568136 bytes, 8034160 total > - age 3: 3559808 bytes, 11593968 total > - age 4: 1737520 bytes, 13331488 total > : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] > Total time for which application threads were stopped: 1.7197830 seconds > Application time: 8.4134780 seconds > > > > Thank you > Kinnari > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120104/1d676f2c/attachment.html From ysr1729 at gmail.com Wed Jan 4 09:53:52 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 4 Jan 2012 09:53:52 -0800 Subject: ParNew garbage collection In-Reply-To: <4F048FC6.30907@oracle.com> References: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> <4F048FC6.30907@oracle.com> Message-ID: May be also +PrintSafepointStatistics (and related parms) to drill down a bit further, although TraceSafepointCleanup would probably provide all of the info on a per-safepoint basis. 
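(Editorial aside, not part of the original thread: the gap Kinnari reports, 0.04 sec of GC real time against 1.71 sec of stopped time, can be hunted for systematically by scanning the log for safepoints whose stopped time far exceeds the real time of the collection reported just before them. The sketch below assumes the log format shown above; the gc.log file name and the 0.5-second threshold are arbitrary placeholders, and the safepoint flags named in this thread are what would break a confirmed gap down further.)

-------
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans a GC log for safepoints whose reported stopped time greatly exceeds the
// real time of the preceding collection, i.e. time spent in non-GC safepoint work.
public class SafepointGapScanner {
    private static final Pattern GC_REAL =
            Pattern.compile("real=([0-9]+[.,][0-9]+) secs");
    private static final Pattern STOPPED =
            Pattern.compile("stopped: ([0-9]+[.,][0-9]+) seconds");

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "gc.log";   // placeholder file name
        double lastGcReal = 0.0;
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher gc = GC_REAL.matcher(line);
                if (gc.find()) {
                    lastGcReal = parse(gc.group(1));
                }
                Matcher stop = STOPPED.matcher(line);
                if (stop.find()) {
                    double stopped = parse(stop.group(1));
                    if (stopped - lastGcReal > 0.5) {          // arbitrary 500 ms threshold
                        System.out.printf("stopped %.2fs, GC real only %.2fs: %s%n",
                                stopped, lastGcReal, line.trim());
                    }
                }
            }
        } finally {
            in.close();
        }
    }

    private static double parse(String s) {
        return Double.parseDouble(s.replace(',', '.'));        // some logs use a comma decimal separator
    }
}
-------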
There was an old issue wrt monitor deflation that was fixed a few releases ago, so Kinnari should check the version of the JVM she's running on as well.... (There are now a couple of flags related to monitor list handling policies, I believe, but I have no experience with them and do not have the code in front of me -- make sure to cc the runtime list if that turns out to be the issue again and you are already on a very recent version of the JVM.) -- ramki On Wed, Jan 4, 2012 at 9:43 AM, Jon Masamitsu wrote: > ** > Try turning on TraceSafepointCleanupTime. I haven't used it myself. If > that's not it, look in share/vm/runtime/globals.hpp for some other flag > that traces safepoints. > > > On 1/3/2012 1:36 PM, Darji, Kinnari wrote: > > Hello GC team, > I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. > > 2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew > Desired survivor size 19628032 bytes, new threshold 4 (max 4) > - age 1: 4466024 bytes, 4466024 total > - age 2: 3568136 bytes, 8034160 total > - age 3: 3559808 bytes, 11593968 total > - age 4: 1737520 bytes, 13331488 total > : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] > Total time for which application threads were stopped: 1.7197830 seconds > Application time: 8.4134780 seconds > > > > Thank you > Kinnari > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120104/c186dd04/attachment.html From taras.tielkes at gmail.com Thu Jan 5 15:32:50 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Fri, 6 Jan 2012 00:32:50 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4EF9FCAC.3030208@oracle.com> References: <4EF9FCAC.3030208@oracle.com> Message-ID: Hi Jon, We've enabled the PrintPromotionFailure flag in our preprod environment, but so far, no failures yet. We know that the load we generate there is not representative. But perhaps we'll catch something, given enough patience. The flag will also be enabled in our production environment next week - so one way or the other, we'll get more diagnostic data soon. I'll also do some allocation profiling of the application in isolation - I know that there is abusive large byte[] and char[] allocation in there. I've got two questions for now: 1) From googling around on the output to expect (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), I see that -XX:+PrintPromotionFailure will generate output like this: ------- 592.079: [ParNew (0: promotion failure size = 2698) (promotion failed): 135865K->134943K(138240K), 0.1433555 secs] ------- In that example line, what does the "0" stand for?
2) Below is a snippet of (real) gc log from our production application: ------- 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: 345951K->40960K(368640K), 0.0676780 secs] 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 sys=0.01, real=0.06 secs] 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: 368640K->40959K(368640K), 0.0618880 secs] 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 2011-12-30T22:42:24.099+0100: 2136593.001: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] [Times: user=5.70 sys=0.23, real=5.23 secs] 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] 3432839K->3423755K(5201920 K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak refs processing, 0.0034280 secs]2136605.804: [class unloading, 0.0289480 secs]2136605.833: [scrub symbol & string tables, 0.0093940 secs] [1 CMS-remark: 3318289K(4833280K )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, real=7.61 secs] 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 secs] 3491471K->291853K(5201920K), [CMS Perm : 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 sys=0.00, real=10.29 secs] 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] ------- In this case I don't know how to interpret the output. a) There's a promotion failure that took 7.49 secs b) There's a full GC that took 14.08 secs c) There's a concurrent mode failure that took 10.29 secs How are these events, and their (real) times related to each other? Thanks in advance, Taras On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu wrote: > Taras, > > PrintPromotionFailure seems like it would go a long > way to identify the root of your promotion failures (or > at least eliminating some possible causes). ? ?I think it > would help focus the discussion if you could send > the result of that experiment early. 
> > Jon > > On 12/27/2011 5:07 AM, Taras Tielkes wrote: >> Hi, >> >> We're running an application with the CMS/ParNew collectors that is >> experiencing occasional promotion failures. >> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >> I've listed the specific JVM options used below (a). >> >> The application is deployed across a handful of machines, and the >> promotion failures are fairly uniform across those. >> >> The first kind of failure we observe is a promotion failure during >> ParNew collection, I've included a snipped from the gc log below (b). >> The second kind of failure is a concurrrent mode failure (perhaps >> triggered by the same cause), see (c) below. >> The frequency (after running for a some weeks) is approximately once >> per day. This is bearable, but obviously we'd like to improve on this. >> >> Apart from high-volume request handling (which allocates a lot of >> small objects), the application also runs a few dozen background >> threads that download and process XML documents, typically in the 5-30 >> MB range. >> A known deficiency in the existing code is that the XML content is >> copied twice before processing (once to a byte[], and later again to a >> String/char[]). >> Given that a 30 MB XML stream will result in a 60 MB >> java.lang.String/char[], my suspicion is that these big array >> allocations are causing us to run into the CMS fragmentation issue. >> >> My questions are: >> 1) Does the data from the GC logs provide sufficient evidence to >> conclude that CMS fragmentation is the cause of the promotion failure? >> 2) If not, what's the next step of investigating the cause? >> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >> feeling for the size of the objects that fail promotion. >> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >> reliable approach to diagnose CMS fragmentation. Is this indeed the >> case? 
>> >> Thanks in advance, >> Taras >> >> a) Current JVM options: >> -------------------------------- >> -server >> -Xms5g >> -Xmx5g >> -Xmn400m >> -XX:PermSize=256m >> -XX:MaxPermSize=256m >> -XX:+PrintGCTimeStamps >> -verbose:gc >> -XX:+PrintGCDateStamps >> -XX:+PrintGCDetails >> -XX:SurvivorRatio=8 >> -XX:+UseConcMarkSweepGC >> -XX:+UseParNewGC >> -XX:+DisableExplicitGC >> -XX:+UseCMSInitiatingOccupancyOnly >> -XX:+CMSClassUnloadingEnabled >> -XX:+CMSScavengeBeforeRemark >> -XX:CMSInitiatingOccupancyFraction=68 >> -Xloggc:gc.log >> -------------------------------- >> >> b) Promotion failure during ParNew >> -------------------------------- >> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >> 368640K->40959K(368640K), 0.0693460 secs] >> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >> sys=0.01, real=0.07 secs] >> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >> 368639K->31321K(368640K), 0.0511400 secs] >> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >> sys=0.00, real=0.05 secs] >> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >> 359001K->18694K(368640K), 0.0272970 secs] >> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >> sys=0.00, real=0.03 secs] >> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >> (promotion failed): 338813K->361078K(368640K), 0.1321200 >> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >> 3505808K->434291K >> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >> [Times: user=5.24 sys=0.00, real=5.02 secs] >> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >> -------------------------------- >> >> c) Promotion failure during CMS >> -------------------------------- >> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >> 357228K->40960K(368640K), 0.0525110 secs] >> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >> sys=0.00, real=0.05 secs] >> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >> 366075K->37119K(368640K), 0.0479780 secs] >> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >> sys=0.01, real=0.05 secs] >> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >> 364792K->40960K(368640K), 0.0421740 secs] >> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >> sys=0.00, real=0.04 secs] >> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >> user=0.02 sys=0.00, real=0.03 secs] >> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >> 368640K->40960K(368640K), 0.0836690 secs] >> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >> sys=0.01, real=0.08 secs] >> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >> 1.108/1.200 secs] [Times: user=6.83 
sys=0.23, real=1.20 secs] >> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >> 2011-12-14T08:29:30.938+0100: 703019.840: >> [CMS-concurrent-abortable-preclean-start] >> 2011-12-14T08:29:32.337+0100: 703021.239: >> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >> user=6.68 sys=0.27, real=1.40 secs] >> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >> ? 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >> sys=2.58, real=9.88 secs] >> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >> secs]703031.419: [scrub symbol& ?string tables, 0.0094960 secs] [1 CMS >> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >> [Times: user=13.73 sys=2.59, real=10.19 secs] >> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >> ? (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >> secs] 3739469K->433437K(5201920K), [CMS Perm : >> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >> sys=0.00, real=10.96 secs] >> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >> -------------------------------- >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Thu Jan 5 23:27:44 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Thu, 05 Jan 2012 23:27:44 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> Message-ID: <4F06A270.3010701@oracle.com> On 1/5/2012 3:32 PM, Taras Tielkes wrote: > Hi Jon, > > We've enabled the PrintPromotionFailure flag in our preprod > environment, but so far, no failures yet. > We know that the load we generate there is not representative. But > perhaps we'll catch something, given enough patience. > > The flag will also be enabled in our production environment next week > - so one way or the other, we'll get more diagnostic data soon. > I'll also do some allocation profiling of the application in isolation > - I know that there is abusive large byte[] and char[] allocation in > there. 
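(Editorial aside, not from the thread: the "abusive large byte[] and char[] allocation" mentioned above, reading a whole 5-30 MB XML document into a byte[] and then copying it again into a String/char[], is exactly the kind of oversized contiguous allocation that aggravates CMS fragmentation, since a 30 MB stream becomes a roughly 60 MB char[]. Below is a hedged sketch of a streaming alternative using the standard StAX API; the startElement/characters callbacks are hypothetical placeholders, as the real processing code is not shown in the thread.)

-------
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Processes the XML directly off the stream, so the largest allocations are
// per-element strings rather than one contiguous 30 MB byte[] plus a 60 MB char[].
public class StreamingXmlProcessor {

    public void process(InputStream xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    startElement(reader.getLocalName());
                } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                    characters(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
    }

    // Hypothetical callbacks; the actual downstream processing is not part of the thread.
    private void startElement(String name) { /* ... */ }

    private void characters(String text) { /* ... */ }
}
-------

Whether a change like this is feasible naturally depends on whether the downstream processing really needs the whole document in memory at once.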
> > I've got two questions for now: > > 1) From googling around on the output to expect > (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), > I see that -XX:+PrintPromotionFailure will generate output like this: > ------- > 592.079: [ParNew (0: promotion failure size = 2698) (promotion > failed): 135865K->134943K(138240K), 0.1433555 secs] > ------- > In that example line, what does the "0" stand for? It's the index of the GC worker thread that experienced the promotion failure. > 2) Below is a snippet of (real) gc log from our production application: > ------- > 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: > 345951K->40960K(368640K), 0.0676780 secs] > 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 > sys=0.01, real=0.06 secs] > 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: > 368640K->40959K(368640K), 0.0618880 secs] > 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 > sys=0.00, real=0.06 secs] > 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: > 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: > user=0.04 sys=0.00, real=0.04 secs] > 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] > 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: > 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] > 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] > 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: > 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] > 2011-12-30T22:42:24.099+0100: 2136593.001: > [CMS-concurrent-abortable-preclean-start] > CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: > 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] > [Times: user=5.70 sys=0.23, real=5.23 secs] > 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K > (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: > [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] > 3432839K->3423755K(5201920 > K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] > 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak > refs processing, 0.0034280 secs]2136605.804: [class unloading, > 0.0289480 secs]2136605.833: [scrub symbol& string tables, 0.0093940 > secs] [1 CMS-remark: 3318289K(4833280K > )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, > real=7.61 secs] > 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] > 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: > [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: > 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] > (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 > secs] 3491471K->291853K(5201920K), [CMS Perm : > 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 > sys=0.00, real=10.29 secs] > 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: > 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), > 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] > 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: > 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), > 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] > ------- > > In this case I don't know how to interpret the output. 
> a) There's a promotion failure that took 7.49 secs This is the time it took to attempt the minor collection (ParNew) and to do recovery from the failure. > b) There's a full GC that took 14.08 secs > c) There's a concurrent mode failure that took 10.29 secs Not sure about b) and c) because the output is mixed up with the concurrent-sweep output but I think the "concurrent mode failure" message is part of the "Full GC" message. My guess is that the 10.29 is the time for the Full GC and the 14.08 maybe is part of the concurrent-sweep message. Really hard to be sure. Jon > How are these events, and their (real) times related to each other? > > Thanks in advance, > Taras > > On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu wrote: >> Taras, >> >> PrintPromotionFailure seems like it would go a long >> way to identify the root of your promotion failures (or >> at least eliminating some possible causes). I think it >> would help focus the discussion if you could send >> the result of that experiment early. >> >> Jon >> >> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>> Hi, >>> >>> We're running an application with the CMS/ParNew collectors that is >>> experiencing occasional promotion failures. >>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>> I've listed the specific JVM options used below (a). >>> >>> The application is deployed across a handful of machines, and the >>> promotion failures are fairly uniform across those. >>> >>> The first kind of failure we observe is a promotion failure during >>> ParNew collection, I've included a snipped from the gc log below (b). >>> The second kind of failure is a concurrrent mode failure (perhaps >>> triggered by the same cause), see (c) below. >>> The frequency (after running for a some weeks) is approximately once >>> per day. This is bearable, but obviously we'd like to improve on this. >>> >>> Apart from high-volume request handling (which allocates a lot of >>> small objects), the application also runs a few dozen background >>> threads that download and process XML documents, typically in the 5-30 >>> MB range. >>> A known deficiency in the existing code is that the XML content is >>> copied twice before processing (once to a byte[], and later again to a >>> String/char[]). >>> Given that a 30 MB XML stream will result in a 60 MB >>> java.lang.String/char[], my suspicion is that these big array >>> allocations are causing us to run into the CMS fragmentation issue. >>> >>> My questions are: >>> 1) Does the data from the GC logs provide sufficient evidence to >>> conclude that CMS fragmentation is the cause of the promotion failure? >>> 2) If not, what's the next step of investigating the cause? >>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >>> feeling for the size of the objects that fail promotion. >>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>> case? 
>>> >>> Thanks in advance, >>> Taras >>> >>> a) Current JVM options: >>> -------------------------------- >>> -server >>> -Xms5g >>> -Xmx5g >>> -Xmn400m >>> -XX:PermSize=256m >>> -XX:MaxPermSize=256m >>> -XX:+PrintGCTimeStamps >>> -verbose:gc >>> -XX:+PrintGCDateStamps >>> -XX:+PrintGCDetails >>> -XX:SurvivorRatio=8 >>> -XX:+UseConcMarkSweepGC >>> -XX:+UseParNewGC >>> -XX:+DisableExplicitGC >>> -XX:+UseCMSInitiatingOccupancyOnly >>> -XX:+CMSClassUnloadingEnabled >>> -XX:+CMSScavengeBeforeRemark >>> -XX:CMSInitiatingOccupancyFraction=68 >>> -Xloggc:gc.log >>> -------------------------------- >>> >>> b) Promotion failure during ParNew >>> -------------------------------- >>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>> 368640K->40959K(368640K), 0.0693460 secs] >>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>> sys=0.01, real=0.07 secs] >>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>> 368639K->31321K(368640K), 0.0511400 secs] >>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>> sys=0.00, real=0.05 secs] >>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>> 359001K->18694K(368640K), 0.0272970 secs] >>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>> sys=0.00, real=0.03 secs] >>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>> 3505808K->434291K >>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>> -------------------------------- >>> >>> c) Promotion failure during CMS >>> -------------------------------- >>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>> 357228K->40960K(368640K), 0.0525110 secs] >>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>> sys=0.00, real=0.05 secs] >>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>> 366075K->37119K(368640K), 0.0479780 secs] >>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>> sys=0.01, real=0.05 secs] >>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>> 364792K->40960K(368640K), 0.0421740 secs] >>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>> sys=0.00, real=0.04 secs] >>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>> user=0.02 sys=0.00, real=0.03 secs] >>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>> 368640K->40960K(368640K), 0.0836690 secs] >>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>> sys=0.01, real=0.08 secs] >>> 
2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>> 2011-12-14T08:29:30.938+0100: 703019.840: >>> [CMS-concurrent-abortable-preclean-start] >>> 2011-12-14T08:29:32.337+0100: 703021.239: >>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>> user=6.68 sys=0.27, real=1.40 secs] >>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>> sys=2.58, real=9.88 secs] >>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>> secs]703031.419: [scrub symbol& string tables, 0.0094960 secs] [1 CMS >>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>> (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>> sys=0.00, real=10.96 secs] >>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>> -------------------------------- >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From java at java4.info Mon Jan 9 03:08:28 2012 From: java at java4.info (Florian Binder) Date: Mon, 09 Jan 2012 12:08:28 +0100 Subject: Very long young gc pause (ParNew with CMS) Message-ID: <4F0ACAAC.8020103@java4.info> Hi everybody, I am using CMS (with ParNew) gc and have very long (> 6 seconds) young gc pauses. As you can see in the log below the old-gen-heap consists of one large block, the new Size has 256m, it uses 13 worker threads and it has to copy 27505761 words (~210mb) directly from eden to old gen. 
I have seen that this problem occurs only after about one week of uptime. Even thought we make a full (compacting) gc every night. Since real-time > user-time I assume it might be a synchronization problem. Can this be true? Do you have any Ideas how I can speed up this gcs? Please let me know, if you need more informations. Thank you, Flo ##### java -version ##### java version "1.6.0_29" Java(TM) SE Runtime Environment (build 1.6.0_29-b11) Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) ##### The startup parameters: ##### -Xms28G -Xmx28G -XX:+UseConcMarkSweepGC \ -XX:CMSMaxAbortablePrecleanTime=10000 \ -XX:SurvivorRatio=8 \ -XX:TargetSurvivorRatio=90 \ -XX:MaxTenuringThreshold=31 \ -XX:CMSInitiatingOccupancyFraction=80 \ -XX:NewSize=256M \ -verbose:gc \ -XX:+PrintFlagsFinal \ -XX:PrintFLSStatistics=1 \ -XX:+PrintGCDetails \ -XX:+PrintGCDateStamps \ -XX:-TraceClassUnloading \ -XX:+PrintGCApplicationConcurrentTime \ -XX:+PrintGCApplicationStoppedTime \ -XX:+PrintTenuringDistribution \ -XX:+CMSClassUnloadingEnabled \ -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ -Djava.awt.headless=true ##### From the out-file (as of +PrintFlagsFinal): ##### ParallelGCThreads = 13 ##### The gc.log-excerpt: ##### Application time: 20,0617700 seconds 2011-12-22T12:02:03.289+0100: [GC Before GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 1183290963 Max Chunk Size: 1183290963 Number of Blocks: 1 Av. Block Size: 1183290963 Tree Height: 1 Before GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 0 Max Chunk Size: 0 Number of Blocks: 0 Tree Height: 0 [ParNew Desired survivor size 25480392 bytes, new threshold 1 (max 31) - age 1: 28260160 bytes, 28260160 total : 249216K->27648K(249216K), 6,1808130 secs] 20061765K->20056210K(29332480K)After GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 1155785202 Max Chunk Size: 1155785202 Number of Blocks: 1 Av. Block Size: 1155785202 Tree Height: 1 After GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 0 Max Chunk Size: 0 Number of Blocks: 0 Tree Height: 0 , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] Total time for which application threads were stopped: 6,1818730 seconds From ysr1729 at gmail.com Mon Jan 9 10:40:43 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 9 Jan 2012 10:40:43 -0800 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: Haven't looked at any logs, but setting MaxTenuringThreshold to 31 can be bad. I'd dial that down to 8, or leave it at the default of 15. (Your GC logs which must presumably include the tenuring distribution should inform you as to a more optimal size to use. As Kirk noted, premature promotion can be bad, and so can survivor space overflow, which can lead to premature promotion and exacerbate fragmentation.) -- ramki On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. 
Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/17f1facd/attachment.html From java at java4.info Mon Jan 9 11:18:13 2012 From: java at java4.info (Florian Binder) Date: Mon, 09 Jan 2012 20:18:13 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0B3D75.9060602@java4.info> Hi Ramki, Yes, I am agreed with you. 31 is too large and I have removed the parameter (using default now). Nevertheless this is not the problem as the max used age was always 1. Since the most (more than 90%) new allocated objects in our application live for a long time (>1h) we mostly will have premature promotion. Is there a way to optimize this? 
I have seen most time, when young gc needs much time (> 6 secs) there is only one large block in the old gen. If there has been a cms-old-gen-collection and there are more than one blocks in the old generation it is mostly (not always) much faster (needs less than 200ms). Is it possible that premature promotion can not be done parallel if there is only one large block in the old gen? In the past we have had a problem with fragmentation on this server but this is gone since we increased memory for it and triggered a full gc (compacting) every night, like Tony advised us. With setting the initiating occupancy fraction to 80% we have only a few (~10) old generation collections (which are very fast) and the heap fragmentation is low. Flo Am 09.01.2012 19:40, schrieb Srinivas Ramakrishna: > Haven't looked at any logs, but setting MaxTenuringThreshold to 31 can > be bad. I'd dial that down to 8, > or leave it at the default of 15. (Your GC logs which must presumably > include the tenuring distribution should > inform you as to a more optimal size to use. As Kirk noted, premature > promotion can be bad, and so can > survivor space overflow, which can lead to premature promotion and > exacerbate fragmentation.) > > -- ramki > > On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder > wrote: > > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. 
Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 > seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/a2997a2e/attachment.html From jon.masamitsu at oracle.com Mon Jan 9 11:24:05 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Mon, 09 Jan 2012 11:24:05 -0800 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0B3ED5.6010802@oracle.com> Florian, Have you even turned on PrintReferenceGC to see if you are spending a significant amount of time doing Reference processing? If you do see significant Reference processing times , you can try turning on ParallelRefProcEnabled. Jon On 01/09/12 03:08, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time> user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. 
> > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From kirk at kodewerk.com Mon Jan 9 11:06:26 2012 From: kirk at kodewerk.com (Kirk Pepperdine) Date: Mon, 9 Jan 2012 20:06:26 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: Hi Ramki, AFAICT given the limited GC log, the calculated tenuring threshold is always 1 which mean's he always flooding survivor spaces (i.e. suffering from premature promotion). My guess is that the tuning strategy assumes cost of long lived objects dominates and so heap is configured to minimize (survivor) copy costs. But it would appear that this strategy has backfired. Look at young gen size and if you do the maths you can see that there is no chance of there not being premature promotion. WIth the 80% initiating occupancy fraction.. well, that can't lead to anything good either. WIth the VM so misconfigured it's difficult to estimate true live set size which could then be used to calculate more reasonable pool sizes. So, with all the promtion going on, I suspect that fragmentation is making it difficult to reallocate the object in tenuring... hence long pause time. 
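(Editorial aside, putting numbers from Florian's log to Kirk's "do the maths" remark: the young generation is reported as 249216K of usable space, which with SurvivorRatio=8 makes one survivor space roughly 27 MB, and TargetSurvivorRatio=90 gives the "Desired survivor size 25480392 bytes" line; the age-1 cohort alone is 28260160 bytes, so the threshold collapses to 1 and the overflow is promoted at every minor collection. A sketch of that arithmetic, with values that differ slightly from the VM's because of heap alignment:)

-------
// Back-of-the-envelope check against the numbers in Florian's log.
public class SurvivorMath {
    public static void main(String[] args) {
        long youngUsable = 249216L * 1024;             // "(249216K)" = eden + one survivor
        long survivor    = youngUsable / 9;            // SurvivorRatio=8 -> eden is 8x a survivor
        long desired     = survivor * 90 / 100;        // TargetSurvivorRatio=90
        long age1        = 28260160L;                  // "- age 1: 28260160 bytes" in the log

        System.out.println("survivor space  ~" + survivor / 1024 + " KB");
        System.out.println("desired size    ~" + desired + " bytes (log prints 25480392)");
        System.out.println("age-1 cohort fits? " + (age1 <= desired)); // false -> threshold 1
    }
}
-------

That is consistent with the tenuring distribution in the log, where the computed threshold stays at 1 despite a maximum of 31.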
Would you say with these large data strictures that it might be difficult for the CMS to parallelize the scan for roots? The abortable pre-clean aborts on time which means that it's not able to clear out much and given the apparent life-cycle, is it worth running this phase? In fact, would you not guess that the parallel collector do better in this scenario? -- Kirk ps. I'm always happy beat you to the punch.. 'cos it's very difficult to do. ;-) On 2012-01-09, at 7:40 PM, Srinivas Ramakrishna wrote: > Haven't looked at any logs, but setting MaxTenuringThreshold to 31 can be bad. I'd dial that down to 8, > or leave it at the default of 15. (Your GC logs which must presumably include the tenuring distribution should > inform you as to a more optimal size to use. As Kirk noted, premature promotion can be bad, and so can > survivor space overflow, which can lead to premature promotion and exacerbate fragmentation.) > > -- ramki > > On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. 
Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/08d12ac9/attachment.html From chkwok at digibites.nl Mon Jan 9 11:33:52 2012 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Mon, 9 Jan 2012 20:33:52 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: Just making sure the obvious case is covered: is it just me or is 6s real > 3.5s user+sys with 13 threads just plain weird? That means there was 0.5 thread actually running on the average during that collection. Do a sar -B (requires package sysstat) and see if there were any major pagefaults (or indirectly via cacti and other monitoring tools via memory usage, load average etc, or even via cat /proc/vmstat and pgmajfault), I've seen those cause these kind of times during GC. Chi Ho Kwok On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. 
> > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/ffc7400e/attachment.html From java at java4.info Mon Jan 9 11:47:32 2012 From: java at java4.info (Florian Binder) Date: Mon, 09 Jan 2012 20:47:32 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0B4454.2010206@java4.info> Yes! You are right! I have a lot of page faults when gc is taking so much time. For example (sar -B): 00:00:01 pgpgin/s pgpgout/s fault/s majflt/s 00:50:01 0,01 45,18 162,29 0,00 01:00:01 0,02 46,58 170,45 0,00 01:10:02 25313,71 27030,39 27464,37 0,02 01:20:02 23456,85 25371,28 13621,92 0,01 01:30:01 22778,76 22918,60 10136,71 0,03 01:40:11 19020,44 22723,65 8617,42 0,15 01:50:01 5,52 44,22 147,26 0,05 What is this meaning and how can I avoid it? Flo Am 09.01.2012 20:33, schrieb Chi Ho Kwok: > Just making sure the obvious case is covered: is it just me or is 6s > real > 3.5s user+sys with 13 threads just plain weird? 
That means > there was 0.5 thread actually running on the average during that > collection. > > Do a sar -B (requires package sysstat) and see if there were any major > pagefaults (or indirectly via cacti and other monitoring tools via > memory usage, load average etc, or even via cat /proc/vmstat and > pgmajfault), I've seen those cause these kind of times during GC. > > > Chi Ho Kwok > > On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder > wrote: > > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. 
Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 > seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/4bdedec2/attachment-0001.html From chkwok at digibites.nl Mon Jan 9 21:21:48 2012 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Tue, 10 Jan 2012 06:21:48 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0B4454.2010206@java4.info> References: <4F0ACAAC.8020103@java4.info> <4F0B4454.2010206@java4.info> Message-ID: Hi Florian, Uh, you might want to try sar -r as well, that reports memory usage (and man sar for other reporting options, and -f /var/log/sysstat/saXX where xx is the day for older data is useful as well). Page in / out means reading or writing to the swap file, usual cause here is one or more huge background task / cron jobs taking up too much memory forcing other things to swap out. You can try reducing the size of the heap and see if it helps if you're just a little bit short, but otherwise I don't think you can solve this with just VM options. Here's the relevant section from the manual: -B Report paging statistics. Some of the metrics below are > available only with post 2.5 kernels. The following values are displayed: > pgpgin/s > Total number of kilobytes the system paged in from > disk per second. Note: With old kernels (2.2.x) this value is a number of > blocks > per second (and not kilobytes). > pgpgout/s > Total number of kilobytes the system paged out to > disk per second. Note: With old kernels (2.2.x) this value is a number of > blocks > per second (and not kilobytes). > fault/s > Number of page faults (major + minor) made by the > system per second. This is not a count of page faults that generate I/O, > because > some page faults can be resolved without I/O. > majflt/s > Number of major faults the system has made per > second, those which have required loading a memory page from disk. I'm not sure what kernel you're on, but pgpgin / out being high is a bad thing. Sar seems to report that all faults are minor, but that conflicts with the first two columns. Chi Ho Kwok On Mon, Jan 9, 2012 at 8:47 PM, Florian Binder wrote: > Yes! > You are right! > I have a lot of page faults when gc is taking so much time. > > For example (sar -B): > 00:00:01 pgpgin/s pgpgout/s fault/s majflt/s > 00:50:01 0,01 45,18 162,29 0,00 > 01:00:01 0,02 46,58 170,45 0,00 > 01:10:02 25313,71 27030,39 27464,37 0,02 > 01:20:02 23456,85 25371,28 13621,92 0,01 > 01:30:01 22778,76 22918,60 10136,71 0,03 > 01:40:11 19020,44 22723,65 8617,42 0,15 > 01:50:01 5,52 44,22 147,26 0,05 > > What is this meaning and how can I avoid it? > > > Flo > > > > Am 09.01.2012 20:33, schrieb Chi Ho Kwok: > > Just making sure the obvious case is covered: is it just me or is 6s real > > 3.5s user+sys with 13 threads just plain weird? That means there was 0.5 > thread actually running on the average during that collection. 
> > Do a sar -B (requires package sysstat) and see if there were any major > pagefaults (or indirectly via cacti and other monitoring tools via memory > usage, load average etc, or even via cat /proc/vmstat and pgmajfault), I've > seen those cause these kind of times during GC. > > > Chi Ho Kwok > > On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder wrote: > >> Hi everybody, >> >> I am using CMS (with ParNew) gc and have very long (> 6 seconds) young >> gc pauses. >> As you can see in the log below the old-gen-heap consists of one large >> block, the new Size has 256m, it uses 13 worker threads and it has to >> copy 27505761 words (~210mb) directly from eden to old gen. >> I have seen that this problem occurs only after about one week of >> uptime. Even thought we make a full (compacting) gc every night. >> Since real-time > user-time I assume it might be a synchronization >> problem. Can this be true? >> >> Do you have any Ideas how I can speed up this gcs? >> >> Please let me know, if you need more informations. >> >> Thank you, >> Flo >> >> >> ##### java -version ##### >> java version "1.6.0_29" >> Java(TM) SE Runtime Environment (build 1.6.0_29-b11) >> Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) >> >> ##### The startup parameters: ##### >> -Xms28G -Xmx28G >> -XX:+UseConcMarkSweepGC \ >> -XX:CMSMaxAbortablePrecleanTime=10000 \ >> -XX:SurvivorRatio=8 \ >> -XX:TargetSurvivorRatio=90 \ >> -XX:MaxTenuringThreshold=31 \ >> -XX:CMSInitiatingOccupancyFraction=80 \ >> -XX:NewSize=256M \ >> >> -verbose:gc \ >> -XX:+PrintFlagsFinal \ >> -XX:PrintFLSStatistics=1 \ >> -XX:+PrintGCDetails \ >> -XX:+PrintGCDateStamps \ >> -XX:-TraceClassUnloading \ >> -XX:+PrintGCApplicationConcurrentTime \ >> -XX:+PrintGCApplicationStoppedTime \ >> -XX:+PrintTenuringDistribution \ >> -XX:+CMSClassUnloadingEnabled \ >> -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ >> -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ >> >> -Djava.awt.headless=true >> >> ##### From the out-file (as of +PrintFlagsFinal): ##### >> ParallelGCThreads = 13 >> >> ##### The gc.log-excerpt: ##### >> Application time: 20,0617700 seconds >> 2011-12-22T12:02:03.289+0100: [GC Before GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 1183290963 >> Max Chunk Size: 1183290963 >> Number of Blocks: 1 >> Av. Block Size: 1183290963 >> Tree Height: 1 >> Before GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 0 >> Max Chunk Size: 0 >> Number of Blocks: 0 >> Tree Height: 0 >> [ParNew >> Desired survivor size 25480392 bytes, new threshold 1 (max 31) >> - age 1: 28260160 bytes, 28260160 total >> : 249216K->27648K(249216K), 6,1808130 secs] >> 20061765K->20056210K(29332480K)After GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 1155785202 >> Max Chunk Size: 1155785202 >> Number of Blocks: 1 >> Av. 
Block Size: 1155785202 >> Tree Height: 1 >> After GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 0 >> Max Chunk Size: 0 >> Number of Blocks: 0 >> Tree Height: 0 >> , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] >> Total time for which application threads were stopped: 6,1818730 seconds >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/ea863255/attachment.html From vitalyd at gmail.com Mon Jan 9 21:43:36 2012 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 10 Jan 2012 00:43:36 -0500 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> <4F0B4454.2010206@java4.info> Message-ID: Apparently pgpgin/pgpgout may not be that accurate to determine swap file usage: http://help.lockergnome.com/linux/pgpgin-pgpgout-measure--ftopict506279.html May need to use vmstat and look at si/so instead. On Jan 10, 2012 12:24 AM, "Chi Ho Kwok" wrote: > Hi Florian, > > Uh, you might want to try sar -r as well, that reports memory usage (and > man sar for other reporting options, and -f /var/log/sysstat/saXX where xx > is the day for older data is useful as well). Page in / out means reading > or writing to the swap file, usual cause here is one or more huge > background task / cron jobs taking up too much memory forcing other things > to swap out. You can try reducing the size of the heap and see if it helps > if you're just a little bit short, but otherwise I don't think you can > solve this with just VM options. > > > Here's the relevant section from the manual: > > -B Report paging statistics. Some of the metrics below are >> available only with post 2.5 kernels. The following values are displayed: >> pgpgin/s >> Total number of kilobytes the system paged in from >> disk per second. Note: With old kernels (2.2.x) this value is a number of >> blocks >> per second (and not kilobytes). >> pgpgout/s >> Total number of kilobytes the system paged out to >> disk per second. Note: With old kernels (2.2.x) this value is a number of >> blocks >> per second (and not kilobytes). >> fault/s >> Number of page faults (major + minor) made by the >> system per second. This is not a count of page faults that generate I/O, >> because >> some page faults can be resolved without I/O. >> majflt/s >> Number of major faults the system has made per >> second, those which have required loading a memory page from disk. > > > I'm not sure what kernel you're on, but pgpgin / out being high is a bad > thing. Sar seems to report that all faults are minor, but that conflicts > with the first two columns. > > > Chi Ho Kwok > > On Mon, Jan 9, 2012 at 8:47 PM, Florian Binder wrote: > >> Yes! >> You are right! >> I have a lot of page faults when gc is taking so much time. >> >> For example (sar -B): >> 00:00:01 pgpgin/s pgpgout/s fault/s majflt/s >> 00:50:01 0,01 45,18 162,29 0,00 >> 01:00:01 0,02 46,58 170,45 0,00 >> 01:10:02 25313,71 27030,39 27464,37 0,02 >> 01:20:02 23456,85 25371,28 13621,92 0,01 >> 01:30:01 22778,76 22918,60 10136,71 0,03 >> 01:40:11 19020,44 22723,65 8617,42 0,15 >> 01:50:01 5,52 44,22 147,26 0,05 >> >> What is this meaning and how can I avoid it? 
>> >> >> Flo >> >> >> >> Am 09.01.2012 20:33, schrieb Chi Ho Kwok: >> >> Just making sure the obvious case is covered: is it just me or is 6s real >> > 3.5s user+sys with 13 threads just plain weird? That means there was 0.5 >> thread actually running on the average during that collection. >> >> Do a sar -B (requires package sysstat) and see if there were any major >> pagefaults (or indirectly via cacti and other monitoring tools via memory >> usage, load average etc, or even via cat /proc/vmstat and pgmajfault), I've >> seen those cause these kind of times during GC. >> >> >> Chi Ho Kwok >> >> On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder wrote: >> >>> Hi everybody, >>> >>> I am using CMS (with ParNew) gc and have very long (> 6 seconds) young >>> gc pauses. >>> As you can see in the log below the old-gen-heap consists of one large >>> block, the new Size has 256m, it uses 13 worker threads and it has to >>> copy 27505761 words (~210mb) directly from eden to old gen. >>> I have seen that this problem occurs only after about one week of >>> uptime. Even thought we make a full (compacting) gc every night. >>> Since real-time > user-time I assume it might be a synchronization >>> problem. Can this be true? >>> >>> Do you have any Ideas how I can speed up this gcs? >>> >>> Please let me know, if you need more informations. >>> >>> Thank you, >>> Flo >>> >>> >>> ##### java -version ##### >>> java version "1.6.0_29" >>> Java(TM) SE Runtime Environment (build 1.6.0_29-b11) >>> Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) >>> >>> ##### The startup parameters: ##### >>> -Xms28G -Xmx28G >>> -XX:+UseConcMarkSweepGC \ >>> -XX:CMSMaxAbortablePrecleanTime=10000 \ >>> -XX:SurvivorRatio=8 \ >>> -XX:TargetSurvivorRatio=90 \ >>> -XX:MaxTenuringThreshold=31 \ >>> -XX:CMSInitiatingOccupancyFraction=80 \ >>> -XX:NewSize=256M \ >>> >>> -verbose:gc \ >>> -XX:+PrintFlagsFinal \ >>> -XX:PrintFLSStatistics=1 \ >>> -XX:+PrintGCDetails \ >>> -XX:+PrintGCDateStamps \ >>> -XX:-TraceClassUnloading \ >>> -XX:+PrintGCApplicationConcurrentTime \ >>> -XX:+PrintGCApplicationStoppedTime \ >>> -XX:+PrintTenuringDistribution \ >>> -XX:+CMSClassUnloadingEnabled \ >>> -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ >>> -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ >>> >>> -Djava.awt.headless=true >>> >>> ##### From the out-file (as of +PrintFlagsFinal): ##### >>> ParallelGCThreads = 13 >>> >>> ##### The gc.log-excerpt: ##### >>> Application time: 20,0617700 seconds >>> 2011-12-22T12:02:03.289+0100: [GC Before GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 1183290963 >>> Max Chunk Size: 1183290963 >>> Number of Blocks: 1 >>> Av. Block Size: 1183290963 >>> Tree Height: 1 >>> Before GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 0 >>> Max Chunk Size: 0 >>> Number of Blocks: 0 >>> Tree Height: 0 >>> [ParNew >>> Desired survivor size 25480392 bytes, new threshold 1 (max 31) >>> - age 1: 28260160 bytes, 28260160 total >>> : 249216K->27648K(249216K), 6,1808130 secs] >>> 20061765K->20056210K(29332480K)After GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 1155785202 >>> Max Chunk Size: 1155785202 >>> Number of Blocks: 1 >>> Av. 
Block Size: 1155785202 >>> Tree Height: 1 >>> After GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 0 >>> Max Chunk Size: 0 >>> Number of Blocks: 0 >>> Tree Height: 0 >>> , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] >>> Total time for which application threads were stopped: 6,1818730 seconds >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >> >> >> > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > >

From fancyerii at gmail.com Tue Jan 10 01:31:07 2012 From: fancyerii at gmail.com (Li Li) Date: Tue, 10 Jan 2012 17:31:07 +0800 Subject: MaxTenuringThreshold available in ParNewGC? Message-ID: Hi all, I have an application that generates many large objects and then discards them. I found that a full gc can bring heap usage down from 70% to 40%. I want to keep these objects in the young generation longer. I found -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. But I found a blog post saying that MaxTenuringThreshold is not used by ParNewGC. I use ParNewGC+CMS. I tried setting MaxTenuringThreshold=10, but it seems to make no difference.

From fancyerii at gmail.com Tue Jan 10 01:49:10 2012 From: fancyerii at gmail.com (Li Li) Date: Tue, 10 Jan 2012 17:49:10 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: By the way, is there any web page that lists all the JVM parameters and their default values? I am confused because they are spread across many documents and some are deprecated. On Tue, Jan 10, 2012 at 5:31 PM, Li Li wrote: > hi all > I have an application that generating many large objects and then > discard them. I found that full gc can free memory from 70% to 40%. > btw, is there any web page that list all JVM parameters and their default > values? > > I want to let this objects in young generation longer. I found > -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used in > ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it > seems no difference. >

From java at java4.info Tue Jan 10 02:23:26 2012 From: java at java4.info (Florian Binder) Date: Tue, 10 Jan 2012 11:23:26 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: <4F0C119E.7090600@java4.info> At http://cr.openjdk.java.net/~brutisso/7016112/webrev.02/src/share/vm/runtime/globals.hpp.html you will find the source code with most of the JVM parameters.
I know it is a webrev and not the newest file, but it contains most of the parameters with a short description ;-) Another way is to enable PrintFlagsFinal or PrintFlagsInitial, or just run: java -XX:+PrintFlagsFinal Flo

On 10.01.2012 10:49, Li Li wrote: > btw, is there any web page that list all the jvm parameters and their > default values? I am confused that they are distributed in many > documents and some are deprecated. > > On Tue, Jan 10, 2012 at 5:31 PM, Li Li > wrote: > > hi all > I have an application that generating many large objects and > then discard them. I found that full gc can free memory from 70% > to 40%. > btw, is there any web page that list all JVM parameters and their > default values? > > > I want to let this objects in young generation longer. I found > -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used > in ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, > but it seems no difference. > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From bengt.rutisson at oracle.com Tue Jan 10 05:17:18 2012 From: bengt.rutisson at oracle.com (Bengt Rutisson) Date: Tue, 10 Jan 2012 14:17:18 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <4F0C119E.7090600@java4.info> References: <4F0C119E.7090600@java4.info> Message-ID: <4F0C3A5E.5010300@oracle.com> On 2012-01-10 11:23, Florian Binder wrote: > At > http://cr.openjdk.java.net/~brutisso/7016112/webrev.02/src/share/vm/runtime/globals.hpp.html This is actually a link to one of my webrevs. It could be removed any day. A more stable way of finding the source for globals.hpp is to look in the mercurial repository for OpenJDK: http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/file/97c00e21fecb/src/share/vm/runtime/globals.hpp Bengt > > you have the source code with most jvm-parameters. > I know, it is a webrev and not the newest file, but there are the most > parameters with a short description ;-) > > An other way is to enable PrintFlagsFinal or PrintFlagsInitial or just > run: > java -XX:+PrintFlagsFinal > > Flo > > > On 10.01.2012 10:49, Li Li wrote: >> btw, is there any web page that list all the jvm parameters and their >> default values? I am confused that they are distributed in many >> documents and some are deprecated. >> >> On Tue, Jan 10, 2012 at 5:31 PM, Li Li > > wrote: >> >> hi all >> I have an application that generating many large objects and >> then discard them. I found that full gc can free memory from 70% >> to 40%. >> btw, is there any web page that list all JVM parameters and their >> default values? >> >> >> I want to let this objects in young generation longer. I found >> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >> But I found a blog that says MaxTenuringThreshold is not used >> in ParNewGC. >> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, >> but it seems no difference.
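(For example — assuming a reasonably recent HotSpot build; the exact set of flags and their defaults varies between versions — the current values can be dumped with:

java -XX:+PrintFlagsFinal -version

and a single flag's default can be picked out with a filter such as:

java -XX:+PrintFlagsFinal -version | grep MaxTenuringThreshold

The -version argument just gives the VM something to execute before it exits.)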
>> >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/8957224e/attachment.html From charlie.hunt at oracle.com Tue Jan 10 05:11:02 2012 From: charlie.hunt at oracle.com (charlie hunt) Date: Tue, 10 Jan 2012 07:11:02 -0600 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <4F0C119E.7090600@java4.info> References: <4F0C119E.7090600@java4.info> Message-ID: <4F0C38E6.80301@oracle.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/40491552/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5166 bytes Desc: S/MIME Cryptographic Signature Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/40491552/smime-0001.p7s From kinnari.darji at citi.com Tue Jan 10 08:24:01 2012 From: kinnari.darji at citi.com (Darji, Kinnari ) Date: Tue, 10 Jan 2012 11:24:01 -0500 Subject: ParNew garbage collection In-Reply-To: References: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> <4F048FC6.30907@oracle.com> Message-ID: <21ED8E3420CDB647B88C7F80A7D64DAC06F0901CA6@exnjmb89.nam.nsroot.net> I am using jdk-1.6.0_16 version. Is there any known issue with this version? Also I have tried +PrintSafepointStatistics option in past but it prints out time which does not actual correlate to the time. It is hard to match it up the occurrence of event. I will try with TraceSafepointCleanup option. Thank you Kinnari From: hotspot-gc-use-bounces at openjdk.java.net [mailto:hotspot-gc-use-bounces at openjdk.java.net] On Behalf Of Srinivas Ramakrishna Sent: Wednesday, January 04, 2012 12:54 PM To: Jon Masamitsu Cc: hotspot-gc-use at openjdk.java.net Subject: Re: ParNew garbage collection May be also +PrintSafepointStatistics (and related parms) to drill down a bit further, although TraceSafepointCleanup would probably provide all of the info on a per-safepoint basis. There was an old issue wrt monitor deflation that was foixed a few releases ago, so Kinnari should check the version of the JVM she's running on as well.... (There are now a couple of flags related to monitor list handling policies i believe but i have no experience with them and do not have the code in front of me -- make sure to cc the runtime list if that turns out to be the issue again and you are already on a very recent version of the JVM.) -- ramki On Wed, Jan 4, 2012 at 9:43 AM, Jon Masamitsu > wrote: Try turning on TraceSafepointCleanupTime. I haven't used it myself. If that's not it, look in share/vm/runtime/globals.hpp for some other flag that traces safepoints. On 1/3/2012 1:36 PM, Darji, Kinnari wrote: Hello GC team, I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. 
2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew Desired survivor size 19628032 bytes, new threshold 4 (max 4) - age 1: 4466024 bytes, 4466024 total - age 2: 3568136 bytes, 8034160 total - age 3: 3559808 bytes, 11593968 total - age 4: 1737520 bytes, 13331488 total : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] Total time for which application threads were stopped: 1.7197830 seconds Application time: 8.4134780 seconds Thank you Kinnari _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From ysr1729 at gmail.com Tue Jan 10 09:23:53 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Tue, 10 Jan 2012 09:23:53 -0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: I recommend Charlie's excellent book as well. To answer your question, yes, CMS + ParNew does use MaxTenuringThreshold (henceforth MTT), but in order to allow objects to age you also need sufficiently large survivor spaces to hold them for however long you wish, otherwise the adaptive tenuring policy will adjust the "current" tenuring threshold so as to prevent overflow. That may be what you saw. Check out the info printed by +PrintTenuringDistribution. -- ramki On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: > hi all > I have an application that generating many large objects and then > discard them. I found that full gc can free memory from 70% to 40%. > I want to let this objects in young generation longer. I found > -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used in > ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it > seems no difference. > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > >

From fancyerii at gmail.com Tue Jan 10 20:45:21 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 12:45:21 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: If the young generation is so small that it cannot hold the survivors and has to promote them to the old generation, and the JVM detects this, will it turn down the tenuring threshold? I set the tenuring threshold to 10 and found that full gc is less frequent and each full gc collects less garbage, so the parameter seems to have an effect. But I found that the load average is up and young gc takes much more time than before, and the response time has also increased. I guess that there are more objects in the young generation, so it has to do more young gcs. Although they are garbage, it is not a good idea to collect them too early; because ParNew stops the world, the response time increases.
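(As an illustration of the survivor-space point above — the sizes and values below are arbitrary examples, not recommendations: with settings along the lines of

-XX:NewSize=512M -XX:MaxNewSize=512M \
-XX:SurvivorRatio=6 \
-XX:MaxTenuringThreshold=10 \
-XX:+PrintTenuringDistribution

each survivor space would be 512M/8 = 64M, and the "Desired survivor size ... new threshold N (max 10)" lines that PrintTenuringDistribution writes to the GC log show whether the computed threshold actually stays near the configured maximum, or is clamped down to 1 because the survivor spaces overflow.)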
So I adjusted TenuringThreshold to 3 and there was no remarkable difference. Maybe I should use an object pool for my application because it uses many large temporary objects. Another question: when my application has been running for about 1-2 days, I find the response time increases. I guess it's a problem of the large young generation. In the beginning, the total memory usage is about 4-5GB and the young generation is 100-200MB, the rest being old generation. After running for days, the total memory usage is 8GB and the young generation is about 2GB (I set NewRatio so that young:old is 1:3). I am curious about the heap size adjustment. I found -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio; the default values are 40 and 70. The memory management white paper says that if the total heap free space is less than 40%, the heap is grown, and if the free space is larger than 70%, the heap is shrunk. But why do I see the young generation at 200MB while the old generation is 4GB? Is the sizing of the young generation related to the old generation? I read in http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ that the young generation should be less than 512MB; is that correct? On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: > I recommend Charlie's excellent book as well. > > To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold > (henceforth MTT), > but in order to allow objects to age you also need sufficiently large > survivor spaces to hold > them for however long you wish, otherwise the adaptive tenuring policy > will adjust the > "current" tenuring threshold so as to prevent overflow. That may be what > you saw. > Check out the info printed by +PrintTenuringThreshold. > > -- ramki > > On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: > >> hi all >> I have an application that generating many large objects and then >> discard them. I found that full gc can free memory from 70% to 40%. >> I want to let this objects in young generation longer. I found >> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >> But I found a blog that says MaxTenuringThreshold is not used in >> ParNewGC. >> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >> seems no difference. >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >

From fancyerii at gmail.com Tue Jan 10 23:47:29 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 15:47:29 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: 1. I don't understand why the tenuring threshold is calculated to be 1. 2. I don't set Xms, I just set Xmx=8g. 3. As for the memory leak, I will try to find it. On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: > Hi Li LI, > > I fear that you are off in the wrong direction. Resetting tenuring > thresholds in this case will never work because they are being calculated > to be 1. You're suggesting numbers greater than 1 and so 1 will always be > used which explains why you're not seeing a difference between runs. Having > a calculated tenuring threshold set to 1 implies that the memory pool is > too small. If the a memory pool is too small the only thing you can do to > fix that is to make it bigger.
In this case, your young generational space > (as I've indicated in previous postings) is too small. Also, the cost of a > young generational collection is dependent mostly upon the number of > surviving objects, not dead ones. Pooling temporary objects will only make > the problem worse. If I recall your flag settings, you've set netsize to a > fixed value. That setting will override the the new ratio setting. You also > set Xmx==Xms and that also override adaptive sizing. Also you are using CMS > which is inherently not size adaptable. > > Last point, and this is the biggest one. The numbers you're publishing > right now suggest that you have a memory leak. There is no way you're going > to stabilize the memory /gc behaviour with a memory leak. Things will get > progressively worse as you consume more and more heap. This is a blocking > issue to all tuning efforts. It is the first thing that must be dealt with. > > To find the leak; > Identify the leaking object useing VisualVM's memory profiler with > generational counts and collect allocation stack traces turned on. Sort the > profile by generational counts. When you've identified the leaking object, > the domain class with the highest and always increasing generational count. > take an allocation stack trace snapshot and a heap dump. The heap dump > should be loaded into a heap walker. Use the knowledge gained from > generational counts to inspect the linkages for the leaking object and then > use that information in the allocation stack traces to identify casual > execution paths for creation. After that, it's into application code to > determine the fix. > > Kind regards, > Kirk Pepperdine > > On 2012-01-11, at 5:45 AM, Li Li wrote: > > if the young generation is too small that it can't afford space for > survivors and it have to throw them to old generation. and jvm found this, > it will turn down TenuringThreshold ? > I set TenuringThreshold to 10. and found that the full gc is less > frequent and every full gc collect less garbage. it seems the parameter > have the effect. But I found the load average is up and young gc time is > much more than before. And the response time is also increased. > I guess that there are more objects in young generation. so it have to > do more young gc. although they are garbage, it's not a good idea to > collect them too early. because ParNewGC will stop the world, the response > time is increasing. > So I adjust TenuringThreshold to 3 and there are no remarkable > difference. > maybe I should use object pool for my application because it use many > large temporary objects. > Another question, when my application runs for about 1-2 days. I found > the response time increases. I guess it's the problem of large young > generation. > in the beginning, the total memory usage is about 4-5GB and young > generation is 100-200MB, the rest is old generation. > After running for days, the total memory usage is 8GB and young > generation is about 2GB(I set new Ration 1:3) > I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation > and ?XX:MaxHeapFreeRation > the default value is 40 and 70. the memory manage white paper says if > the total heap free space is less than 40%, it will increase heap. if the > free space is larger than 70%, it will decrease heap size. > But why I see the young generation is 200mb while old is 4gb. does the > adjustment of young related to old generation? 
> I read in > http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young > generation should be less than 512MB, is it correct? > > > > On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: > >> I recommend Charlie's excellent book as well. >> >> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold >> (henceforth MTT), >> but in order to allow objects to age you also need sufficiently large >> survivor spaces to hold >> them for however long you wish, otherwise the adaptive tenuring policy >> will adjust the >> "current" tenuring threshold so as to prevent overflow. That may be what >> you saw. >> Check out the info printed by +PrintTenuringThreshold. >> >> -- ramki >> >> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: >> >>> hi all >>> I have an application that generating many large objects and then >>> discard them. I found that full gc can free memory from 70% to 40%. >>> I want to let this objects in young generation longer. I found >>> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >>> But I found a blog that says MaxTenuringThreshold is not used in >>> ParNewGC. >>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >>> seems no difference. >>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> > >

From fancyerii at gmail.com Wed Jan 11 00:06:48 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 16:06:48 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: I understand the first one. As for Xmx: when it reaches the maximum of 8GB, the young generation is indeed 1.8G and Eden:s0:s1 = 8:1:1. That's correct. But when I have just restarted it and it has run for only a few minutes, the old generation is 4GB while the young generation is only 200-300MB. I don't think there is a memory leak, because it has been running for more than a month without an OOM. My application uses Lucene+Solr to provide a search service, which needs a lot of memory. On Wed, Jan 11, 2012 at 3:55 PM, Kirk Pepperdine wrote: > > On 2012-01-11, at 8:47 AM, Li Li wrote: > > 1. I don't understand why tenuring thresholds are > calculated to be 1 > > > because the number of expected survivors exceeds the size of the survivor > space > > 2. I don't set Xms, I just set Xmx=8g > > > with a new ratio of 3.. you should have 2 gigs of young gen meaning a .2 > gigs for each survivor space and 1.6 for young gen. Do you have a GC log > you can use to confirm these values? If not try visualvm and this plugin > should give you a clear view (www.java.net/projects/memorypoolview). > > > 3. as for memory leak, I will try to find it. > > On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: > >> Hi Li LI, >> >> I fear that you are off in the wrong direction. Resetting tenuring >> thresholds in this case will never work because they are being calculated >> to be 1. You're suggesting numbers greater than 1 and so 1 will always be >> used which explains why you're not seeing a difference between runs.
Having >> a calculated tenuring threshold set to 1 implies that the memory pool is >> too small. If the a memory pool is too small the only thing you can do to >> fix that is to make it bigger. In this case, your young generational space >> (as I've indicated in previous postings) is too small. Also, the cost of a >> young generational collection is dependent mostly upon the number of >> surviving objects, not dead ones. Pooling temporary objects will only make >> the problem worse. If I recall your flag settings, you've set netsize to a >> fixed value. That setting will override the the new ratio setting. You also >> set Xmx==Xms and that also override adaptive sizing. Also you are using CMS >> which is inherently not size adaptable. >> >> Last point, and this is the biggest one. The numbers you're publishing >> right now suggest that you have a memory leak. There is no way you're going >> to stabilize the memory /gc behaviour with a memory leak. Things will get >> progressively worse as you consume more and more heap. This is a blocking >> issue to all tuning efforts. It is the first thing that must be dealt with. >> >> To find the leak; >> Identify the leaking object useing VisualVM's memory profiler with >> generational counts and collect allocation stack traces turned on. Sort the >> profile by generational counts. When you've identified the leaking object, >> the domain class with the highest and always increasing generational count. >> take an allocation stack trace snapshot and a heap dump. The heap dump >> should be loaded into a heap walker. Use the knowledge gained from >> generational counts to inspect the linkages for the leaking object and then >> use that information in the allocation stack traces to identify casual >> execution paths for creation. After that, it's into application code to >> determine the fix. >> >> Kind regards, >> Kirk Pepperdine >> >> On 2012-01-11, at 5:45 AM, Li Li wrote: >> >> if the young generation is too small that it can't afford space for >> survivors and it have to throw them to old generation. and jvm found this, >> it will turn down TenuringThreshold ? >> I set TenuringThreshold to 10. and found that the full gc is less >> frequent and every full gc collect less garbage. it seems the parameter >> have the effect. But I found the load average is up and young gc time is >> much more than before. And the response time is also increased. >> I guess that there are more objects in young generation. so it have to >> do more young gc. although they are garbage, it's not a good idea to >> collect them too early. because ParNewGC will stop the world, the response >> time is increasing. >> So I adjust TenuringThreshold to 3 and there are no remarkable >> difference. >> maybe I should use object pool for my application because it use many >> large temporary objects. >> Another question, when my application runs for about 1-2 days. I found >> the response time increases. I guess it's the problem of large young >> generation. >> in the beginning, the total memory usage is about 4-5GB and young >> generation is 100-200MB, the rest is old generation. >> After running for days, the total memory usage is 8GB and young >> generation is about 2GB(I set new Ration 1:3) >> I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation >> and ?XX:MaxHeapFreeRation >> the default value is 40 and 70. the memory manage white paper says if >> the total heap free space is less than 40%, it will increase heap. 
if the >> free space is larger than 70%, it will decrease heap size. >> But why I see the young generation is 200mb while old is 4gb. does the >> adjustment of young related to old generation? >> I read in >> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young >> generation should be less than 512MB, is it correct? >> >> >> >> On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: >> >>> I recommend Charlie's excellent book as well. >>> >>> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold >>> (henceforth MTT), >>> but in order to allow objects to age you also need sufficiently large >>> survivor spaces to hold >>> them for however long you wish, otherwise the adaptive tenuring policy >>> will adjust the >>> "current" tenuring threshold so as to prevent overflow. That may be what >>> you saw. >>> Check out the info printed by +PrintTenuringThreshold. >>> >>> -- ramki >>> >>> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: >>> >>>> hi all >>>> I have an application that generating many large objects and then >>>> discard them. I found that full gc can free memory from 70% to 40%. >>>> I want to let this objects in young generation longer. I found >>>> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >>>> But I found a blog that says MaxTenuringThreshold is not used in >>>> ParNewGC. >>>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >>>> seems no difference. >>>> >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> >>>> >>> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/243309d6/attachment-0001.html From ysr1729 at gmail.com Wed Jan 11 01:00:59 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 11 Jan 2012 01:00:59 -0800 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder wrote: > ... > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Together with your and Chi-Ho's conclusion that this is possibly related to paging, a question to ponder is why this happens only after a week. Since your process' heap size is presumably fixed and you have seen multiple full GC's (from which i assume that your heap's pages have all been touched), have you checked to see if the size of either this process (i.e. its native size) or of another process on the machine has grown during the week so that you start swapping? I also find it interesting that you state that whenever you see this problem there's always a single block in the old gen, and that the problem seems to go away when there are more than one block in the old gen. That would seem to throw out the paging theory, and point the finger of suspicion to some kind of bottleneck in the allocation out of a large block. 
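(A rough way to check the swapping part, assuming a Linux machine with the sysstat package installed, as discussed earlier in the thread: run

vmstat 5

while the slow scavenges occur and watch the si/so columns, or look at the swapping counters with

sar -W

Non-zero swap-in/swap-out activity around the long pauses would support the paging theory; all zeroes would point back at the allocator.)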
You also state that you do a compacting collection every night, but the bad behaviour sets in only after a week. So let me ask you if you see that the slow scavenge happens to be the first scavenge after a full gc, or does the condition persist for a long time and is independent if whether a full gc has happened recently? Try turning on -XX:+PrintOldPLAB to see if it sheds any light... -- ramki -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/0d137518/attachment.html From fancyerii at gmail.com Wed Jan 11 01:24:02 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 17:24:02 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: the log is too large to post here I just post some lines here. I grep the lines that gc time is larger than 100ms. the question is: at the beginning, young generation is about 50M. but after running a while, the memory is growing to 1.8GB. 1.75GB is Eden and 0.2G is s0 and s1. e.g. [GC [ParNew: 1843200K->204800K(1843200K), 0.2584570 secs] it is clear that the eden is 1843200K(1.75G), s0 is 0.2G. before young gc, eden are all used. after gc, s1 is all used(other live object are moved to old generation) 2012-01-10T18:26:45.992+0800: [GC [ParNew: 58732K->6528K(59072K), 0.1234300 secs] 1391982K->1375194K(1707564K), 0.1234900 secs] [Times: user=1.44 sys=0.02, real=0.12 secs] 2012-01-10T18:26:47.185+0800: [GC [ParNew: 59072K->6528K(59072K), 0.1335480 secs] 1507767K->1490151K(2340184K), 0.1336020 secs] [Times: user=1.60 sys=0.01, real=0.13 secs] 2012-01-10T18:26:56.605+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0992650 secs] 1523647K->1509678K(2522312K), 0.0993220 secs] [Times: user=1.22 sys=0.01, real=0.10 secs] 2012-01-10T18:26:57.395+0800: [GC [ParNew: 52998K->6528K(59072K), 0.1948650 secs] 1556149K->1544918K(2522312K), 0.1949120 secs] [Times: user=2.46 sys=0.01, real=0.19 secs] 2012-01-10T18:27:05.072+0800: [GC [ParNew: 38463K->6528K(59072K), 0.1571700 secs] 2449032K->2447103K(2864820K), 0.1572150 secs] [Times: user=1.98 sys=0.02, real=0.16 secs] 2012-01-10T18:27:06.220+0800: [GC [ParNew: 59072K->6528K(59072K), 0.1641610 secs] 2499647K->2483866K(2864820K), 0.1642060 secs] [Times: user=2.07 sys=0.01, real=0.17 secs] 2012-01-10T22:24:08.939+0800: [GC [ParNew: 1826901K->204800K(1843200K), 0.1418510 secs] 3923985K->2352398K(7987200K), 0.1420700 secs] [Times: user=1.59 sys=0.05, real=0.14 secs] 2012-01-10T22:24:09.343+0800: [GC [ParNew: 1843200K->175652K(1843200K), 0.1994980 secs] 3990798K->2536312K(7987200K), 0.1996880 secs] [Times: user=1.98 sys=0.02, real=0.20 secs] 2012-01-10T22:24:10.049+0800: [GC [ParNew: 1814052K->151709K(1843200K), 0.1409050 secs] 4174712K->2618929K(7987200K), 0.1410940 secs] [Times: user=1.51 sys=0.00, real=0.14 secs] 2012-01-10T22:24:11.015+0800: [GC [ParNew: 1843200K->204800K(1843200K), 0.2584570 secs] 4311783K->2831783K(7987200K), 0.2586440 secs] [Times: user=2.83 sys=0.00, real=0.26 secs] 2012-01-10T22:24:11.543+0800: [GC [ParNew: 1843200K->188261K(1843200K), 0.2356920 secs] 4470183K->3028255K(7987200K), 0.2358800 secs] [Times: user=2.41 sys=0.01, real=0.24 secs] On Wed, Jan 11, 2012 at 4:24 PM, Kirk Pepperdine wrote: > > On 2012-01-11, at 9:06 AM, Li Li wrote: > > I understand the first one. > as for Xmx, when it reach the maxium 8GB, the young generation is in deed > 1.8G and Eden:s0:s1=8:1:1. That's correct. > but when I restart it for a few minutes. 
old is 4GB while young is > 200-300MB > > > Right, ratios are adaptive and if you're using CMS, will require a full GC > to occur before they will adapt. Size will start off small and then get > bigger as needed. > > I don't think there is memory leak because it has running for more than a > month without OOV. > My application is using lucene+solr to provide search service which need > large memory. > > > Well, if memory use stabilizes than you don't have a leak. But I'd need to > see a GC log to give you better advice. All I can say is that the more > switches you touch the more you've got to understand about how things work > in order to make effective changes. I generally start with minimal switch > settings and then adjust as needed. Starting with a ratio is better than > starting with a fixed value. If the ratio isn't working for you then moved > to a fixed size. But use the data in the gc log to tell you how to proceed. > > Also, if your application is swapping during GC you will increase the > duration of the collection. You need to monitor system level activity as > part of the investigation. > > Regards, > Kirk > > > On Wed, Jan 11, 2012 at 3:55 PM, Kirk Pepperdine < > kirk.pepperdine at gmail.com> wrote: > >> >> On 2012-01-11, at 8:47 AM, Li Li wrote: >> >> 1. I don't understand why tenuring thresholds are >> calculated to be 1 >> >> >> because the number of expected survivors exceeds the size of the survivor >> space >> >> 2. I don't set Xms, I just set Xmx=8g >> >> >> with a new ratio of 3.. you should have 2 gigs of young gen meaning a .2 >> gigs for each survivor space and 1.6 for young gen. Do you have a GC log >> you can use to confirm these values? If not try visualvm and this plugin >> should give you a clear view (www.java.net/projects/memorypoolview). >> >> >> 3. as for memory leak, I will try to find it. >> >> On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: >> >>> Hi Li LI, >>> >>> I fear that you are off in the wrong direction. Resetting tenuring >>> thresholds in this case will never work because they are being calculated >>> to be 1. You're suggesting numbers greater than 1 and so 1 will always be >>> used which explains why you're not seeing a difference between runs. Having >>> a calculated tenuring threshold set to 1 implies that the memory pool is >>> too small. If the a memory pool is too small the only thing you can do to >>> fix that is to make it bigger. In this case, your young generational space >>> (as I've indicated in previous postings) is too small. Also, the cost of a >>> young generational collection is dependent mostly upon the number of >>> surviving objects, not dead ones. Pooling temporary objects will only make >>> the problem worse. If I recall your flag settings, you've set netsize to a >>> fixed value. That setting will override the the new ratio setting. You also >>> set Xmx==Xms and that also override adaptive sizing. Also you are using CMS >>> which is inherently not size adaptable. >>> >>> Last point, and this is the biggest one. The numbers you're publishing >>> right now suggest that you have a memory leak. There is no way you're going >>> to stabilize the memory /gc behaviour with a memory leak. Things will get >>> progressively worse as you consume more and more heap. This is a blocking >>> issue to all tuning efforts. It is the first thing that must be dealt with. 
>>> >>> To find the leak; >>> Identify the leaking object useing VisualVM's memory profiler with >>> generational counts and collect allocation stack traces turned on. Sort the >>> profile by generational counts. When you've identified the leaking object, >>> the domain class with the highest and always increasing generational count. >>> take an allocation stack trace snapshot and a heap dump. The heap dump >>> should be loaded into a heap walker. Use the knowledge gained from >>> generational counts to inspect the linkages for the leaking object and then >>> use that information in the allocation stack traces to identify casual >>> execution paths for creation. After that, it's into application code to >>> determine the fix. >>> >>> Kind regards, >>> Kirk Pepperdine >>> >>> On 2012-01-11, at 5:45 AM, Li Li wrote: >>> >>> if the young generation is too small that it can't afford space for >>> survivors and it have to throw them to old generation. and jvm found this, >>> it will turn down TenuringThreshold ? >>> I set TenuringThreshold to 10. and found that the full gc is less >>> frequent and every full gc collect less garbage. it seems the parameter >>> have the effect. But I found the load average is up and young gc time is >>> much more than before. And the response time is also increased. >>> I guess that there are more objects in young generation. so it have >>> to do more young gc. although they are garbage, it's not a good idea to >>> collect them too early. because ParNewGC will stop the world, the response >>> time is increasing. >>> So I adjust TenuringThreshold to 3 and there are no remarkable >>> difference. >>> maybe I should use object pool for my application because it use many >>> large temporary objects. >>> Another question, when my application runs for about 1-2 days. I >>> found the response time increases. I guess it's the problem of large young >>> generation. >>> in the beginning, the total memory usage is about 4-5GB and young >>> generation is 100-200MB, the rest is old generation. >>> After running for days, the total memory usage is 8GB and young >>> generation is about 2GB(I set new Ration 1:3) >>> I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation >>> and ?XX:MaxHeapFreeRation >>> the default value is 40 and 70. the memory manage white paper says if >>> the total heap free space is less than 40%, it will increase heap. if the >>> free space is larger than 70%, it will decrease heap size. >>> But why I see the young generation is 200mb while old is 4gb. does >>> the adjustment of young related to old generation? >>> I read in >>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young >>> generation should be less than 512MB, is it correct? >>> >>> >>> >>> On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna >> > wrote: >>> >>>> I recommend Charlie's excellent book as well. >>>> >>>> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold >>>> (henceforth MTT), >>>> but in order to allow objects to age you also need sufficiently large >>>> survivor spaces to hold >>>> them for however long you wish, otherwise the adaptive tenuring policy >>>> will adjust the >>>> "current" tenuring threshold so as to prevent overflow. That may be >>>> what you saw. >>>> Check out the info printed by +PrintTenuringThreshold. 
>>>> >>>> -- ramki >>>> >>>> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: >>>> >>>>> hi all >>>>> I have an application that generating many large objects and then >>>>> discard them. I found that full gc can free memory from 70% to 40%. >>>>> I want to let this objects in young generation longer. I found >>>>> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >>>>> But I found a blog that says MaxTenuringThreshold is not used in >>>>> ParNewGC. >>>>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >>>>> seems no difference. >>>>> >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> >>>>> >>>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> >>> >> >> > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/325a0052/attachment-0001.html From fancyerii at gmail.com Wed Jan 11 01:32:30 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 17:32:30 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB What's the logic behind this? 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/3777d402/attachment.html From java at java4.info Wed Jan 11 01:45:28 2012 From: java at java4.info (Florian Binder) Date: Wed, 11 Jan 2012 10:45:28 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0D5A38.1090906@java4.info> I do not know why it has worked for a week. Maybe it is because this was the xmas week ;-) In the night there are a lot of disk operations (2 TB of data is written). 
Therefore the operating system caches a lot of files and tries to free memory for this, so unused pages are moved to swap space. I assume heap fragmentation avoids swapping, since more pages are touched during the application is running. After a compacting gc there is one large (free) block which is not touched until young gc copies the objects from eden space. This will yield the operating system to move the pages of this one free block to swap and at every young gc it has to read it from swap. After a CMS collection the following young gcs are much faster because the gaps in the heap are not swapped. Yesterday, we have turned off the swap on this machine and now all young gcs take less than 200ms (instead of 6s) :-) Thanks againt to Chi Ho Kwok for giving the key hint :-) Flo Am 11.01.2012 10:00, schrieb Srinivas Ramakrishna: > > > On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder > wrote: > > ... > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > > Together with your and Chi-Ho's conclusion that this is possibly > related to paging, > a question to ponder is why this happens only after a week. Since your > process' > heap size is presumably fixed and you have seen multiple full GC's > (from which > i assume that your heap's pages have all been touched), have you > checked to > see if the size of either this process (i.e. its native size) or of > another process > on the machine has grown during the week so that you start swapping? > > I also find it interesting that you state that whenever you see this > problem > there's always a single block in the old gen, and that the problem > seems to go > away when there are more than one block in the old gen. That would seem > to throw out the paging theory, and point the finger of suspicion to > some kind > of bottleneck in the allocation out of a large block. You also state > that you > do a compacting collection every night, but the bad behaviour sets in only > after a week. > > So let me ask you if you see that the slow scavenge happens to be the > first > scavenge after a full gc, or does the condition persist for a long > time and > is independent if whether a full gc has happened recently? > > Try turning on -XX:+PrintOldPLAB to see if it sheds any light... > > -- ramki -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/19dc97a7/attachment.html From taras.tielkes at gmail.com Wed Jan 11 03:00:22 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Wed, 11 Jan 2012 12:00:22 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F06A270.3010701@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> Message-ID: Hi Jon, We've added the -XX:+PrintPromotionFailure flag to our production application yesterday. The application is running on 4 (homogenous) nodes. 
In the gc logs of 3 out of 4 nodes, I've found a promotion failure event during ParNew: node-002 ------- 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: 357592K->23382K(368640K), 0.0298150 secs] 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 sys=0.01, real=0.03 secs] 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: 351062K->39795K(368640K), 0.0401170 secs] 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 sys=0.00, real=0.04 secs] 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: promotion failure size = 4281460) (promotion failed): 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 secs] [Times: user=5.10 sys=0.00, real=4.84 secs] 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] node-003 ------- 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: 346950K->21342K(368640K), 0.0333090 secs] 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 sys=0.00, real=0.03 secs] 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: 345070K->32211K(368640K), 0.0369260 secs] 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: promotion failure size = 1266955) (promotion failed): 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 secs] [Times: user=5.03 sys=0.00, real=4.71 secs] 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] node-004 ------- 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: 358429K->40960K(368640K), 0.0629910 secs] 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 sys=0.02, real=0.06 secs] 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: 368640K->40960K(368640K), 0.0819780 secs] 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 sys=0.00, real=0.08 secs] 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: promotion failure size = 2788662) (promotion failed): 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: 3310044K->330922K(4833280K), 4.5104170 secs] 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), 0.0391790 secs] [Times: user=0.18 sys=0.00, 
real=0.04 secs] On a fourth node, I've found a different event: promotion failure during CMS, with a much smaller size: node-001 ------- 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: 354039K->40960K(368640K), 0.0667340 secs] 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 sys=0.01, real=0.06 secs] 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: 368640K->40960K(368640K), 0.2586390 secs] 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 sys=0.13, real=0.26 secs] 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: user=0.07 sys=0.00, real=0.07 secs] 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-abortable-preclean-start] 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: 368640K->40960K(368640K), 0.1214420 secs] 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 sys=0.05, real=0.12 secs] 2012-01-10T18:30:12.785+0100: 48434.078: [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: user=10.72 sys=0.48, real=2.70 secs] 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: [ParNew (promotion failure size = 1026) (promotion failed): 206521K->206521K(368640K), 0.1667280 secs] 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 sys=0.04, real=0.17 secs] 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 secs]48434.474: [scrub symbol & string tables, 0.0088370 secs] [1 CMS-remark: 3489675K(4833280K)] 36961 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 secs] 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 secs] 2873988K->334385K(5201920K), [CMS Perm : 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 sys=0.00, real=8.61 secs] 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] I assume that the large sizes for the promotion failures during ParNew are confirming that eliminating large array allocations might help here. Do you agree? I'm not sure what to make of the concurrent mode failure. Thanks in advance for any suggestions, Taras On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu wrote: > > > On 1/5/2012 3:32 PM, Taras Tielkes wrote: >> Hi Jon, >> >> We've enabled the PrintPromotionFailure flag in our preprod >> environment, but so far, no failures yet. 
>> We know that the load we generate there is not representative. But >> perhaps we'll catch something, given enough patience. >> >> The flag will also be enabled in our production environment next week >> - so one way or the other, we'll get more diagnostic data soon. >> I'll also do some allocation profiling of the application in isolation >> - I know that there is abusive large byte[] and char[] allocation in >> there. >> >> I've got two questions for now: >> >> 1) From googling around on the output to expect >> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >> I see that -XX:+PrintPromotionFailure will generate output like this: >> ------- >> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >> failed): 135865K->134943K(138240K), 0.1433555 secs] >> ------- >> In that example line, what does the "0" stand for? > > It's the index of the GC worker thread ?that experienced the promotion > failure. > >> 2) Below is a snippet of (real) gc log from our production application: >> ------- >> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >> 345951K->40960K(368640K), 0.0676780 secs] >> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >> sys=0.01, real=0.06 secs] >> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >> 368640K->40959K(368640K), 0.0618880 secs] >> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >> sys=0.00, real=0.06 secs] >> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >> user=0.04 sys=0.00, real=0.04 secs] >> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] >> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >> 2011-12-30T22:42:24.099+0100: 2136593.001: >> [CMS-concurrent-abortable-preclean-start] >> ? CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >> [Times: user=5.70 sys=0.23, real=5.23 secs] >> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >> 3432839K->3423755K(5201920 >> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >> refs processing, 0.0034280 secs]2136605.804: [class unloading, >> 0.0289480 secs]2136605.833: [scrub symbol& ?string tables, 0.0093940 >> secs] [1 CMS-remark: 3318289K(4833280K >> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >> real=7.61 secs] >> 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] >> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >> ? 
(concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >> secs] 3491471K->291853K(5201920K), [CMS Perm : >> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >> sys=0.00, real=10.29 secs] >> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >> ------- >> >> In this case I don't know how to interpret the output. >> a) There's a promotion failure that took 7.49 secs > This is the time it took to attempt the minor collection (ParNew) and to > do recovery > from the failure. > >> b) There's a full GC that took 14.08 secs >> c) There's a concurrent mode failure that took 10.29 secs > > Not sure about b) and c) because the output is mixed up with the > concurrent-sweep > output but ?I think the "concurrent mode failure" message is part of the > "Full GC" > message. ?My guess is that the 10.29 is the time for the Full GC and the > 14.08 > maybe is part of the concurrent-sweep message. ?Really hard to be sure. > > Jon >> How are these events, and their (real) times related to each other? >> >> Thanks in advance, >> Taras >> >> On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu ?wrote: >>> Taras, >>> >>> PrintPromotionFailure seems like it would go a long >>> way to identify the root of your promotion failures (or >>> at least eliminating some possible causes). ? ?I think it >>> would help focus the discussion if you could send >>> the result of that experiment early. >>> >>> Jon >>> >>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>> Hi, >>>> >>>> We're running an application with the CMS/ParNew collectors that is >>>> experiencing occasional promotion failures. >>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>> I've listed the specific JVM options used below (a). >>>> >>>> The application is deployed across a handful of machines, and the >>>> promotion failures are fairly uniform across those. >>>> >>>> The first kind of failure we observe is a promotion failure during >>>> ParNew collection, I've included a snipped from the gc log below (b). >>>> The second kind of failure is a concurrrent mode failure (perhaps >>>> triggered by the same cause), see (c) below. >>>> The frequency (after running for a some weeks) is approximately once >>>> per day. This is bearable, but obviously we'd like to improve on this. >>>> >>>> Apart from high-volume request handling (which allocates a lot of >>>> small objects), the application also runs a few dozen background >>>> threads that download and process XML documents, typically in the 5-30 >>>> MB range. >>>> A known deficiency in the existing code is that the XML content is >>>> copied twice before processing (once to a byte[], and later again to a >>>> String/char[]). >>>> Given that a 30 MB XML stream will result in a 60 MB >>>> java.lang.String/char[], my suspicion is that these big array >>>> allocations are causing us to run into the CMS fragmentation issue. >>>> >>>> My questions are: >>>> 1) Does the data from the GC logs provide sufficient evidence to >>>> conclude that CMS fragmentation is the cause of the promotion failure? >>>> 2) If not, what's the next step of investigating the cause? 
>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >>>> feeling for the size of the objects that fail promotion. >>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>> case? >>>> >>>> Thanks in advance, >>>> Taras >>>> >>>> a) Current JVM options: >>>> -------------------------------- >>>> -server >>>> -Xms5g >>>> -Xmx5g >>>> -Xmn400m >>>> -XX:PermSize=256m >>>> -XX:MaxPermSize=256m >>>> -XX:+PrintGCTimeStamps >>>> -verbose:gc >>>> -XX:+PrintGCDateStamps >>>> -XX:+PrintGCDetails >>>> -XX:SurvivorRatio=8 >>>> -XX:+UseConcMarkSweepGC >>>> -XX:+UseParNewGC >>>> -XX:+DisableExplicitGC >>>> -XX:+UseCMSInitiatingOccupancyOnly >>>> -XX:+CMSClassUnloadingEnabled >>>> -XX:+CMSScavengeBeforeRemark >>>> -XX:CMSInitiatingOccupancyFraction=68 >>>> -Xloggc:gc.log >>>> -------------------------------- >>>> >>>> b) Promotion failure during ParNew >>>> -------------------------------- >>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>> 368640K->40959K(368640K), 0.0693460 secs] >>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>> sys=0.01, real=0.07 secs] >>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>> 368639K->31321K(368640K), 0.0511400 secs] >>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>> sys=0.00, real=0.05 secs] >>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>> 359001K->18694K(368640K), 0.0272970 secs] >>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>> sys=0.00, real=0.03 secs] >>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>> 3505808K->434291K >>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>> -------------------------------- >>>> >>>> c) Promotion failure during CMS >>>> -------------------------------- >>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>> 357228K->40960K(368640K), 0.0525110 secs] >>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>> sys=0.00, real=0.05 secs] >>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>> 366075K->37119K(368640K), 0.0479780 secs] >>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>> sys=0.01, real=0.05 secs] >>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>> 364792K->40960K(368640K), 0.0421740 secs] >>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>> sys=0.00, real=0.04 secs] >>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] 
[Times: >>>> user=0.02 sys=0.00, real=0.03 secs] >>>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>> 368640K->40960K(368640K), 0.0836690 secs] >>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>> sys=0.01, real=0.08 secs] >>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>> [CMS-concurrent-abortable-preclean-start] >>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>> user=6.68 sys=0.27, real=1.40 secs] >>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >>>> ? ?3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>> sys=2.58, real=9.88 secs] >>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>> secs]703031.419: [scrub symbol& ? ?string tables, 0.0094960 secs] [1 CMS >>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>> ? 
?(concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>> sys=0.00, real=10.96 secs] >>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>> -------------------------------- >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From kirk.pepperdine at gmail.com Tue Jan 10 23:55:34 2012 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Wed, 11 Jan 2012 08:55:34 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: On 2012-01-11, at 8:47 AM, Li Li wrote: > 1. I don't understand why tenuring thresholds are > calculated to be 1 because the number of expected survivors exceeds the size of the survivor space > 2. I don't set Xms, I just set Xmx=8g with a new ratio of 3.. you should have 2 gigs of young gen meaning a .2 gigs for each survivor space and 1.6 for young gen. Do you have a GC log you can use to confirm these values? If not try visualvm and this plugin should give you a clear view (www.java.net/projects/memorypoolview). > 3. as for memory leak, I will try to find it. > > On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: > Hi Li LI, > > I fear that you are off in the wrong direction. Resetting tenuring thresholds in this case will never work because they are being calculated to be 1. You're suggesting numbers greater than 1 and so 1 will always be used which explains why you're not seeing a difference between runs. Having a calculated tenuring threshold set to 1 implies that the memory pool is too small. If the a memory pool is too small the only thing you can do to fix that is to make it bigger. In this case, your young generational space (as I've indicated in previous postings) is too small. Also, the cost of a young generational collection is dependent mostly upon the number of surviving objects, not dead ones. Pooling temporary objects will only make the problem worse. If I recall your flag settings, you've set netsize to a fixed value. That setting will override the the new ratio setting. You also set Xmx==Xms and that also override adaptive sizing. Also you are using CMS which is inherently not size adaptable. 
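The arithmetic behind those numbers, as a rough sketch (assuming -Xmx8g, -XX:NewRatio=3 and the usual ParNew default of -XX:SurvivorRatio=8):
-------
young    = heap / (NewRatio + 1)        = 8g / 4   = 2g
survivor = young / (SurvivorRatio + 2)  = 2g / 10  = 0.2g   (two survivor spaces)
eden     = young - 2 * survivor         = 2g - 0.4g = 1.6g
-------
Whether these are the sizes actually in use is best confirmed from a GC log with -XX:+PrintGCDetails or from the memory pool view mentioned above. The heap growing and shrinking asked about earlier in the thread is a separate knob: that is governed by -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio (note the spelling), which affect the committed heap size rather than the young/old split.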
> > Last point, and this is the biggest one. The numbers you're publishing right now suggest that you have a memory leak. There is no way you're going to stabilize the memory /gc behaviour with a memory leak. Things will get progressively worse as you consume more and more heap. This is a blocking issue to all tuning efforts. It is the first thing that must be dealt with. > > To find the leak; > Identify the leaking object useing VisualVM's memory profiler with generational counts and collect allocation stack traces turned on. Sort the profile by generational counts. When you've identified the leaking object, the domain class with the highest and always increasing generational count. take an allocation stack trace snapshot and a heap dump. The heap dump should be loaded into a heap walker. Use the knowledge gained from generational counts to inspect the linkages for the leaking object and then use that information in the allocation stack traces to identify casual execution paths for creation. After that, it's into application code to determine the fix. > > Kind regards, > Kirk Pepperdine > > On 2012-01-11, at 5:45 AM, Li Li wrote: > >> if the young generation is too small that it can't afford space for survivors and it have to throw them to old generation. and jvm found this, it will turn down TenuringThreshold ? >> I set TenuringThreshold to 10. and found that the full gc is less frequent and every full gc collect less garbage. it seems the parameter have the effect. But I found the load average is up and young gc time is much more than before. And the response time is also increased. >> I guess that there are more objects in young generation. so it have to do more young gc. although they are garbage, it's not a good idea to collect them too early. because ParNewGC will stop the world, the response time is increasing. >> So I adjust TenuringThreshold to 3 and there are no remarkable difference. >> maybe I should use object pool for my application because it use many large temporary objects. >> Another question, when my application runs for about 1-2 days. I found the response time increases. I guess it's the problem of large young generation. >> in the beginning, the total memory usage is about 4-5GB and young generation is 100-200MB, the rest is old generation. >> After running for days, the total memory usage is 8GB and young generation is about 2GB(I set new Ration 1:3) >> I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation and ?XX:MaxHeapFreeRation >> the default value is 40 and 70. the memory manage white paper says if the total heap free space is less than 40%, it will increase heap. if the free space is larger than 70%, it will decrease heap size. >> But why I see the young generation is 200mb while old is 4gb. does the adjustment of young related to old generation? >> I read in http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young generation should be less than 512MB, is it correct? >> >> >> >> On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: >> I recommend Charlie's excellent book as well. >> >> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold (henceforth MTT), >> but in order to allow objects to age you also need sufficiently large survivor spaces to hold >> them for however long you wish, otherwise the adaptive tenuring policy will adjust the >> "current" tenuring threshold so as to prevent overflow. That may be what you saw. 
>> Check out the info printed by +PrintTenuringThreshold.
>>
>> -- ramki
>>
>> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote:
>> hi all
>> I have an application that generating many large objects and then discard them. I found that full gc can free memory from 70% to 40%.
>> I want to let this objects in young generation longer. I found -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold.
>> But I found a blog that says MaxTenuringThreshold is not used in ParNewGC.
>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it seems no difference.
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From kirk.pepperdine at gmail.com Wed Jan 11 01:48:36 2012
From: kirk.pepperdine at gmail.com (Kirk Pepperdine)
Date: Wed, 11 Jan 2012 10:48:36 +0100
Subject: MaxTenuringThreshold available in ParNewGC?
In-Reply-To: 
References: 
Message-ID: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com>

CMS is not adaptive. To reconfigure the heap, for many reasons, you need a full GC to occur. The response to a concurrent mode failure is always a full GC, and that full GC gave the JVM the opportunity to resize the heap. If this behaviour isn't happening when it should, or is causing other problems, it's time to either set the young gen size directly with NewSize or switch to the parallel collector with the adaptive sizing policy turned on. The logic here is: if you want to avoid long pauses, use CMS; if CMS is giving you long pauses, then the parallel collector might be a better choice.

Regards,
Kirk

On 2012-01-11, at 10:32 AM, Li Li wrote:

> after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB
> What's the logic behind this?
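If the goal is to stop the young generation from swinging in size at a full GC, pinning it explicitly is the usual approach; a hedged sketch of the flags (the 2g figure is only a placeholder to show the syntax, not a recommendation):
-------
-Xmn2g        (shorthand for -XX:NewSize=2g -XX:MaxNewSize=2g)
-------
The alternative route mentioned above would be along the lines of -XX:+UseParallelGC -XX:+UseParallelOldGC, where -XX:+UseAdaptiveSizePolicy is on by default and the collector resizes the generations itself.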
> > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From kirk at kodewerk.com Tue Jan 10 23:18:04 2012 From: kirk at kodewerk.com (Kirk Pepperdine) Date: Wed, 11 Jan 2012 08:18:04 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: Hi Li LI, I fear that you are off in the wrong direction. Resetting tenuring thresholds in this case will never work because they are being calculated to be 1. You're suggesting numbers greater than 1 and so 1 will always be used which explains why you're not seeing a difference between runs. Having a calculated tenuring threshold set to 1 implies that the memory pool is too small. If the a memory pool is too small the only thing you can do to fix that is to make it bigger. In this case, your young generational space (as I've indicated in previous postings) is too small. Also, the cost of a young generational collection is dependent mostly upon the number of surviving objects, not dead ones. Pooling temporary objects will only make the problem worse. If I recall your flag settings, you've set netsize to a fixed value. That setting will override the the new ratio setting. You also set Xmx==Xms and that also override adaptive sizing. Also you are using CMS which is inherently not size adaptable. Last point, and this is the biggest one. The numbers you're publishing right now suggest that you have a memory leak. There is no way you're going to stabilize the memory /gc behaviour with a memory leak. Things will get progressively worse as you consume more and more heap. This is a blocking issue to all tuning efforts. It is the first thing that must be dealt with. To find the leak; Identify the leaking object useing VisualVM's memory profiler with generational counts and collect allocation stack traces turned on. Sort the profile by generational counts. When you've identified the leaking object, the domain class with the highest and always increasing generational count. take an allocation stack trace snapshot and a heap dump. The heap dump should be loaded into a heap walker. 
Use the knowledge gained from generational counts to inspect the linkages for the leaking object and then use that information in the allocation stack traces to identify casual execution paths for creation. After that, it's into application code to determine the fix. Kind regards, Kirk Pepperdine On 2012-01-11, at 5:45 AM, Li Li wrote: > if the young generation is too small that it can't afford space for survivors and it have to throw them to old generation. and jvm found this, it will turn down TenuringThreshold ? > I set TenuringThreshold to 10. and found that the full gc is less frequent and every full gc collect less garbage. it seems the parameter have the effect. But I found the load average is up and young gc time is much more than before. And the response time is also increased. > I guess that there are more objects in young generation. so it have to do more young gc. although they are garbage, it's not a good idea to collect them too early. because ParNewGC will stop the world, the response time is increasing. > So I adjust TenuringThreshold to 3 and there are no remarkable difference. > maybe I should use object pool for my application because it use many large temporary objects. > Another question, when my application runs for about 1-2 days. I found the response time increases. I guess it's the problem of large young generation. > in the beginning, the total memory usage is about 4-5GB and young generation is 100-200MB, the rest is old generation. > After running for days, the total memory usage is 8GB and young generation is about 2GB(I set new Ration 1:3) > I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation and ?XX:MaxHeapFreeRation > the default value is 40 and 70. the memory manage white paper says if the total heap free space is less than 40%, it will increase heap. if the free space is larger than 70%, it will decrease heap size. > But why I see the young generation is 200mb while old is 4gb. does the adjustment of young related to old generation? > I read in http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young generation should be less than 512MB, is it correct? > > > > On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: > I recommend Charlie's excellent book as well. > > To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold (henceforth MTT), > but in order to allow objects to age you also need sufficiently large survivor spaces to hold > them for however long you wish, otherwise the adaptive tenuring policy will adjust the > "current" tenuring threshold so as to prevent overflow. That may be what you saw. > Check out the info printed by +PrintTenuringThreshold. > > -- ramki > > On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: > hi all > I have an application that generating many large objects and then discard them. I found that full gc can free memory from 70% to 40%. > I want to let this objects in young generation longer. I found -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used in ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it seems no difference. 
> > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/20772300/attachment-0001.html From jon.masamitsu at oracle.com Wed Jan 11 08:54:28 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Wed, 11 Jan 2012 08:54:28 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> Message-ID: <4F0DBEC4.7040907@oracle.com> Taras, > I assume that the large sizes for the promotion failures during ParNew > are confirming that eliminating large array allocations might help > here. Do you agree? I agree that eliminating the large array allocation will help but you are still having promotion failures when the allocation size is small (I think it was 1026). That says that you are filling up the old (cms) generation faster than the GC can collect it. The large arrays are aggrevating the problem but not necessarily the cause. If these are still your heap sizes, > -Xms5g > -Xmx5g > -Xmn400m Start by increasing the young gen size as may already have been suggested. If you have a test setup where you can experiment, try doubling the young gen size to start. If you have not seen this, it might be helpful. http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a > I'm not sure what to make of the concurrent mode The concurrent mode failure is a consequence of the promotion failure. Once the promotion failure happens the concurrent mode failure is inevitable. Jon > . On 1/11/2012 3:00 AM, Taras Tielkes wrote: > Hi Jon, > > We've added the -XX:+PrintPromotionFailure flag to our production > application yesterday. > The application is running on 4 (homogenous) nodes. 
> > In the gc logs of 3 out of 4 nodes, I've found a promotion failure > event during ParNew: > > node-002 > ------- > 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: > 357592K->23382K(368640K), 0.0298150 secs] > 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 > sys=0.01, real=0.03 secs] > 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: > 351062K->39795K(368640K), 0.0401170 secs] > 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 > sys=0.00, real=0.04 secs] > 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: > promotion failure size = 4281460) (promotion failed): > 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: > 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K > ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 > secs] [Times: user=5.10 sys=0.00, real=4.84 secs] > 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: > 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), > 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] > 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: > 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), > 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] > > node-003 > ------- > 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: > 346950K->21342K(368640K), 0.0333090 secs] > 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 > sys=0.00, real=0.03 secs] > 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: > 345070K->32211K(368640K), 0.0369260 secs] > 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 > sys=0.00, real=0.04 secs] > 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: > promotion failure size = 1266955) (promotion failed): > 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: > 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 > 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 > secs] [Times: user=5.03 sys=0.00, real=4.71 secs] > 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: > 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), > 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] > 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: > 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), > 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] > > node-004 > ------- > 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: > 358429K->40960K(368640K), 0.0629910 secs] > 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 > sys=0.02, real=0.06 secs] > 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: > 368640K->40960K(368640K), 0.0819780 secs] > 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 > sys=0.00, real=0.08 secs] > 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: > promotion failure size = 2788662) (promotion failed): > 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: > 3310044K->330922K(4833280K), 4.5104170 secs] > 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], > 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] > 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: > 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), > 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] > 2012-01-10T18:59:25.183+0100: 50186.494: 
[GC 50186.494: [ParNew: > 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), > 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] > > On a fourth node, I've found a different event: promotion failure > during CMS, with a much smaller size: > > node-001 > ------- > 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: > 354039K->40960K(368640K), 0.0667340 secs] > 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 > sys=0.01, real=0.06 secs] > 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: > 368640K->40960K(368640K), 0.2586390 secs] > 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 > sys=0.13, real=0.26 secs] > 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: > 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: > user=0.07 sys=0.00, real=0.07 secs] > 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] > 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: > 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] > 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] > 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: > 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] > 2012-01-10T18:30:10.089+0100: 48431.382: > [CMS-concurrent-abortable-preclean-start] > 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: > 368640K->40960K(368640K), 0.1214420 secs] > 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 > sys=0.05, real=0.12 secs] > 2012-01-10T18:30:12.785+0100: 48434.078: > [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: > user=10.72 sys=0.48, real=2.70 secs] > 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K > (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: > [ParNew (promotion failure size = 1026) (promotion failed): > 206521K->206521K(368640K), 0.1667280 secs] > 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 > sys=0.04, real=0.17 secs] > 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs > processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 > secs]48434.474: [scrub symbol& string tables, 0.0088370 secs] [1 > CMS-remark: 3489675K(4833280K)] 36961 > 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 secs] > 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] > 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: > [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: > 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] > (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 > secs] 2873988K->334385K(5201920K), [CMS Perm : > 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 > sys=0.00, real=8.61 secs] > 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: > 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), > 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] > 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: > 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), > 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] > > I assume that the large sizes for the promotion failures during ParNew > are confirming that eliminating large array allocations might help > here. Do you agree? > I'm not sure what to make of the concurrent mode failure. 
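One way to observe old-generation fragmentation directly, rather than inferring it from promotion failures, is the free-list statistics flag already mentioned earlier in the thread; a hedged sketch (it only adds diagnostic output, it does not change collector behaviour):
-------
-XX:PrintFLSStatistics=1
-------
With it, the GC log entries include the CMS generation's total free space and maximum free chunk size; a maximum chunk that keeps shrinking relative to the total free space, and eventually drops below the sizes that fail promotion, is the classic signature of fragmentation rather than simple old-gen exhaustion.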
> > Thanks in advance for any suggestions, > Taras > > On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu wrote: >> >> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>> Hi Jon, >>> >>> We've enabled the PrintPromotionFailure flag in our preprod >>> environment, but so far, no failures yet. >>> We know that the load we generate there is not representative. But >>> perhaps we'll catch something, given enough patience. >>> >>> The flag will also be enabled in our production environment next week >>> - so one way or the other, we'll get more diagnostic data soon. >>> I'll also do some allocation profiling of the application in isolation >>> - I know that there is abusive large byte[] and char[] allocation in >>> there. >>> >>> I've got two questions for now: >>> >>> 1) From googling around on the output to expect >>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>> I see that -XX:+PrintPromotionFailure will generate output like this: >>> ------- >>> 592.079: [ParNew (0: promotion failure size = 2698) (promotion >>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>> ------- >>> In that example line, what does the "0" stand for? >> It's the index of the GC worker thread that experienced the promotion >> failure. >> >>> 2) Below is a snippet of (real) gc log from our production application: >>> ------- >>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>> 345951K->40960K(368640K), 0.0676780 secs] >>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>> sys=0.01, real=0.06 secs] >>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>> 368640K->40959K(368640K), 0.0618880 secs] >>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>> sys=0.00, real=0.06 secs] >>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>> user=0.04 sys=0.00, real=0.04 secs] >>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] >>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>> [CMS-concurrent-abortable-preclean-start] >>> CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>> 3432839K->3423755K(5201920 >>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>> 0.0289480 secs]2136605.833: [scrub symbol& string tables, 0.0093940 >>> secs] [1 CMS-remark: 3318289K(4833280K >>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>> real=7.61 secs] >>> 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] >>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>> 
12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>> (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>> sys=0.00, real=10.29 secs] >>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>> ------- >>> >>> In this case I don't know how to interpret the output. >>> a) There's a promotion failure that took 7.49 secs >> This is the time it took to attempt the minor collection (ParNew) and to >> do recovery >> from the failure. >> >>> b) There's a full GC that took 14.08 secs >>> c) There's a concurrent mode failure that took 10.29 secs >> Not sure about b) and c) because the output is mixed up with the >> concurrent-sweep >> output but I think the "concurrent mode failure" message is part of the >> "Full GC" >> message. My guess is that the 10.29 is the time for the Full GC and the >> 14.08 >> maybe is part of the concurrent-sweep message. Really hard to be sure. >> >> Jon >>> How are these events, and their (real) times related to each other? >>> >>> Thanks in advance, >>> Taras >>> >>> On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu wrote: >>>> Taras, >>>> >>>> PrintPromotionFailure seems like it would go a long >>>> way to identify the root of your promotion failures (or >>>> at least eliminating some possible causes). I think it >>>> would help focus the discussion if you could send >>>> the result of that experiment early. >>>> >>>> Jon >>>> >>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>> Hi, >>>>> >>>>> We're running an application with the CMS/ParNew collectors that is >>>>> experiencing occasional promotion failures. >>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>> I've listed the specific JVM options used below (a). >>>>> >>>>> The application is deployed across a handful of machines, and the >>>>> promotion failures are fairly uniform across those. >>>>> >>>>> The first kind of failure we observe is a promotion failure during >>>>> ParNew collection, I've included a snipped from the gc log below (b). >>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>> triggered by the same cause), see (c) below. >>>>> The frequency (after running for a some weeks) is approximately once >>>>> per day. This is bearable, but obviously we'd like to improve on this. >>>>> >>>>> Apart from high-volume request handling (which allocates a lot of >>>>> small objects), the application also runs a few dozen background >>>>> threads that download and process XML documents, typically in the 5-30 >>>>> MB range. >>>>> A known deficiency in the existing code is that the XML content is >>>>> copied twice before processing (once to a byte[], and later again to a >>>>> String/char[]). >>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>> java.lang.String/char[], my suspicion is that these big array >>>>> allocations are causing us to run into the CMS fragmentation issue. 
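Where the documents do not have to be materialized in full, a streaming parser keeps the per-document allocations small and short-lived instead of producing one multi-megabyte byte[] plus char[] per download; a rough sketch with StAX (javax.xml.stream, part of the JDK since 6) — the element name and the handler are made up for illustration:
-------
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingXmlSketch {

    // Processes one document without ever holding the full content as a
    // single byte[] or String, so no multi-megabyte array has to find a
    // large contiguous free block in the CMS old generation.
    public static void process(InputStream in) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) { // "record" is a made-up element name
                    handleRecord(reader.getElementText()); // small, short-lived String per element
                }
            }
        } finally {
            reader.close();
        }
    }

    private static void handleRecord(String text) {
        // application-specific processing goes here
    }
}
-------
Even where streaming is not an option, parsing directly from the response stream would avoid at least one of the two full copies described above.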
>>>>> >>>>> My questions are: >>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>> conclude that CMS fragmentation is the cause of the promotion failure? >>>>> 2) If not, what's the next step of investigating the cause? >>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >>>>> feeling for the size of the objects that fail promotion. >>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>> case? >>>>> >>>>> Thanks in advance, >>>>> Taras >>>>> >>>>> a) Current JVM options: >>>>> -------------------------------- >>>>> -server >>>>> -Xms5g >>>>> -Xmx5g >>>>> -Xmn400m >>>>> -XX:PermSize=256m >>>>> -XX:MaxPermSize=256m >>>>> -XX:+PrintGCTimeStamps >>>>> -verbose:gc >>>>> -XX:+PrintGCDateStamps >>>>> -XX:+PrintGCDetails >>>>> -XX:SurvivorRatio=8 >>>>> -XX:+UseConcMarkSweepGC >>>>> -XX:+UseParNewGC >>>>> -XX:+DisableExplicitGC >>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>> -XX:+CMSClassUnloadingEnabled >>>>> -XX:+CMSScavengeBeforeRemark >>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>> -Xloggc:gc.log >>>>> -------------------------------- >>>>> >>>>> b) Promotion failure during ParNew >>>>> -------------------------------- >>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>> sys=0.01, real=0.07 secs] >>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>> sys=0.00, real=0.05 secs] >>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>> sys=0.00, real=0.03 secs] >>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>> 3505808K->434291K >>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>> -------------------------------- >>>>> >>>>> c) Promotion failure during CMS >>>>> -------------------------------- >>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>> sys=0.00, real=0.05 secs] >>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>> sys=0.01, real=0.05 secs] >>>>> 2011-12-14T08:29:29.553+0100: 
703018.454: [GC 703018.455: [ParNew: >>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>> sys=0.00, real=0.04 secs] >>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>> sys=0.01, real=0.08 secs] >>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>> [CMS-concurrent-abortable-preclean-start] >>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >>>>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>> sys=2.58, real=9.88 secs] >>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>> secs]703031.419: [scrub symbol& string tables, 0.0094960 secs] [1 CMS >>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>> (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>> sys=0.00, real=10.96 secs] >>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>> -------------------------------- >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use 
>>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From fancyerii at gmail.com Thu Jan 12 04:09:28 2012 From: fancyerii at gmail.com (Li Li) Date: Thu, 12 Jan 2012 20:09:28 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> Message-ID: yesterday, we set the maxNewSize to 256mb. And it works as we expected. but an hours ago, there is a promotion failure and a concurrent mode failure which cost 14s! could anyone explain the gc logs for me? or any documents for the gc log format explanation? 1. Desired survivor size 3342336 bytes, new threshold 5 (max 5) it says survivor size is 3mb 2. 58282K->57602K(59072K), 0.0543930 secs] it says before young gc the memory used is 58282K, after young gc, there are 57602K live objects and the total young space is 59072K 3. (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] before old gc, 7.9GB is used. after old gc 3GB is alive. total old space is 7.9GB in which situation will occur promotion failure and concurrent mode failure? from http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ the author says when CMS is doing concurrent work and JVM is asked for more memory. if there isn't any space for new allocation. then it will occur concurrent mode failure and it will stop the world and do a serial old gc. if there exist enough space but they are fragemented, then a promotion failure will occur. am I right? 2012-01-12T18:27:32.582+0800: [GC [ParNew Desired survivor size 3342336 bytes, new threshold 1 (max 5) - age 1: 4594648 bytes, 4594648 total - age 2: 569200 bytes, 5163848 total : 58548K->5738K(59072K), 0.0159400 secs] 7958648K->7908502K(7984352K), 0.0160610 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] 2012-01-12T18:27:32.609+0800: [GC [ParNew (promotion failed) Desired survivor size 3342336 bytes, new threshold 5 (max 5) - age 1: 1666376 bytes, 1666376 total : 58282K->57602K(59072K), 0.0543930 secs][CMS2012-01-12T18:27:33.804+0800: [CMS-concurrent-preclean: 14.098/34.323 secs] [Times: user=370.28 sys=5.65, real=34.31 secs] (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] On Wed, Jan 11, 2012 at 5:48 PM, Kirk Pepperdine wrote: > CMS is not adaptive. To reconfigure heap, for many reasons, you need a > full GC to occur. The response to a concurrent mode failure is always a > full GC. That gave the JVM the opportunity to resize heap space. If this > behaviour isn't happening when it should or is cause other problems it's > time to either set the young gen size directly with NewSize or switch to > the parallel collector with the adaptive sizing policy turned on. 
Logic > here is that you want to avoid long pauses, use CMS. If CMS is giving you > long pauses, than the parallel collector might be a better choice. > > Regards, > Kirk > > On 2012-01-11, at 10:32 AM, Li Li wrote: > > > after a concurrent mode failure. the young generation changed from about > 50MB to 1.8GB > > What's the logic behind this? > > > > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), > 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: > user=0.20 sys=0.00, real=0.01 secs] > > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), > 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: > user=0.24 sys=0.00, real=0.02 secs] > > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): > 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: > [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, > real=28.24 secs] > > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 > secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], > 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] > > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), > 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: > user=0.26 sys=0.02, real=0.02 secs] > > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), > 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: > user=0.44 sys=0.04, real=0.04 secs] > > > > _______________________________________________ > > hotspot-gc-use mailing list > > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/586eb629/attachment.html From bartosz.markocki at gmail.com Thu Jan 12 05:56:38 2012 From: bartosz.markocki at gmail.com (Bartek Markocki) Date: Thu, 12 Jan 2012 14:56:38 +0100 Subject: How can we cut down those two CMS STW times? Message-ID: Hi all, We have a backend type of application which primary purpose is to cache user specific graphs of objects. The graphs are relatively small in size however the rate at which they can change (based on users' requests) is key here. Our main challenge was to figure out JVM settings that will handle the peak memory allocation at the level of 4.5GB/s. To make things a bit more challenging ;) we have a limited number of RAM on the box (as there are multiple applications co-located on the box and the box has just 64GB of RAM). After a decent amount of testing we figured out the following settings work for us: -Xms6G -Xmx6G -Xmn3G -Xss256k -XX:MaxPermSize=512m -XX:PermSize=512m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=8 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+CMSScavengeBeforeRemark -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled We run Java6 update 27 (64bit, server) on Solaris10. The above settings work for us with exception of one CMS-initial marks and one CMS-remark. By working I mean the STW is less than 1 second for any STW pause. There is one case when CMS-initial mark took 3.44 seconds. 
Here is the extract from the log showing this situation: 90516.053: [GC 90516.053: [ParNew: 2633949K->154547K(2831168K), 0.1729255 secs] 4874963K->2395755K(5976896K), 0.1734674 secs] [Times: user=3.07 sys=0.01, real=0.17 secs] 90516.846: [GC 90516.846: [ParNew: 2671155K->106975K(2831168K), 0.2183780 secs] 4906534K->2365720K(5976896K), 0.2188906 secs] [Times: user=3.62 sys=0.05, real=0.22 secs] 90517.684: [GC 90517.684: [ParNew: 2623583K->106936K(2831168K), 0.0690728 secs] 4833212K->2316857K(5976896K), 0.0695870 secs] [Times: user=1.20 sys=0.01, real=0.07 secs] 90517.976: [CMS-concurrent-sweep: 4.574/5.767 secs] [Times: user=121.01 sys=1.90, real=5.77 secs] 90517.976: [CMS-concurrent-reset-start] 90518.112: [CMS-concurrent-reset: 0.136/0.136 secs] [Times: user=2.76 sys=0.05, real=0.14 secs] 90520.117: [GC [1 CMS-initial-mark: 2209921K(3145728K)] 4768007K(5976896K), 3.4458003 secs] [Times: user=3.45 sys=0.00, real=3.45 secs] 90523.564: [CMS-concurrent-mark-start] 90523.623: [GC 90523.623: [ParNew: 2623544K->119747K(2831168K), 0.1848339 secs] 4833465K->2329818K(5976896K), 0.1853529 secs] [Times: user=3.29 sys=0.01, real=0.19 secs] 90526.087: [CMS-concurrent-mark: 2.314/2.523 secs] [Times: user=18.11 sys=0.18, real=2.52 secs] 90526.087: [CMS-concurrent-preclean-start] 90526.155: [CMS-concurrent-preclean: 0.058/0.068 secs] [Times: user=0.16 sys=0.00, real=0.07 secs] 90526.155: [CMS-concurrent-abortable-preclean-start] 90531.301: [GC 90531.301: [ParNew: 2636355K->45254K(2831168K), 0.0206247 secs] 4846426K->2255579K(5976896K), 0.0211745 secs] [Times: user=0.33 sys=0.00, real=0.02 secs] CMS: abort preclean due to time 90531.470: [CMS-concurrent-abortable-preclean: 5.271/5.315 secs] [Times: user=18.85 sys=0.26, real=5.32 secs] 90531.476: [GC[YG occupancy: 662977 K (2831168 K)]90531.476: [GC 90531.476: [ParNew: 662977K->21990K(2831168K), 0.0342782 secs] 2873302K->2232487K(5976896K), 0.0347927 secs] [Times: user=0.40 sys=0.01, real=0.04 secs] 90531.511: [Rescan (parallel) , 0.0074306 secs]90531.519: [weak refs processing, 0.0000864 secs]90531.519: [class unloading, 0.0350356 secs]90531.554: [scrub symbol & string tables, 0.0266258 secs] [1 CMS-remark: 2210497K(3145728K)] 2232487K(5976896K), 0.1197919 secs] [Times: user=0.59 sys=0.01, real=0.12 secs] 90531.597: [CMS-concurrent-sweep-start] 90532.212: [GC 90532.212: [ParNew: 2538598K->14216K(2831168K), 0.0162798 secs] 4744071K->2219741K(5976896K), 0.0167729 secs] [Times: user=0.26 sys=0.00, real=0.02 secs] 90532.865: [GC 90532.865: [ParNew: 2530824K->18587K(2831168K), 0.0192318 secs] 4732677K->2220478K(5976896K), 0.0197659 secs] [Times: user=0.31 sys=0.00, real=0.02 secs] 90533.500: [GC 90533.500: [ParNew: 2535195K->20886K(2831168K), 0.0206055 secs] 4731793K->2217494K(5976896K), 0.0211439 secs] [Times: user=0.33 sys=0.00, real=0.02 secs] Of course almost immediately one can notice that young generation was almost full during that time, so what happened should not be a surprise. After some googling I found that a similar topic was discussed on this group in 2010 ? with indication that it is caused by the 6412968 bug (CMS: Long initial mark). We tried suggested workarounds and found out that they cannot be applied in our case (limited number of available RAM) and sooner or later we hit the promotion or/and concurrent mode failure with even worse STW time. 
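(Illustrative aside, not part of Bartek's message: the initial-mark line quoted above, "[1 CMS-initial-mark: 2209921K(3145728K)] 4768007K(5976896K), 3.4458003 secs", reports old-gen used(capacity) followed by whole-heap used(capacity), so the young-gen occupancy at initial mark can be read off as the difference. A rough extractor for spotting which initial marks ran against a nearly full young gen, assuming exactly that log format:)
--------------------------------
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InitialMarkOccupancy {
    // Matches e.g. "[1 CMS-initial-mark: 2209921K(3145728K)] 4768007K(5976896K), 3.4458003 secs]"
    private static final Pattern P = Pattern.compile(
            "\\[1 CMS-initial-mark: (\\d+)K\\((\\d+)K\\)\\] (\\d+)K\\((\\d+)K\\), ([\\d.]+) secs\\]");

    public static void main(String[] args) throws Exception {
        BufferedReader r = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = r.readLine()) != null; ) {
            Matcher m = P.matcher(line);
            if (m.find()) {
                long oldUsedK = Long.parseLong(m.group(1));   // old gen used at initial mark
                long heapUsedK = Long.parseLong(m.group(3));  // whole heap used at initial mark
                double pause = Double.parseDouble(m.group(5));
                // young occupancy at initial mark = whole heap used minus old gen used
                System.out.printf("initial-mark pause=%.3fs youngUsed=%dK%n",
                        pause, heapUsedK - oldUsedK);
            }
        }
        r.close();
    }
}
--------------------------------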
Unfortunately, as in 2010, bugs.sun.com does not show the bug so I cannot check if there was any update for the bug, so here comes my first question: was there any update for the bug (what?s the status of the bug)? The next problem that we faced was related to another CMS related bug (6990419). After we applied the suggested workaround (to enable scavenge before remark) the problem was almost completely removed with one exception. There is the following CMS remark: 199848.296: [GC 199848.296: [ParNew: 2689127K->119235K(2831168K), 0.0906522 secs] 4868292K->2309941K(5976896K), 0.0912736 secs] [Times: user=1.22 sys=0.02, real=0.09 secs] 199853.617: [GC 199853.617: [ParNew: 2635843K->91628K(2831168K), 0.1040602 secs] 4826549K->2311078K(5976896K), 0.1046178 secs] [Times: user=1.15 sys=0.01, real=0.10 secs] 199853.726: [GC [1 CMS-initial-mark: 2219449K(3145728K)] 2311170K(5976896K), 0.1208219 secs] [Times: user=0.12 sys=0.00, real=0.12 secs] 199853.847: [CMS-concurrent-mark-start] 199856.405: [CMS-concurrent-mark: 2.547/2.557 secs] [Times: user=18.49 sys=0.35, real=2.56 secs] 199856.405: [CMS-concurrent-preclean-start] 199856.438: [CMS-concurrent-preclean: 0.031/0.033 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] 199856.439: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 199861.899: [CMS-concurrent-abortable-preclean: 5.443/5.460 secs] [Times: user=27.67 sys=1.14, real=5.46 secs] 199861.903: [GC[YG occupancy: 1353639 K (2831168 K)]199861.903: [Rescan (parallel) , 1.4282026 secs]199863.332: [weak refs processing, 0.0019473 secs]199863.334: [class unloading, 0.0365617 secs]199863.370: [scrub symbol & string tables, 0.0267902 secs] [1 CMS-remark: 2219449K(3145728K)] 3573089K(5976896K), 1.5099836 secs] [Times: user=12.20 sys=0.17, real=1.51 secs] 199863.414: [CMS-concurrent-sweep-start] 199863.420: [GC 199863.421: [ParNew: 1355519K->53699K(2831168K), 0.1129519 secs] 3574969K->2308972K(5976896K), 0.1138995 secs] [Times: user=1.10 sys=0.01, real=0.11 secs] 199865.857: [CMS-concurrent-sweep: 2.324/2.443 secs] [Times: user=10.15 sys=0.61, real=2.44 secs] 199865.857: [CMS-concurrent-reset-start] 199865.888: [CMS-concurrent-reset: 0.031/0.031 secs] [Times: user=0.05 sys=0.00, real=0.03 secs] 199893.779: [GC 199893.780: [ParNew: 2570307K->58285K(2831168K), 0.0397922 secs] 4620179K->2108197K(5976896K), 0.0403072 secs] [Times: user=0.68 sys=0.00, real=0.04 secs] 199906.510: [GC 199906.510: [ParNew: 2574893K->55484K(2831168K), 0.0390212 secs] 4624805K->2105432K(5976896K), 0.0395148 secs] [Times: user=0.67 sys=0.01, real=0.04 secs] There are two things to notice here: 1. The time of this rescan was 20 times longer than any other rescan time (1.4 seconds comparing to 58 ms) 2. There was no minor GC before CMS-remark even though it was explicitly requested. The question here is: is that something already covered by the 6990419 for which workaround simply does not work or something else? Thank you, Bartek From charlie.hunt at oracle.com Thu Jan 12 05:55:57 2012 From: charlie.hunt at oracle.com (charlie hunt) Date: Thu, 12 Jan 2012 07:55:57 -0600 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> Message-ID: <4F0EE66D.7090406@oracle.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/858971dd/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5166 bytes Desc: S/MIME Cryptographic Signature Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/858971dd/smime-0001.p7s From charlie.hunt at oracle.com Thu Jan 12 11:11:26 2012 From: charlie.hunt at oracle.com (charlie hunt) Date: Thu, 12 Jan 2012 13:11:26 -0600 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <1C5F8123-6AA8-4D08-A248-50F3CE4A5516@gmail.com> References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> <4F0EE66D.7090406@oracle.com> <1C5F8123-6AA8-4D08-A248-50F3CE4A5516@gmail.com> Message-ID: <4F0F305E.1010305@oracle.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/6c7ea4c1/attachment.html -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5166 bytes Desc: S/MIME Cryptographic Signature Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/6c7ea4c1/smime.p7s From kirk.pepperdine at gmail.com Thu Jan 12 11:03:41 2012 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Thu, 12 Jan 2012 20:03:41 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> Message-ID: Hi, CMS failures occur as a result of a trend over time. It's almost impossible to recommend a correction from a single incident. that said, Jon's blog entry explains CMS failure very clearly. This the record you've sent suggests that young gen is way too small.. but again, I can't say anything with a single record. Regards, Kirk On 2012-01-12, at 1:09 PM, Li Li wrote: > yesterday, we set the maxNewSize to 256mb. And it works as we expected. but an hours ago, there is a promotion failure and a concurrent mode failure which cost 14s! could anyone explain the gc logs for me? > or any documents for the gc log format explanation? > > 1. Desired survivor size 3342336 bytes, new threshold 5 (max 5) > it says survivor size is 3mb > 2. 58282K->57602K(59072K), 0.0543930 secs] > it says before young gc the memory used is 58282K, after young gc, there are 57602K live objects and the total young space is 59072K > 3. (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] > before old gc, 7.9GB is used. after old gc 3GB is alive. total old space is 7.9GB > > in which situation will occur promotion failure and concurrent mode failure? > from http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ > the author says when CMS is doing concurrent work and JVM is asked for more memory. if there isn't any space for new allocation. then it will occur concurrent mode failure and it will stop the world and do a serial old gc. > if there exist enough space but they are fragemented, then a promotion failure will occur. > am I right? 
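(Illustrative aside, not part of the messages above, on the "Desired survivor size" question: it is not the survivor-space size itself but the fill target used when the tenuring threshold is recomputed, roughly survivor capacity * TargetSurvivorRatio / 100, with TargetSurvivorRatio defaulting to 50. A back-of-the-envelope calculation under the default eden-plus-two-survivors layout; the 64 MB young size is an assumption chosen because it lands near the 3342336 bytes shown in the log, not a value reported in the thread:)
--------------------------------
public class DesiredSurvivorSize {
    public static void main(String[] args) {
        long newSizeBytes = 64L * 1024 * 1024; // assumption: ~64 MB committed young gen
        int survivorRatio = 8;                 // default -XX:SurvivorRatio
        int targetSurvivorRatio = 50;          // default -XX:TargetSurvivorRatio

        // young gen = eden + 2 survivor spaces, with eden/survivor = SurvivorRatio
        long survivorCapacity = newSizeBytes / (survivorRatio + 2);
        // roughly the "Desired survivor size" printed with the tenuring distribution
        long desiredSurvivorSize = survivorCapacity * targetSurvivorRatio / 100;

        System.out.println("survivor capacity ~ " + survivorCapacity + " bytes");
        System.out.println("desired survivor  ~ " + desiredSurvivorSize + " bytes");
    }
}
--------------------------------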
> > 2012-01-12T18:27:32.582+0800: [GC [ParNew > Desired survivor size 3342336 bytes, new threshold 1 (max 5) > - age 1: 4594648 bytes, 4594648 total > - age 2: 569200 bytes, 5163848 total > : 58548K->5738K(59072K), 0.0159400 secs] 7958648K->7908502K(7984352K), 0.0160610 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] > 2012-01-12T18:27:32.609+0800: [GC [ParNew (promotion failed) > Desired survivor size 3342336 bytes, new threshold 5 (max 5) > - age 1: 1666376 bytes, 1666376 total > : 58282K->57602K(59072K), 0.0543930 secs][CMS2012-01-12T18:27:33.804+0800: [CMS-concurrent-preclean: 14.098/34.323 secs] [Times: user=370.28 sys=5.65, real=34.31 secs] > (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] > > On Wed, Jan 11, 2012 at 5:48 PM, Kirk Pepperdine wrote: > CMS is not adaptive. To reconfigure heap, for many reasons, you need a full GC to occur. The response to a concurrent mode failure is always a full GC. That gave the JVM the opportunity to resize heap space. If this behaviour isn't happening when it should or is cause other problems it's time to either set the young gen size directly with NewSize or switch to the parallel collector with the adaptive sizing policy turned on. Logic here is that you want to avoid long pauses, use CMS. If CMS is giving you long pauses, than the parallel collector might be a better choice. > > Regards, > Kirk > > On 2012-01-11, at 10:32 AM, Li Li wrote: > > > after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB > > What's the logic behind this? > > > > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] > > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] > > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] > > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] > > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] > > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] > > > > _______________________________________________ > > hotspot-gc-use mailing list > > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/9ec8a7e7/attachment-0001.html From kirk.pepperdine at gmail.com Thu Jan 12 11:08:17 2012 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Thu, 12 Jan 2012 20:08:17 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <4F0EE66D.7090406@oracle.com> References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> <4F0EE66D.7090406@oracle.com> Message-ID: <1C5F8123-6AA8-4D08-A248-50F3CE4A5516@gmail.com> Charlie, You shameless shameless self promoter!!!!! Shame on you!!!! LiLi, please ignore Charlie's shameless self promotion and run out and buy the book. I think it will be of great help to your understanding of the problems your currently facing. Charlie, what's my commission on the sale? Regards, Kirk ps ;-) On 2012-01-12, at 2:55 PM, charlie hunt wrote: > At the risk of sounding self-promotional ....based on the questions you are asking, I think you'd find a lot of value in the Java Performance book: > http://www.amazon.com/Java-Performance-Charlie-Hunt/dp/0137142528 > > Many of the folks on the mailing list were key contributors to its content. > > Almost forget ... yes, the book offers a description of the GC log and it also offers suggestions on how you can use the "Desired survivor size", "new threshold" and tenuring distribution information to help determine how to size young generation. > > hths, > > charlie ... > > On 01/12/12 06:09 AM, Li Li wrote: >> >> yesterday, we set the maxNewSize to 256mb. And it works as we expected. but an hours ago, there is a promotion failure and a concurrent mode failure which cost 14s! could anyone explain the gc logs for me? >> or any documents for the gc log format explanation? >> >> 1. Desired survivor size 3342336 bytes, new threshold 5 (max 5) >> it says survivor size is 3mb >> 2. 58282K->57602K(59072K), 0.0543930 secs] >> it says before young gc the memory used is 58282K, after young gc, there are 57602K live objects and the total young space is 59072K >> 3. (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] >> before old gc, 7.9GB is used. after old gc 3GB is alive. total old space is 7.9GB >> >> in which situation will occur promotion failure and concurrent mode failure? >> from http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ >> the author says when CMS is doing concurrent work and JVM is asked for more memory. if there isn't any space for new allocation. then it will occur concurrent mode failure and it will stop the world and do a serial old gc. >> if there exist enough space but they are fragemented, then a promotion failure will occur. >> am I right? 
>> >> 2012-01-12T18:27:32.582+0800: [GC [ParNew >> Desired survivor size 3342336 bytes, new threshold 1 (max 5) >> - age 1: 4594648 bytes, 4594648 total >> - age 2: 569200 bytes, 5163848 total >> : 58548K->5738K(59072K), 0.0159400 secs] 7958648K->7908502K(7984352K), 0.0160610 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] >> 2012-01-12T18:27:32.609+0800: [GC [ParNew (promotion failed) >> Desired survivor size 3342336 bytes, new threshold 5 (max 5) >> - age 1: 1666376 bytes, 1666376 total >> : 58282K->57602K(59072K), 0.0543930 secs][CMS2012-01-12T18:27:33.804+0800: [CMS-concurrent-preclean: 14.098/34.323 secs] [Times: user=370.28 sys=5.65, real=34.31 secs] >> (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] >> >> On Wed, Jan 11, 2012 at 5:48 PM, Kirk Pepperdine wrote: >> CMS is not adaptive. To reconfigure heap, for many reasons, you need a full GC to occur. The response to a concurrent mode failure is always a full GC. That gave the JVM the opportunity to resize heap space. If this behaviour isn't happening when it should or is cause other problems it's time to either set the young gen size directly with NewSize or switch to the parallel collector with the adaptive sizing policy turned on. Logic here is that you want to avoid long pauses, use CMS. If CMS is giving you long pauses, than the parallel collector might be a better choice. >> >> Regards, >> Kirk >> >> On 2012-01-11, at 10:32 AM, Li Li wrote: >> >> > after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB >> > What's the logic behind this? >> > >> > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] >> > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] >> > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] >> > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] >> > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] >> > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] >> > >> > _______________________________________________ >> > hotspot-gc-use mailing list >> > hotspot-gc-use at openjdk.java.net >> > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/80268148/attachment.html From gbowyer at fastmail.co.uk Thu Jan 19 13:37:00 2012 From: gbowyer at fastmail.co.uk (Greg Bowyer) Date: Thu, 19 Jan 2012 13:37:00 -0800 Subject: CMS Sudden long pauses with large number of weak references Message-ID: <4F188CFC.9030401@fastmail.co.uk> Hi all I have an application that is solr/lucene running on JDK 1.6 / 1.7 with CMS, 9Gb heap. The only unusual options here are DisableExplictGC and UseCompressedOops (which I do not think are the issue). Generally the application has a heap usage of ~4G with a 2-3G float of continual garbage on top. The application creates several large heap objects, some of these are sizeable arrays (~300mb), some of these are objects that have large retained graphs. One of these is a WeakHashMap that in lucene stores IndexReaders (as the key) -> "Fields", this mapping is used to avoid disk access, there are not typically a large number of readers open at any time (at worst this could be say 25). Generally we see the GC behavior being fairly solid across JVMS, however we get ever increasing amounts of concurrent-mode-failues with a marked spike of weak references, this only seems to appear when new indexes load on lucene, which is when the references are changed and when new versions of these caches are created. ---- %< ---- 4747.399: [GC 4747.399: [ParNew4747.492: [SoftReference, 0 refs, 0.0000120 secs]4747.492: [WeakReference, 5 refs, 0.0000050 secs]4747.492: [FinalReference, 0 refs, 0.0000040 secs]4747.492: [PhantomReference, 0 refs, 0.0000030 secs]4747.492: [JNI Weak Reference, 0.0000020 secs]: 133242K->11776K(153344K), 0.0925850 secs] 7199972K->7187397K(9420160K), 0.0926840 secs] [Times: user=0.72 sys=0.00, real=0.09 secs] Total time for which application threads were stopped: 0.0931920 seconds 4747.530: [GC [1 CMS-initial-mark: 7175620K(9266816K)] 7300153K(9420160K), 0.0044900 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] Total time for which application threads were stopped: 0.0398160 seconds 4747.535: [CMS-concurrent-mark-start] 4747.537: [GC 4747.537: [ParNew4747.568: [SoftReference, 0 refs, 0.0000120 secs]4747.568: [WeakReference, 6 refs, 0.0000060 secs]4747.568: [FinalReference, 0 refs, 0.0000040 secs]4747.568: [PhantomReference, 0 refs, 0.0000040 secs]4747.568: [JNI Weak Reference, 0.0000020 secs] (promotion failed): 127253K->127965K(153344K), 0.0315280 secs]4747.569: [CMS4755.725: [CMS-concurrent-mark: 8.157/8.190 secs] [Times: user=10.81 sys=3.49, real=8.19 secs] (concurrent mode failure)4757.436: [SoftReference, 0 refs, 0.0000050 secs]4757.436: [WeakReference, 6360 refs, 0.0016500 secs]4757.437: [FinalReference, 462 refs, 0.0088580 secs]4757.446: [PhantomReference, 114 refs, 0.0000220 secs]4757.446: [JNI Weak Reference, 0.0000070 secs]: 7180161K->3603572K(9266816K), 17.1070460 secs] 7302874K->3603572K(9420160K), [CMS Perm : 47059K->47044K(78380K)], 17.1387610 secs] [Times: user=19.75 sys=3.49, real=17.14 secs] Total time for which application threads were stopped: 17.1393510 seconds Total time for which application threads were stopped: 0.0245300 seconds ---- >% ---- ? Is this due to the defects fixed here http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f1391adc6681, would -XX:+ParallelRefProcEnabled help here ? 
Also is there anything that could explain the large number of suddenly processed and traced weak-references Many thanks -- Greg From rednaxelafx at gmail.com Thu Jan 19 19:40:09 2012 From: rednaxelafx at gmail.com (Krystal Mok) Date: Fri, 20 Jan 2012 11:40:09 +0800 Subject: CMS Sudden long pauses with large number of weak references In-Reply-To: <4F188CFC.9030401@fastmail.co.uk> References: <4F188CFC.9030401@fastmail.co.uk> Message-ID: Hi Greg, This might just be another case of "7112034: Parallel CMS fails to properly mark reference objects" [1], which is only fixed recently, and isn't delivered in any FCS releases yet. Could you try a recent JDK7u4 preview and see if the problem goes away? Or, try -XX:-CMSConcurrentMTEnabled, which should workaround this particular bug, but you'll get longer CMS collection cycles. - Kris [1]: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7112034 [2]: http://jdk7.java.net/download.html On Fri, Jan 20, 2012 at 5:37 AM, Greg Bowyer wrote: > Hi all > > I have an application that is solr/lucene running on JDK 1.6 / 1.7 with > CMS, 9Gb heap. The only unusual options here are DisableExplictGC and > UseCompressedOops (which I do not think are the issue). > > Generally the application has a heap usage of ~4G with a 2-3G float of > continual garbage on top. > > The application creates several large heap objects, some of these are > sizeable arrays (~300mb), some of these are objects that have large > retained graphs. > > One of these is a WeakHashMap that in lucene stores IndexReaders (as the > key) -> "Fields", this mapping is used to avoid disk access, there are > not typically a large number of readers open at any time (at worst this > could be say 25). > > Generally we see the GC behavior being fairly solid across JVMS, however > we get ever increasing amounts of concurrent-mode-failues with a marked > spike of weak references, this only seems to appear when new indexes > load on lucene, which is when the references are changed and when new > versions of these caches are created. 
> > ---- %< ---- > 4747.399: [GC 4747.399: [ParNew4747.492: [SoftReference, 0 refs, > 0.0000120 secs]4747.492: [WeakReference, 5 refs, 0.0000050 > secs]4747.492: [FinalReference, 0 refs, 0.0000040 secs]4747.492: > [PhantomReference, 0 refs, 0.0000030 secs]4747.492: [JNI Weak Reference, > 0.0000020 secs]: 133242K->11776K(153344K), 0.0925850 secs] > 7199972K->7187397K(9420160K), 0.0926840 secs] [Times: user=0.72 > sys=0.00, real=0.09 secs] > Total time for which application threads were stopped: 0.0931920 seconds > 4747.530: [GC [1 CMS-initial-mark: 7175620K(9266816K)] > 7300153K(9420160K), 0.0044900 secs] [Times: user=0.00 sys=0.00, > real=0.00 secs] > Total time for which application threads were stopped: 0.0398160 seconds > 4747.535: [CMS-concurrent-mark-start] > 4747.537: [GC 4747.537: [ParNew4747.568: [SoftReference, 0 refs, > 0.0000120 secs]4747.568: [WeakReference, 6 refs, 0.0000060 > secs]4747.568: [FinalReference, 0 refs, 0.0000040 secs]4747.568: > [PhantomReference, 0 refs, 0.0000040 secs]4747.568: [JNI Weak Reference, > 0.0000020 secs] (promotion failed): 127253K->127965K(153344K), 0.0315280 > secs]4747.569: [CMS4755.725: [CMS-concurrent-mark: 8.157/8.190 secs] > [Times: user=10.81 sys=3.49, real=8.19 secs] > (concurrent mode failure)4757.436: [SoftReference, 0 refs, 0.0000050 > secs]4757.436: [WeakReference, 6360 refs, 0.0016500 secs]4757.437: > [FinalReference, 462 refs, 0.0088580 secs]4757.446: [PhantomReference, > 114 refs, 0.0000220 secs]4757.446: [JNI Weak Reference, 0.0000070 secs]: > 7180161K->3603572K(9266816K), 17.1070460 secs] > 7302874K->3603572K(9420160K), [CMS Perm : 47059K->47044K(78380K)], > 17.1387610 secs] [Times: user=19.75 sys=3.49, real=17.14 secs] > Total time for which application threads were stopped: 17.1393510 seconds > Total time for which application threads were stopped: 0.0245300 seconds > ---- >% ---- > > ? Is this due to the defects fixed here > http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f1391adc6681, > would -XX:+ParallelRefProcEnabled help here ? > > Also is there anything that could explain the large number of suddenly > processed and traced weak-references > > Many thanks > -- Greg > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120120/ed4b5f1f/attachment.html From gbowyer at fastmail.co.uk Fri Jan 20 10:48:32 2012 From: gbowyer at fastmail.co.uk (Greg Bowyer) Date: Fri, 20 Jan 2012 10:48:32 -0800 Subject: CMS Sudden long pauses with large number of weak references In-Reply-To: References: <4F188CFC.9030401@fastmail.co.uk> Message-ID: <4F19B700.8030404@fastmail.co.uk> I didnt know this was in a forthcoming release so I had already compiled JDK7 from tip with the change (since its so small and simple) I still saw some pauses, although I think that might be down to CMS starting to late, so I have lowered the Occupancy Fraction and started testing. So far with the fix for 7112034 , ParallelRefProcessing and an Occupancy Fraction of 70 I am seeing fairly good results with the worst pause time so far of ~200 ms (this is far better than multi-second pauses I had before) I will let a few more tests run and then try jdk7u4 and let you know if this helps me thank you for confirming my thoughts on this. 
-- Greg On 19/01/12 19:40, Krystal Mok wrote: > Hi Greg, > > This might just be another case of "7112034: Parallel CMS fails to > properly mark reference objects" [1], which is only fixed recently, > and isn't delivered in any FCS releases yet. > > Could you try a recent JDK7u4 preview and see if the problem goes > away? Or, try -XX:-CMSConcurrentMTEnabled, which should workaround > this particular bug, but you'll get longer CMS collection cycles. > > - Kris > > [1]: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7112034 > [2]: http://jdk7.java.net/download.html > > On Fri, Jan 20, 2012 at 5:37 AM, Greg Bowyer > wrote: > > Hi all > > I have an application that is solr/lucene running on JDK 1.6 / 1.7 > with > CMS, 9Gb heap. The only unusual options here are DisableExplictGC and > UseCompressedOops (which I do not think are the issue). > > Generally the application has a heap usage of ~4G with a 2-3G float of > continual garbage on top. > > The application creates several large heap objects, some of these are > sizeable arrays (~300mb), some of these are objects that have large > retained graphs. > > One of these is a WeakHashMap that in lucene stores IndexReaders > (as the > key) -> "Fields", this mapping is used to avoid disk access, there are > not typically a large number of readers open at any time (at worst > this > could be say 25). > > Generally we see the GC behavior being fairly solid across JVMS, > however > we get ever increasing amounts of concurrent-mode-failues with a > marked > spike of weak references, this only seems to appear when new indexes > load on lucene, which is when the references are changed and when new > versions of these caches are created. > > ---- %< ---- > 4747.399: [GC 4747.399: [ParNew4747.492: [SoftReference, 0 refs, > 0.0000120 secs]4747.492: [WeakReference, 5 refs, 0.0000050 > secs]4747.492: [FinalReference, 0 refs, 0.0000040 secs]4747.492: > [PhantomReference, 0 refs, 0.0000030 secs]4747.492: [JNI Weak > Reference, > 0.0000020 secs]: 133242K->11776K(153344K), 0.0925850 secs] > 7199972K->7187397K(9420160K), 0.0926840 secs] [Times: user=0.72 > sys=0.00, real=0.09 secs] > Total time for which application threads were stopped: 0.0931920 > seconds > 4747.530: [GC [1 CMS-initial-mark: 7175620K(9266816K)] > 7300153K(9420160K), 0.0044900 secs] [Times: user=0.00 sys=0.00, > real=0.00 secs] > Total time for which application threads were stopped: 0.0398160 > seconds > 4747.535: [CMS-concurrent-mark-start] > 4747.537: [GC 4747.537: [ParNew4747.568: [SoftReference, 0 refs, > 0.0000120 secs]4747.568: [WeakReference, 6 refs, 0.0000060 > secs]4747.568: [FinalReference, 0 refs, 0.0000040 secs]4747.568: > [PhantomReference, 0 refs, 0.0000040 secs]4747.568: [JNI Weak > Reference, > 0.0000020 secs] (promotion failed): 127253K->127965K(153344K), > 0.0315280 > secs]4747.569: [CMS4755.725: [CMS-concurrent-mark: 8.157/8.190 secs] > [Times: user=10.81 sys=3.49, real=8.19 secs] > (concurrent mode failure)4757.436: [SoftReference, 0 refs, 0.0000050 > secs]4757.436: [WeakReference, 6360 refs, 0.0016500 secs]4757.437: > [FinalReference, 462 refs, 0.0088580 secs]4757.446: [PhantomReference, > 114 refs, 0.0000220 secs]4757.446: [JNI Weak Reference, 0.0000070 > secs]: > 7180161K->3603572K(9266816K), 17.1070460 secs] > 7302874K->3603572K(9420160K), [CMS Perm : 47059K->47044K(78380K)], > 17.1387610 secs] [Times: user=19.75 sys=3.49, real=17.14 secs] > Total time for which application threads were stopped: 17.1393510 > seconds > Total time for which application threads were 
stopped: 0.0245300 > seconds > ---- >% ---- > > ? Is this due to the defects fixed here > http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f1391adc6681, > would -XX:+ParallelRefProcEnabled help here ? > > Also is there anything that could explain the large number of suddenly > processed and traced weak-references > > Many thanks > -- Greg > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120120/283a2161/attachment-0001.html From gbowyer at fastmail.co.uk Mon Jan 23 14:39:52 2012 From: gbowyer at fastmail.co.uk (Greg Bowyer) Date: Mon, 23 Jan 2012 14:39:52 -0800 Subject: Odd pause with ParNew Message-ID: <4F1DE1B8.1070909@fastmail.co.uk> Hi all, working through my GC issues I have found that the fix for CMS weak references is making my GC far more predictable However I occasionally find that sometimes I see a ParNew collection of 1 second, outside of the number of VM operations for the safepoint there is nothing that would seem to be an issue ? does anyone know why this ParNew claims a 1 second wait ? This is for a JDK compiled from the tip of jdk7/jdk7 repo in openjdk with the recent fix for CMS parallel marking. ---- %< ---- Total time for which application threads were stopped: 0.0005490 seconds Application time: 0.7253910 seconds 38668.349: [GC 38668.349: [ParNew38669.323: [SoftReference, 0 refs, 0.0000160 secs]38669.323: [WeakReference, 12 refs, 0.0000080 secs]38669.323: [FinalReference, 0 refs, 0.0000040 secs]38669.323: [PhantomReference, 0 refs, 0.0000060 secs]38669.323: [JNI Weak Reference, 0.0000080 secs] Desired survivor size 34865152 bytes, new threshold 1 (max 6) - age 1: 69728352 bytes, 69728352 total : 610922K->68096K(613440K), 0.9741030 secs] 10005726K->9689043K(12514816K), 0.9742160 secs] [Times: user=7.26 sys=0.01, real=0.97 secs] Total time for which application threads were stopped: 1.0050700 seconds ---- >% ---- This matches the following safepoint (I guess): ---- %< ---- 38668.316: GenCollectForAllocation [ 55 7 10 ] [ 0 0 30 0 974 ] 7 ---- >% ---- Which is hiding for reference in --- %< ---- vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count 38643.172: GenCollectForAllocation [ 55 9 12 ] [ 0 0 148 0 130 ] 9 38646.203: GenCollectForAllocation [ 55 12 12 ] [ 2 0 22 0 139 ] 12 38648.293: GenCollectForAllocation [ 55 13 13 ] [ 1 0 1 0 144 ] 13 38649.852: CMS_Final_Remark [ 55 14 14 ] [ 0 0 1 0 168 ] 14 38652.328: FindDeadlocks [ 55 16 16 ] [ 0 0 59 0 0 ] 16 38652.387: FindDeadlocks [ 55 4 4 ] [ 0 0 0 0 0 ] 4 38652.762: GenCollectForAllocation [ 55 9 9 ] [ 0 0 24 0 132 ] 9 38655.586: GenCollectForAllocation [ 55 10 10 ] [ 0 1 9 0 294 ] 10 38656.961: GenCollectForAllocation [ 55 6 6 ] [ 0 0 0 0 215 ] 5 38658.125: GenCollectForAllocation [ 55 3 4 ] [ 0 0 1 0 91 ] 2 38658.223: CMS_Initial_Mark [ 55 2 4 ] [ 0 0 0 0 6 ] 1 38658.926: GenCollectForAllocation [ 55 6 6 ] [ 0 0 1 0 102 ] 6 38661.621: GenCollectForAllocation [ 55 5 5 ] [ 0 0 1 0 72 ] 5 38663.527: GenCollectForAllocation [ 55 7 7 ] [ 0 0 0 0 56 ] 7 38665.180: GenCollectForAllocation [ 55 5 5 ] [ 0 0 1 0 659 ] 4 38667.230: GenCollectForAllocation [ 55 11 11 ] [ 0 0 0 0 88 ] 11 38667.566: FindDeadlocks [ 55 13 13 ] [ 0 0 0 0 0 ] 13 38667.566: FindDeadlocks [ 55 5 5 ] [ 0 0 0 0 0 ] 5 
38667.582: RevokeBias [ 55 10 11 ] [ 0 0 8 0 0 ] 9 > 38668.316: GenCollectForAllocation [ 55 7 10 ] [ 0 0 30 0 974 ] 7 38670.551: GenCollectForAllocation [ 55 10 10 ] [ 0 0 1 0 391 ] 10 38671.875: GenCollectForAllocation [ 55 6 6 ] [ 0 0 25 0 409 ] 6 38674.230: GenCollectForAllocation [ 55 11 11 ] [ 1 0 1 0 415 ] 10 38676.121: GenCollectForAllocation [ 55 1 1 ] [ 0 0 0 0 558 ] 0 38677.691: GenCollectForAllocation [ 55 7 9 ] [ 0 0 1 0 388 ] 6 38679.488: GenCollectForAllocation [ 55 6 9 ] [ 0 0 10 0 297 ] 6 38680.367: CMS_Final_Remark [ 55 12 12 ] [ 0 0 0 0 310 ] 12 38682.016: GenCollectForAllocation [ 55 10 11 ] [ 0 0 0 0 295 ] 10 38683.473: FindDeadlocks [ 55 6 8 ] [ 1 0 9 0 0 ] 4 38683.484: FindDeadlocks [ 55 7 7 ] [ 0 0 0 0 0 ] 7 vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count 38683.867: GenCollectForAllocation [ 55 7 8 ] [ 1 0 2 0 198 ] 7 38685.070: GenCollectForAllocation [ 55 8 8 ] [ 0 0 0 0 430 ] 8 38687.312: GenCollectForAllocation [ 55 6 8 ] [ 0 0 1 0 287 ] 6 38690.094: GenCollectForAllocation [ 55 8 10 ] [ 0 0 16 0 49 ] 7 38692.910: CMS_Initial_Mark [ 55 3 3 ] [ 0 0 1 0 129 ] 2 38694.043: no vm operation [ 55 7 9 ] [ 0 0 644 0 0 ] 5 38696.605: GenCollectForAllocation [ 55 9 9 ] [ 0 0 272 0 90 ] 9 38698.535: FindDeadlocks [ 55 8 13 ] [ 0 0 40 0 0 ] 8 38698.578: FindDeadlocks [ 55 8 8 ] [ 0 0 3 0 0 ] 8 38702.559: GenCollectForAllocation [ 55 6 6 ] [ 0 0 1 0 55 ] 6 38709.008: GenCollectForAllocation [ 55 1 1 ] [ 0 0 0 0 48 ] 0 38712.719: CMS_Final_Remark [ 55 3 3 ] [ 0 0 0 0 51 ] 2 38713.625: FindDeadlocks [ 55 2 2 ] [ 0 0 0 0 0 ] 0 38713.625: FindDeadlocks [ 55 2 2 ] [ 0 0 0 0 0 ] 2 38720.492: CMS_Initial_Mark [ 55 4 4 ] [ 0 0 1 0 88 ] 4 38721.059: GenCollectForAllocation [ 55 4 4 ] [ 0 0 1 0 58 ] 3 38725.684: GenCollectForAllocation [ 55 3 4 ] [ 0 0 0 0 57 ] 2 38725.742: CMS_Final_Remark [ 55 2 4 ] [ 0 0 0 0 254 ] 2 38729.410: FindDeadlocks [ 55 5 5 ] [ 0 0 0 0 0 ] 5 38729.410: FindDeadlocks [ 55 5 5 ] [ 0 0 0 0 0 ] 5 38730.527: GenCollectForAllocation [ 55 11 10 ] [ 1 0 16 0 43 ] 8 38734.977: GenCollectForAllocation [ 55 2 2 ] [ 0 0 0 0 53 ] 0 ---- >% ---- Any thoughts ? Many thanks -- Greg From ysr1729 at gmail.com Mon Jan 23 17:45:34 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 23 Jan 2012 17:45:34 -0800 Subject: Odd pause with ParNew In-Reply-To: <4F1DE1B8.1070909@fastmail.co.uk> References: <4F1DE1B8.1070909@fastmail.co.uk> Message-ID: Hi Greg -- one of the first things I check in such cases is to see if that particular scavenge happened to copy much more data than the rest. (Since scavenge times are directly proportional to copy/allocation costs.) The second thing I check is whether the scavenge immediately followed a compacting collection (for which there is a known allocation pathology with an existing CR). For the first above, there isn't sufficient surrounding info/context, but the "new threshold (max 6)" indicates that there may have been a spurt in promotion (which is explained by the sudden lowering of tenuring threshold), so you may find upon further analysis that there was indeed a sudden jump in surviving objects (and thence a concommitant increase perhaps in objects copied). >From the safepoint stats you present it doesn't look like we'd have to look beyond GC to find an explanation (i.e. safeppointing or non-GC processing do not seem to be implicated here). HTHS. 
-- ramki On Mon, Jan 23, 2012 at 2:39 PM, Greg Bowyer wrote: > Hi all, working through my GC issues I have found that the fix for CMS > weak references is making my GC far more predictable > > However I occasionally find that sometimes I see a ParNew collection of > 1 second, outside of the number of VM operations for the safepoint there > is nothing that would seem to be an issue ? does anyone know why this > ParNew claims a 1 second wait ? > > This is for a JDK compiled from the tip of jdk7/jdk7 repo in openjdk > with the recent fix for CMS parallel marking. > > ---- %< ---- > Total time for which application threads were stopped: 0.0005490 seconds > Application time: 0.7253910 seconds > 38668.349: [GC 38668.349: [ParNew38669.323: [SoftReference, 0 refs, > 0.0000160 secs]38669.323: [WeakReference, 12 refs, 0.0000080 > secs]38669.323: [FinalReference, 0 refs, 0.0000040 secs]38669.323: > [PhantomReference, 0 refs, 0.0000060 secs]38669.323: [JNI Weak > Reference, 0.0000080 secs] > Desired survivor size 34865152 bytes, new threshold 1 (max 6) > - age 1: 69728352 bytes, 69728352 total > : 610922K->68096K(613440K), 0.9741030 secs] > 10005726K->9689043K(12514816K), 0.9742160 secs] [Times: user=7.26 > sys=0.01, real=0.97 secs] > Total time for which application threads were stopped: 1.0050700 seconds > ---- >% ---- > > This matches the following safepoint (I guess): > ---- %< ---- > 38668.316: GenCollectForAllocation [ 55 > 7 10 ] [ 0 0 30 0 974 ] 7 > ---- >% ---- > > Which is hiding for reference in > > --- %< ---- > vmop [threads: total initially_running > wait_to_block] [time: spin block sync cleanup vmop] page_trap_count > 38643.172: GenCollectForAllocation [ 55 > 9 12 ] [ 0 0 148 0 130 ] 9 > 38646.203: GenCollectForAllocation [ 55 > 12 12 ] [ 2 0 22 0 139 ] 12 > 38648.293: GenCollectForAllocation [ 55 > 13 13 ] [ 1 0 1 0 144 ] 13 > 38649.852: CMS_Final_Remark [ 55 > 14 14 ] [ 0 0 1 0 168 ] 14 > 38652.328: FindDeadlocks [ 55 > 16 16 ] [ 0 0 59 0 0 ] 16 > 38652.387: FindDeadlocks [ 55 > 4 4 ] [ 0 0 0 0 0 ] 4 > 38652.762: GenCollectForAllocation [ 55 > 9 9 ] [ 0 0 24 0 132 ] 9 > 38655.586: GenCollectForAllocation [ 55 > 10 10 ] [ 0 1 9 0 294 ] 10 > 38656.961: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 0 0 215 ] 5 > 38658.125: GenCollectForAllocation [ 55 > 3 4 ] [ 0 0 1 0 91 ] 2 > 38658.223: CMS_Initial_Mark [ 55 > 2 4 ] [ 0 0 0 0 6 ] 1 > 38658.926: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 1 0 102 ] 6 > 38661.621: GenCollectForAllocation [ 55 > 5 5 ] [ 0 0 1 0 72 ] 5 > 38663.527: GenCollectForAllocation [ 55 > 7 7 ] [ 0 0 0 0 56 ] 7 > 38665.180: GenCollectForAllocation [ 55 > 5 5 ] [ 0 0 1 0 659 ] 4 > 38667.230: GenCollectForAllocation [ 55 > 11 11 ] [ 0 0 0 0 88 ] 11 > 38667.566: FindDeadlocks [ 55 > 13 13 ] [ 0 0 0 0 0 ] 13 > 38667.566: FindDeadlocks [ 55 > 5 5 ] [ 0 0 0 0 0 ] 5 > 38667.582: RevokeBias [ 55 > 10 11 ] [ 0 0 8 0 0 ] 9 > > 38668.316: GenCollectForAllocation [ 55 > 7 10 ] [ 0 0 30 0 974 ] 7 > 38670.551: GenCollectForAllocation [ 55 > 10 10 ] [ 0 0 1 0 391 ] 10 > 38671.875: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 25 0 409 ] 6 > 38674.230: GenCollectForAllocation [ 55 > 11 11 ] [ 1 0 1 0 415 ] 10 > 38676.121: GenCollectForAllocation [ 55 > 1 1 ] [ 0 0 0 0 558 ] 0 > 38677.691: GenCollectForAllocation [ 55 > 7 9 ] [ 0 0 1 0 388 ] 6 > 38679.488: GenCollectForAllocation [ 55 > 6 9 ] [ 0 0 10 0 297 ] 6 > 38680.367: CMS_Final_Remark [ 55 > 12 12 ] [ 0 0 0 0 310 ] 12 > 38682.016: GenCollectForAllocation [ 55 > 10 11 ] [ 0 0 0 0 295 ] 10 > 38683.473: FindDeadlocks [ 55 > 6 8 ] [ 1 0 
9 0 0 ] 4 > 38683.484: FindDeadlocks [ 55 > 7 7 ] [ 0 0 0 0 0 ] 7 > vmop [threads: total initially_running > wait_to_block] [time: spin block sync cleanup vmop] page_trap_count > 38683.867: GenCollectForAllocation [ 55 > 7 8 ] [ 1 0 2 0 198 ] 7 > 38685.070: GenCollectForAllocation [ 55 > 8 8 ] [ 0 0 0 0 430 ] 8 > 38687.312: GenCollectForAllocation [ 55 > 6 8 ] [ 0 0 1 0 287 ] 6 > 38690.094: GenCollectForAllocation [ 55 > 8 10 ] [ 0 0 16 0 49 ] 7 > 38692.910: CMS_Initial_Mark [ 55 > 3 3 ] [ 0 0 1 0 129 ] 2 > 38694.043: no vm operation [ 55 > 7 9 ] [ 0 0 644 0 0 ] 5 > 38696.605: GenCollectForAllocation [ 55 > 9 9 ] [ 0 0 272 0 90 ] 9 > 38698.535: FindDeadlocks [ 55 > 8 13 ] [ 0 0 40 0 0 ] 8 > 38698.578: FindDeadlocks [ 55 > 8 8 ] [ 0 0 3 0 0 ] 8 > 38702.559: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 1 0 55 ] 6 > 38709.008: GenCollectForAllocation [ 55 > 1 1 ] [ 0 0 0 0 48 ] 0 > 38712.719: CMS_Final_Remark [ 55 > 3 3 ] [ 0 0 0 0 51 ] 2 > 38713.625: FindDeadlocks [ 55 > 2 2 ] [ 0 0 0 0 0 ] 0 > 38713.625: FindDeadlocks [ 55 > 2 2 ] [ 0 0 0 0 0 ] 2 > 38720.492: CMS_Initial_Mark [ 55 > 4 4 ] [ 0 0 1 0 88 ] 4 > 38721.059: GenCollectForAllocation [ 55 > 4 4 ] [ 0 0 1 0 58 ] 3 > 38725.684: GenCollectForAllocation [ 55 > 3 4 ] [ 0 0 0 0 57 ] 2 > 38725.742: CMS_Final_Remark [ 55 > 2 4 ] [ 0 0 0 0 254 ] 2 > 38729.410: FindDeadlocks [ 55 > 5 5 ] [ 0 0 0 0 0 ] 5 > 38729.410: FindDeadlocks [ 55 > 5 5 ] [ 0 0 0 0 0 ] 5 > 38730.527: GenCollectForAllocation [ 55 > 11 10 ] [ 1 0 16 0 43 ] 8 > 38734.977: GenCollectForAllocation [ 55 > 2 2 ] [ 0 0 0 0 53 ] 0 > > ---- >% ---- > > Any thoughts ? Many thanks > > -- Greg > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120123/08d1c232/attachment-0001.html From taras.tielkes at gmail.com Tue Jan 24 10:15:39 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Tue, 24 Jan 2012 19:15:39 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F1ECE7B.3040502@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> Message-ID: Hi Jon, Xmx is 5g, PermGen is 256m, new is 400m. The overall tenured gen usage is at the point when I would expect the CMS to kick in though. Does this mean I'd have to lower the CMS initiating occupancy setting (currently at 68%)? In addition, are the promotion failure sizes expressed in bytes? If so, I'm surprised to see such odd-sized (7, for example) sizes. Thanks, Taras On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu wrote: > > Taras, > > The pattern makes sense if the tenured (cms) gen is in fact full. > Multiple ?GC workers try to get a chunk of space for > an allocation and there is no space. 
> > Jon > > > On 01/24/12 04:22, Taras Tielkes wrote: >> >> Hi Jon, >> >> While inspecting our production logs for promotion failures, I saw the >> following one today: >> -------- >> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >> 349623K->20008K(368640K), 0.0294350 secs] >> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >> sys=0.00, real=0.03 secs] >> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >> 347688K->40960K(368640K), 0.0536700 secs] >> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >> sys=0.00, real=0.05 secs] >> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >> (0: promotion failure size = 6) ?(1: promotion failure size = 6) ?(2: >> promotion failure size = 7) ?(3: promotion failure size = 7) ?(4: >> promotion failure size = 9) ?(5: p >> romotion failure size = 9) ?(6: promotion failure size = 6) ?(7: >> promotion failure size = 9) ?(promotion failed): >> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >> [Times: user=10.17 sys=1.10, real=9.11 secs] >> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >> -------- >> >> It's different from the others in two ways: >> 1) a "parallel" promotion failure in all 8 ParNew threads? >> 2) the very small size of the promoted object >> >> Do such an promotion failure pattern ring a bell? It does not make sense >> to me. >> >> Thanks, >> Taras >> >> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >> ?wrote: >>> >>> Taras, >>> >>>> I assume that the large sizes for the promotion failures during ParNew >>>> are confirming that eliminating large array allocations might help >>>> here. Do you agree? >>> >>> I agree that eliminating the large array allocation will help but you >>> are still having >>> promotion failures when the allocation size is small (I think it was >>> 1026). ?That >>> says that you are filling up the old (cms) generation faster than the GC >>> can >>> collect it. ?The large arrays are aggrevating the problem but not >>> necessarily >>> the cause. >>> >>> If these are still your heap sizes, >>> >>>> -Xms5g >>>> -Xmx5g >>>> -Xmn400m >>> >>> Start by increasing the young gen size as may already have been >>> suggested. ?If you have a test setup where you can experiment, >>> try doubling the young gen size to start. >>> >>> If you have not seen this, it might be helpful. >>> >>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>> >>>> I'm not sure what to make of the concurrent mode >>> >>> The concurrent mode failure is a consequence of the promotion failure. >>> Once the promotion failure happens the concurrent mode failure is >>> inevitable. >>> >>> Jon >>> >>> >>>> . >>> >>> >>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>> >>>> Hi Jon, >>>> >>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>> application yesterday. >>>> The application is running on 4 (homogenous) nodes. 
>>>> >>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>> event during ParNew: >>>> >>>> node-002 >>>> ------- >>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>> 357592K->23382K(368640K), 0.0298150 secs] >>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>> sys=0.01, real=0.03 secs] >>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>> 351062K->39795K(368640K), 0.0401170 secs] >>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>> sys=0.00, real=0.04 secs] >>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>> promotion failure size = 4281460) ?(promotion failed): >>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>> >>>> node-003 >>>> ------- >>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>> 346950K->21342K(368640K), 0.0333090 secs] >>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>> sys=0.00, real=0.03 secs] >>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>> 345070K->32211K(368640K), 0.0369260 secs] >>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>> sys=0.00, real=0.04 secs] >>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>> promotion failure size = 1266955) ?(promotion failed): >>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>> >>>> node-004 >>>> ------- >>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>> 358429K->40960K(368640K), 0.0629910 secs] >>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>> sys=0.02, real=0.06 secs] >>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>> 368640K->40960K(368640K), 0.0819780 secs] >>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>> sys=0.00, real=0.08 secs] >>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>> promotion failure size = 2788662) ?(promotion failed): >>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>> 2012-01-10T18:59:20.539+0100: 
50181.850: [GC 50181.850: [ParNew: >>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>> >>>> On a fourth node, I've found a different event: promotion failure >>>> during CMS, with a much smaller size: >>>> >>>> node-001 >>>> ------- >>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>> 354039K->40960K(368640K), 0.0667340 secs] >>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>> sys=0.01, real=0.06 secs] >>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>> 368640K->40960K(368640K), 0.2586390 secs] >>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>> sys=0.13, real=0.26 secs] >>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>> user=0.07 sys=0.00, real=0.07 secs] >>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>> [CMS-concurrent-abortable-preclean-start] >>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>> 368640K->40960K(368640K), 0.1214420 secs] >>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>> sys=0.05, real=0.12 secs] >>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>> user=10.72 sys=0.48, real=2.70 secs] >>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>> [ParNew (promotion failure size = 1026) ?(promotion failed): >>>> 206521K->206521K(368640K), 0.1667280 secs] >>>> ? 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>> sys=0.04, real=0.17 secs] >>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>> secs]48434.474: [scrub symbol& ? ?string tables, 0.0088370 secs] [1 >>>> CMS-remark: 3489675K(4833280K)] 36961 >>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>> secs] >>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>> ? 
(concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>> sys=0.00, real=8.61 secs] >>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>> >>>> I assume that the large sizes for the promotion failures during ParNew >>>> are confirming that eliminating large array allocations might help >>>> here. Do you agree? >>>> I'm not sure what to make of the concurrent mode failure. >>>> >>>> Thanks in advance for any suggestions, >>>> Taras >>>> >>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>> ? ?wrote: >>>>> >>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>> >>>>>> Hi Jon, >>>>>> >>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>> environment, but so far, no failures yet. >>>>>> We know that the load we generate there is not representative. But >>>>>> perhaps we'll catch something, given enough patience. >>>>>> >>>>>> The flag will also be enabled in our production environment next week >>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>> I'll also do some allocation profiling of the application in isolation >>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>> there. >>>>>> >>>>>> I've got two questions for now: >>>>>> >>>>>> 1) From googling around on the output to expect >>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>> ------- >>>>>> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>> ------- >>>>>> In that example line, what does the "0" stand for? >>>>> >>>>> It's the index of the GC worker thread ?that experienced the promotion >>>>> failure. 
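To put rough numbers on those sizes: they are heap words rather than bytes (as confirmed later in this thread), and a heap word on a 64-bit HotSpot JVM is 8 bytes. So "0: promotion failure size = 2698" means worker thread 0 failed to promote an object of roughly 21 KB, while the 4281460-word failure on node-002 above is roughly 33 MB and the 2788662-word failure on node-004 roughly 21 MB - the same order of magnitude as the char[]/byte[] copies of the 5-30 MB XML documents suspected earlier in this thread.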
>>>>> >>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>> application: >>>>>> ------- >>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>> sys=0.01, real=0.06 secs] >>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>> sys=0.00, real=0.06 secs] >>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>> [CMS-concurrent-preclean-start] >>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>> ? ?CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>> 3432839K->3423755K(5201920 >>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>> 0.0289480 secs]2136605.833: [scrub symbol& ? ? ?string tables, >>>>>> 0.0093940 >>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>> real=7.61 secs] >>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>> [CMS-concurrent-sweep-start] >>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>> ? ?(concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>> sys=0.00, real=10.29 secs] >>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>> ------- >>>>>> >>>>>> In this case I don't know how to interpret the output. >>>>>> a) There's a promotion failure that took 7.49 secs >>>>> >>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>> to >>>>> do recovery >>>>> from the failure. 
>>>>> >>>>>> b) There's a full GC that took 14.08 secs >>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>> >>>>> Not sure about b) and c) because the output is mixed up with the >>>>> concurrent-sweep >>>>> output but ?I think the "concurrent mode failure" message is part of >>>>> the >>>>> "Full GC" >>>>> message. ?My guess is that the 10.29 is the time for the Full GC and >>>>> the >>>>> 14.08 >>>>> maybe is part of the concurrent-sweep message. ?Really hard to be sure. >>>>> >>>>> Jon >>>>>> >>>>>> How are these events, and their (real) times related to each other? >>>>>> >>>>>> Thanks in advance, >>>>>> Taras >>>>>> >>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>> Masamitsu ? ? ?wrote: >>>>>>> >>>>>>> Taras, >>>>>>> >>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>> way to identify the root of your promotion failures (or >>>>>>> at least eliminating some possible causes). ? ?I think it >>>>>>> would help focus the discussion if you could send >>>>>>> the result of that experiment early. >>>>>>> >>>>>>> Jon >>>>>>> >>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>> experiencing occasional promotion failures. >>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>> >>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>> promotion failures are fairly uniform across those. >>>>>>>> >>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>> (b). >>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>> triggered by the same cause), see (c) below. >>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>> this. >>>>>>>> >>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>> small objects), the application also runs a few dozen background >>>>>>>> threads that download and process XML documents, typically in the >>>>>>>> 5-30 >>>>>>>> MB range. >>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>> a >>>>>>>> String/char[]). >>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>> >>>>>>>> My questions are: >>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>> failure? >>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>> a >>>>>>>> feeling for the size of the objects that fail promotion. >>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>> case? 
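For reference, a sketch of the flag set commonly used to watch CMS free-list fragmentation directly - treat it as an illustration to try in a test environment rather than a tuned recommendation:

--------------------------------
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
--------------------------------

PrintFLSStatistics=1 dumps the CMS free-list totals around each collection; a maximum free chunk size that keeps shrinking while the total free space stays large is the classic signature of fragmentation, whereas a small total free space points at plain occupancy pressure instead.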
>>>>>>>> >>>>>>>> Thanks in advance, >>>>>>>> Taras >>>>>>>> >>>>>>>> a) Current JVM options: >>>>>>>> -------------------------------- >>>>>>>> -server >>>>>>>> -Xms5g >>>>>>>> -Xmx5g >>>>>>>> -Xmn400m >>>>>>>> -XX:PermSize=256m >>>>>>>> -XX:MaxPermSize=256m >>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>> -verbose:gc >>>>>>>> -XX:+PrintGCDateStamps >>>>>>>> -XX:+PrintGCDetails >>>>>>>> -XX:SurvivorRatio=8 >>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>> -XX:+UseParNewGC >>>>>>>> -XX:+DisableExplicitGC >>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>> -Xloggc:gc.log >>>>>>>> -------------------------------- >>>>>>>> >>>>>>>> b) Promotion failure during ParNew >>>>>>>> -------------------------------- >>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>> 3505808K->434291K >>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>> 761971K->514584K(5201920K), >>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>> 842264K->625681K(5201920K), >>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>> 953361K->684121K(5201920K), >>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>> -------------------------------- >>>>>>>> >>>>>>>> c) Promotion failure during CMS >>>>>>>> -------------------------------- >>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>> 3293984K(4833280K)] 
3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>> [CMS-concurrent-mark-start] >>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>> secs] >>>>>>>> ? ? 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>> refs >>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>> secs]703031.419: [scrub symbol& ? ? ? ?string tables, 0.0094960 >>>>>>>> secs] [1 CMS >>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>> ? ? 
(concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>> 10.9594300 >>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>> 761117K->517836K(5201920K), >>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>> 845516K->557872K(5201920K), >>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>> 885552K->603017K(5201920K), >>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>> -------------------------------- >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Tue Jan 24 14:21:11 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Tue, 24 Jan 2012 14:21:11 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> Message-ID: <4F1F2ED7.6060308@oracle.com> On 01/24/12 10:15, Taras Tielkes wrote: > Hi Jon, > > Xmx is 5g, PermGen is 256m, new is 400m. > > The overall tenured gen usage is at the point when I would expect the > CMS to kick in though. > Does this mean I'd have to lower the CMS initiating occupancy setting > (currently at 68%)? I don't have any quick answers as to what to try next. > > In addition, are the promotion failure sizes expressed in bytes? If > so, I'm surprised to see such odd-sized (7, for example) sizes. It's in words. Jon > > Thanks, > Taras > > On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu wrote: >> Taras, >> >> The pattern makes sense if the tenured (cms) gen is in fact full. >> Multiple GC workers try to get a chunk of space for >> an allocation and there is no space. 
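For reference, the occupancy knob being asked about above is the existing pair below; whether lowering it actually helps this workload is exactly the open question, so the value shown is only an illustration:

--------------------------------
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=60
--------------------------------

A lower fraction starts the background CMS cycle earlier, trading more frequent concurrent collections for more free space in the old generation at the moment ParNew needs to promote.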
>> >> Jon >> >> >> On 01/24/12 04:22, Taras Tielkes wrote: >>> Hi Jon, >>> >>> While inspecting our production logs for promotion failures, I saw the >>> following one today: >>> -------- >>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>> 349623K->20008K(368640K), 0.0294350 secs] >>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>> sys=0.00, real=0.03 secs] >>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>> 347688K->40960K(368640K), 0.0536700 secs] >>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>> sys=0.00, real=0.05 secs] >>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>> (0: promotion failure size = 6) (1: promotion failure size = 6) (2: >>> promotion failure size = 7) (3: promotion failure size = 7) (4: >>> promotion failure size = 9) (5: p >>> romotion failure size = 9) (6: promotion failure size = 6) (7: >>> promotion failure size = 9) (promotion failed): >>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>> -------- >>> >>> It's different from the others in two ways: >>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>> 2) the very small size of the promoted object >>> >>> Do such an promotion failure pattern ring a bell? It does not make sense >>> to me. >>> >>> Thanks, >>> Taras >>> >>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>> wrote: >>>> Taras, >>>> >>>>> I assume that the large sizes for the promotion failures during ParNew >>>>> are confirming that eliminating large array allocations might help >>>>> here. Do you agree? >>>> I agree that eliminating the large array allocation will help but you >>>> are still having >>>> promotion failures when the allocation size is small (I think it was >>>> 1026). That >>>> says that you are filling up the old (cms) generation faster than the GC >>>> can >>>> collect it. The large arrays are aggrevating the problem but not >>>> necessarily >>>> the cause. >>>> >>>> If these are still your heap sizes, >>>> >>>>> -Xms5g >>>>> -Xmx5g >>>>> -Xmn400m >>>> Start by increasing the young gen size as may already have been >>>> suggested. If you have a test setup where you can experiment, >>>> try doubling the young gen size to start. >>>> >>>> If you have not seen this, it might be helpful. >>>> >>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>> I'm not sure what to make of the concurrent mode >>>> The concurrent mode failure is a consequence of the promotion failure. >>>> Once the promotion failure happens the concurrent mode failure is >>>> inevitable. >>>> >>>> Jon >>>> >>>> >>>>> . >>>> >>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>> Hi Jon, >>>>> >>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>> application yesterday. >>>>> The application is running on 4 (homogenous) nodes. 
>>>>> >>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>> event during ParNew: >>>>> >>>>> node-002 >>>>> ------- >>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>> sys=0.01, real=0.03 secs] >>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>> sys=0.00, real=0.04 secs] >>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>> promotion failure size = 4281460) (promotion failed): >>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>> >>>>> node-003 >>>>> ------- >>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>> sys=0.00, real=0.03 secs] >>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>> sys=0.00, real=0.04 secs] >>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>> promotion failure size = 1266955) (promotion failed): >>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>> >>>>> node-004 >>>>> ------- >>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>> sys=0.02, real=0.06 secs] >>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>> sys=0.00, real=0.08 secs] >>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>> promotion failure size = 2788662) (promotion failed): >>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>> 4.7132220 secs] [Times: user=4.99 
sys=0.01, real=4.72 secs] >>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>> >>>>> On a fourth node, I've found a different event: promotion failure >>>>> during CMS, with a much smaller size: >>>>> >>>>> node-001 >>>>> ------- >>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>> sys=0.01, real=0.06 secs] >>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>> sys=0.13, real=0.26 secs] >>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>> [CMS-concurrent-abortable-preclean-start] >>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>> sys=0.05, real=0.12 secs] >>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>>> [ParNew (promotion failure size = 1026) (promotion failed): >>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>> 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>> sys=0.04, real=0.17 secs] >>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>> secs]48434.474: [scrub symbol& string tables, 0.0088370 secs] [1 >>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>> secs] >>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>> (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>> sys=0.00, real=8.61 secs] >>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>> 0.0411550 secs] [Times: user=0.25 
sys=0.00, real=0.04 secs] >>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>> >>>>> I assume that the large sizes for the promotion failures during ParNew >>>>> are confirming that eliminating large array allocations might help >>>>> here. Do you agree? >>>>> I'm not sure what to make of the concurrent mode failure. >>>>> >>>>> Thanks in advance for any suggestions, >>>>> Taras >>>>> >>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>> wrote: >>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>> Hi Jon, >>>>>>> >>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>> environment, but so far, no failures yet. >>>>>>> We know that the load we generate there is not representative. But >>>>>>> perhaps we'll catch something, given enough patience. >>>>>>> >>>>>>> The flag will also be enabled in our production environment next week >>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>> there. >>>>>>> >>>>>>> I've got two questions for now: >>>>>>> >>>>>>> 1) From googling around on the output to expect >>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>> ------- >>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) (promotion >>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>> ------- >>>>>>> In that example line, what does the "0" stand for? >>>>>> It's the index of the GC worker thread that experienced the promotion >>>>>> failure. 
>>>>>> >>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>> application: >>>>>>> ------- >>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>> sys=0.01, real=0.06 secs] >>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>> sys=0.00, real=0.06 secs] >>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>> [CMS-concurrent-preclean-start] >>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>> CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>> 3432839K->3423755K(5201920 >>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& string tables, >>>>>>> 0.0093940 >>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>> real=7.61 secs] >>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>> [CMS-concurrent-sweep-start] >>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>> (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>> sys=0.00, real=10.29 secs] >>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>> ------- >>>>>>> >>>>>>> In this case I don't know how to interpret the output. >>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>> to >>>>>> do recovery >>>>>> from the failure. 
>>>>>> >>>>>>> b) There's a full GC that took 14.08 secs >>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>> concurrent-sweep >>>>>> output but I think the "concurrent mode failure" message is part of >>>>>> the >>>>>> "Full GC" >>>>>> message. My guess is that the 10.29 is the time for the Full GC and >>>>>> the >>>>>> 14.08 >>>>>> maybe is part of the concurrent-sweep message. Really hard to be sure. >>>>>> >>>>>> Jon >>>>>>> How are these events, and their (real) times related to each other? >>>>>>> >>>>>>> Thanks in advance, >>>>>>> Taras >>>>>>> >>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>> Masamitsu wrote: >>>>>>>> Taras, >>>>>>>> >>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>> way to identify the root of your promotion failures (or >>>>>>>> at least eliminating some possible causes). I think it >>>>>>>> would help focus the discussion if you could send >>>>>>>> the result of that experiment early. >>>>>>>> >>>>>>>> Jon >>>>>>>> >>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>> experiencing occasional promotion failures. >>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>> >>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>> >>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>> (b). >>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>> this. >>>>>>>>> >>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>> 5-30 >>>>>>>>> MB range. >>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>> a >>>>>>>>> String/char[]). >>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>> >>>>>>>>> My questions are: >>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>> failure? >>>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>> a >>>>>>>>> feeling for the size of the objects that fail promotion. >>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>> case? 
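On the byte[]-then-String/char[] double copy of the downloaded XML described above: the application's parsing code is not shown in this thread, so the following is only a sketch of a streaming alternative (StAX used purely as an example parser) that avoids materializing a 30-60 MB array per document in the first place:

--------------------------------
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;

public class StreamingXmlSketch {
    // Parse directly from the download stream instead of buffering the whole
    // document into a byte[] and then copying it again into a String/char[].
    public static void parse(InputStream in) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamReader.START_ELEMENT) {
                    // handle reader.getLocalName() incrementally here
                }
            }
        } finally {
            reader.close();
        }
    }
}
--------------------------------

Keeping the per-document allocations small and short-lived should remove the multi-megabyte promotions that the CMS old generation currently has to find contiguous free space for.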
>>>>>>>>> >>>>>>>>> Thanks in advance, >>>>>>>>> Taras >>>>>>>>> >>>>>>>>> a) Current JVM options: >>>>>>>>> -------------------------------- >>>>>>>>> -server >>>>>>>>> -Xms5g >>>>>>>>> -Xmx5g >>>>>>>>> -Xmn400m >>>>>>>>> -XX:PermSize=256m >>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>> -verbose:gc >>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>> -XX:+PrintGCDetails >>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>> -XX:+UseParNewGC >>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>> -Xloggc:gc.log >>>>>>>>> -------------------------------- >>>>>>>>> >>>>>>>>> b) Promotion failure during ParNew >>>>>>>>> -------------------------------- >>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>> 3505808K->434291K >>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>> -------------------------------- >>>>>>>>> >>>>>>>>> c) Promotion failure during CMS >>>>>>>>> -------------------------------- >>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>> 
2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>> secs] >>>>>>>>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>> refs >>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>> secs]703031.419: [scrub symbol& string tables, 0.0094960 >>>>>>>>> secs] [1 CMS >>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>> (concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>> 10.9594300 >>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>> 885552K->603017K(5201920K), >>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>> -------------------------------- >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing 
list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From karmazilla at gmail.com Wed Jan 25 01:37:28 2012 From: karmazilla at gmail.com (Christian Vest Hansen) Date: Wed, 25 Jan 2012 10:37:28 +0100 Subject: JVM Crash during GC In-Reply-To: <4DD29C64.8060501@oracle.com> References: <4DD29C64.8060501@oracle.com> Message-ID: Hi, How do you enable heap verification? Is there a place where I can read more about what it does? On Tue, May 17, 2011 at 18:03, Y. S. Ramakrishna wrote: > Hi Shane, that's 6u18 which is about 18 months old. Could you try > the latest, 6u25, and see if the problem reproduces? > > The crash is somewhat generic in that we crash when scanning > cards during a scavenge, presumably running across a bad pointer. > > If you need to stick with that JVM, you can try turning off > compressed oops explicitly, and/or enable heap verification > to see if it catches anything sooner. > > If the problem reproduces with the latest bits, we'd definitely > be interested in a formal bug report with a test case. > > -- ramki > > On 05/17/11 08:46, Shane Cox wrote: > > Has anyone seen a JVM crash similar to this one? Wondering if this is a > > new or existing problem. Any insights would be appreciated. 
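On the heap-verification question above: in HotSpot builds of this era it is normally switched on with diagnostic flags along the lines of the sketch below (not verified against 6u18 specifically), and compressed oops can be disabled explicitly with -XX:-UseCompressedOops:

--------------------------------
-XX:+UnlockDiagnosticVMOptions -XX:+VerifyBeforeGC -XX:+VerifyAfterGC
--------------------------------

Roughly speaking, verification walks the heap before and/or after each collection and fails fast on a dangling or corrupt pointer, so a crash like the one quoted below tends to be caught closer to its origin - at the cost of pauses far too long for production use.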
> > > > Thanks, > > Shane > > > > > > # A fatal error has been detected by the Java Runtime Environment: > > # > > # SIGSEGV (0xb) at pc=0x00002b0f1733cdc9, pid=14532, tid=1093286208 > > # > > # SIGSEGV (0xb) at pc=0x00002b0f1733cdc9, pid=14532, tid=1093286208 > > # > > # JRE version: 6.0_18-b07 > > # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode > > linux-amd64 ) > > # Problematic frame: > > # V [libjvm.so+0x3b1dc9] > > > > > > Current thread (0x0000000056588800): GCTaskThread [stack: > > 0x0000000000000000,0x0000000000000000] [id=14539] > > > > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), > > si_addr=0x0000000000000025;; > > > > > > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, > > C=native code) > > V [libjvm.so+0x3b1dc9];; void > > ParScanClosure::do_oop_work(oopDesc**, bool, bool)+0x79 > > V [libjvm.so+0x5e7f03];; > > ParRootScanWithBarrierTwoGensClosure::do_oop(oopDesc**)+0x13 > > V [libjvm.so+0x3ab18c];; instanceKlass::oop_oop_iterate_nv_m(oopDesc*, > > FilteringClosure*, MemRegion)+0x16c > > V [libjvm.so+0x297aff];; > > FreeListSpace_DCTOC::walk_mem_region_with_cl_par(MemRegion, HeapWord*, > > HeapWord*, FilteringClosure*)+0x13f > > V [libjvm.so+0x297995];; > > FreeListSpace_DCTOC::walk_mem_region_with_cl(MemRegion, HeapWord*, > > HeapWord*, FilteringClosure*)+0x35 > > V [libjvm.so+0x66014f];; Filtering_DCTOC::walk_mem_region(MemRegion, > > HeapWord*, HeapWord*)+0x5f > > V [libjvm.so+0x65fee9];; > > DirtyCardToOopClosure::do_MemRegion(MemRegion)+0xf9 > > V [libjvm.so+0x24153d];; > > ClearNoncleanCardWrapper::do_MemRegion(MemRegion)+0xdd > > V [libjvm.so+0x23ffea];; > > CardTableModRefBS::non_clean_card_iterate_work(MemRegion, > > MemRegionClosure*, bool)+0x1ca > > V [libjvm.so+0x5e504b];; CardTableModRefBS::process_stride(Space*, > > MemRegion, int, int, DirtyCardToOopClosure*, MemRegionClosure*, bool, > > signed char**, unsigned long, unsigned long)+0x13b > > V [libjvm.so+0x5e4e98];; > > CardTableModRefBS::par_non_clean_card_iterate_work(Space*, MemRegion, > > DirtyCardToOopClosure*, MemRegionClosure*, bool, int)+0xc8 > > V [libjvm.so+0x23fdfb];; > > CardTableModRefBS::non_clean_card_iterate(Space*, MemRegion, > > DirtyCardToOopClosure*, MemRegionClosure*, bool)+0x5b > > V [libjvm.so+0x240b9a];; > > CardTableRS::younger_refs_in_space_iterate(Space*, > OopsInGenClosure*)+0x8a > > V [libjvm.so+0x379378];; > > Generation::younger_refs_in_space_iterate(Space*, OopsInGenClosure*)+0x18 > > V [libjvm.so+0x2c5c5f];; > > > ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*)+0x4f > > V [libjvm.so+0x240a8a];; > > CardTableRS::younger_refs_iterate(Generation*, OopsInGenClosure*)+0x2a > > V [libjvm.so+0x36bfcd];; > > GenCollectedHeap::gen_process_strong_roots(int, bool, bool, > > SharedHeap::ScanningOption, OopsInGenClosure*, OopsInGenClosure*)+0x9d > > V [libjvm.so+0x5e82c9];; ParNewGenTask::work(int)+0xc9 > > V [libjvm.so+0x722e0d];; GangWorker::loop()+0xad > > V [libjvm.so+0x722d24];; GangWorker::run()+0x24 > > V [libjvm.so+0x5da2af];; java_start(Thread*)+0x13f > > > > VM Arguments: > > jvm_args: -verbose:gc -XX:+PrintGCDetails -XX:+PrintHeapAtGC > > -XX:+PrintGCDateStamps -XX:+UseParNewGC -Xmx4000m -Xms4000m -Xss256k > > -XX:PermSize=256M -XX:MaxPermSize=256M -XX:+UseConcMarkSweepGC > > -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing > > -XX:+CMSPermGenSweepingEnabled -XX:+ExplicitGCInvokesConcurrent > > > > OS:Red Hat Enterprise Linux Server release 5.3 (Tikanga) > > > > uname:Linux 2.6.18-128.el5 #1 SMP Wed Jan 
21 08:45:05 EST 2009 x86_64 > > > > vm_info: Java HotSpot(TM) 64-Bit Server VM (16.0-b13) for linux-amd64 > > JRE (1.6.0_18-b07), built on Dec 17 2009 13:42:22 by "java_re" with gcc > > 3.2.2 (SuSE Linux) > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > hotspot-gc-use mailing list > > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -- Venlig hilsen / Kind regards, Christian Vest Hansen. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120125/fb2235b4/attachment.html From taras.tielkes at gmail.com Wed Jan 25 02:41:54 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Wed, 25 Jan 2012 11:41:54 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F1F2ED7.6060308@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> <4F1F2ED7.6060308@oracle.com> Message-ID: Hi Jon, At the risk of asking a stupid question, what's the word size on x64 when using CompressedOops? Thanks, Taras On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu wrote: > > > On 01/24/12 10:15, Taras Tielkes wrote: >> Hi Jon, >> >> Xmx is 5g, PermGen is 256m, new is 400m. >> >> The overall tenured gen usage is at the point when I would expect the >> CMS to kick in though. >> Does this mean I'd have to lower the CMS initiating occupancy setting >> (currently at 68%)? > > I don't have any quick answers as to what to try next. > >> >> In addition, are the promotion failure sizes expressed in bytes? If >> so, I'm surprised to see such odd-sized (7, for example) sizes. > > It's in words. > > Jon >> >> Thanks, >> Taras >> >> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu ?wrote: >>> Taras, >>> >>> The pattern makes sense if the tenured (cms) gen is in fact full. >>> Multiple ?GC workers try to get a chunk of space for >>> an allocation and there is no space. 
>>> >>> Jon >>> >>> >>> On 01/24/12 04:22, Taras Tielkes wrote: >>>> Hi Jon, >>>> >>>> While inspecting our production logs for promotion failures, I saw the >>>> following one today: >>>> -------- >>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>>> 349623K->20008K(368640K), 0.0294350 secs] >>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>>> sys=0.00, real=0.03 secs] >>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>>> 347688K->40960K(368640K), 0.0536700 secs] >>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>>> sys=0.00, real=0.05 secs] >>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>>> (0: promotion failure size = 6) ?(1: promotion failure size = 6) ?(2: >>>> promotion failure size = 7) ?(3: promotion failure size = 7) ?(4: >>>> promotion failure size = 9) ?(5: p >>>> romotion failure size = 9) ?(6: promotion failure size = 6) ?(7: >>>> promotion failure size = 9) ?(promotion failed): >>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>> -------- >>>> >>>> It's different from the others in two ways: >>>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>>> 2) the very small size of the promoted object >>>> >>>> Do such an promotion failure pattern ring a bell? It does not make sense >>>> to me. >>>> >>>> Thanks, >>>> Taras >>>> >>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>>> ? wrote: >>>>> Taras, >>>>> >>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>> are confirming that eliminating large array allocations might help >>>>>> here. Do you agree? >>>>> I agree that eliminating the large array allocation will help but you >>>>> are still having >>>>> promotion failures when the allocation size is small (I think it was >>>>> 1026). ?That >>>>> says that you are filling up the old (cms) generation faster than the GC >>>>> can >>>>> collect it. ?The large arrays are aggrevating the problem but not >>>>> necessarily >>>>> the cause. >>>>> >>>>> If these are still your heap sizes, >>>>> >>>>>> -Xms5g >>>>>> -Xmx5g >>>>>> -Xmn400m >>>>> Start by increasing the young gen size as may already have been >>>>> suggested. ?If you have a test setup where you can experiment, >>>>> try doubling the young gen size to start. >>>>> >>>>> If you have not seen this, it might be helpful. >>>>> >>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>>> I'm not sure what to make of the concurrent mode >>>>> The concurrent mode failure is a consequence of the promotion failure. >>>>> Once the promotion failure happens the concurrent mode failure is >>>>> inevitable. >>>>> >>>>> Jon >>>>> >>>>> >>>>>> . >>>>> >>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>>> Hi Jon, >>>>>> >>>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>>> application yesterday. 
>>>>>> The application is running on 4 (homogenous) nodes. >>>>>> >>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>>> event during ParNew: >>>>>> >>>>>> node-002 >>>>>> ------- >>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>>> sys=0.01, real=0.03 secs] >>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>>> sys=0.00, real=0.04 secs] >>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>>> promotion failure size = 4281460) ?(promotion failed): >>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>>> >>>>>> node-003 >>>>>> ------- >>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>>> sys=0.00, real=0.03 secs] >>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>>> sys=0.00, real=0.04 secs] >>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>>> promotion failure size = 1266955) ?(promotion failed): >>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>>> >>>>>> node-004 >>>>>> ------- >>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>>> sys=0.02, real=0.06 secs] >>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>>> sys=0.00, real=0.08 secs] >>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>>> promotion failure size = 2788662) ?(promotion failed): >>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>>> 3310044K->330922K(4833280K), 4.5104170 
secs] >>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>>> >>>>>> On a fourth node, I've found a different event: promotion failure >>>>>> during CMS, with a much smaller size: >>>>>> >>>>>> node-001 >>>>>> ------- >>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>>> sys=0.01, real=0.06 secs] >>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>>> sys=0.13, real=0.26 secs] >>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>>> sys=0.05, real=0.12 secs] >>>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>>>> [ParNew (promotion failure size = 1026) ?(promotion failed): >>>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>>> ? ?3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>>> sys=0.04, real=0.17 secs] >>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>>> secs]48434.474: [scrub symbol& ? ? ?string tables, 0.0088370 secs] [1 >>>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>>> secs] >>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>>> ? 
?(concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>>> sys=0.00, real=8.61 secs] >>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>>> >>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>> are confirming that eliminating large array allocations might help >>>>>> here. Do you agree? >>>>>> I'm not sure what to make of the concurrent mode failure. >>>>>> >>>>>> Thanks in advance for any suggestions, >>>>>> Taras >>>>>> >>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>>> ? ? wrote: >>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>>> Hi Jon, >>>>>>>> >>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>>> environment, but so far, no failures yet. >>>>>>>> We know that the load we generate there is not representative. But >>>>>>>> perhaps we'll catch something, given enough patience. >>>>>>>> >>>>>>>> The flag will also be enabled in our production environment next week >>>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>>> there. >>>>>>>> >>>>>>>> I've got two questions for now: >>>>>>>> >>>>>>>> 1) From googling around on the output to expect >>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>>> ------- >>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>>> ------- >>>>>>>> In that example line, what does the "0" stand for? >>>>>>> It's the index of the GC worker thread ?that experienced the promotion >>>>>>> failure. 
>>>>>>> >>>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>>> application: >>>>>>>> ------- >>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>>> sys=0.00, real=0.06 secs] >>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>> ? ? CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>>> 3432839K->3423755K(5201920 >>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& ? ? ? ?string tables, >>>>>>>> 0.0093940 >>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>>> real=7.61 secs] >>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>>> ? ? (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>>> sys=0.00, real=10.29 secs] >>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>>> ------- >>>>>>>> >>>>>>>> In this case I don't know how to interpret the output. 
>>>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>>> to >>>>>>> do recovery >>>>>>> from the failure. >>>>>>> >>>>>>>> b) There's a full GC that took 14.08 secs >>>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>>> concurrent-sweep >>>>>>> output but ?I think the "concurrent mode failure" message is part of >>>>>>> the >>>>>>> "Full GC" >>>>>>> message. ?My guess is that the 10.29 is the time for the Full GC and >>>>>>> the >>>>>>> 14.08 >>>>>>> maybe is part of the concurrent-sweep message. ?Really hard to be sure. >>>>>>> >>>>>>> Jon >>>>>>>> How are these events, and their (real) times related to each other? >>>>>>>> >>>>>>>> Thanks in advance, >>>>>>>> Taras >>>>>>>> >>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>>> Masamitsu ? ? ? ?wrote: >>>>>>>>> Taras, >>>>>>>>> >>>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>>> way to identify the root of your promotion failures (or >>>>>>>>> at least eliminating some possible causes). ? ?I think it >>>>>>>>> would help focus the discussion if you could send >>>>>>>>> the result of that experiment early. >>>>>>>>> >>>>>>>>> Jon >>>>>>>>> >>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>>> experiencing occasional promotion failures. >>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>>> >>>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>>> >>>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>>> (b). >>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>>> this. >>>>>>>>>> >>>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>>> 5-30 >>>>>>>>>> MB range. >>>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>>> a >>>>>>>>>> String/char[]). >>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>>> >>>>>>>>>> My questions are: >>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>>> failure? >>>>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>>> a >>>>>>>>>> feeling for the size of the objects that fail promotion. 
>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>>> case? >>>>>>>>>> >>>>>>>>>> Thanks in advance, >>>>>>>>>> Taras >>>>>>>>>> >>>>>>>>>> a) Current JVM options: >>>>>>>>>> -------------------------------- >>>>>>>>>> -server >>>>>>>>>> -Xms5g >>>>>>>>>> -Xmx5g >>>>>>>>>> -Xmn400m >>>>>>>>>> -XX:PermSize=256m >>>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>>> -verbose:gc >>>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>>> -XX:+PrintGCDetails >>>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>>> -XX:+UseParNewGC >>>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>>> -Xloggc:gc.log >>>>>>>>>> -------------------------------- >>>>>>>>>> >>>>>>>>>> b) Promotion failure during ParNew >>>>>>>>>> -------------------------------- >>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>>> 3505808K->434291K >>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>>> -------------------------------- >>>>>>>>>> >>>>>>>>>> c) Promotion failure during CMS >>>>>>>>>> -------------------------------- >>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>>> 
2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>>> secs] >>>>>>>>>> ? ? ?3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>>> refs >>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>>> secs]703031.419: [scrub symbol& ? ? ? ? ?string tables, 0.0094960 >>>>>>>>>> secs] [1 CMS >>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>>> ? ? 
?(concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>>> 10.9594300 >>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>>> 885552K->603017K(5201920K), >>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>>> -------------------------------- >>>>>>>>>> _______________________________________________ >>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Wed Jan 25 22:49:49 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Wed, 25 Jan 2012 22:49:49 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> <4F1F2ED7.6060308@oracle.com> Message-ID: <4F20F78D.9070905@oracle.com> On 1/25/2012 2:41 AM, Taras Tielkes wrote: > Hi Jon, > > At the risk of asking a stupid question, what's the word size on x64 > when using CompressedOops? Word size is the same with and without CompressedOops (8 bytes). With CompressedOops we can just point to words with a 32bit reference (i.e., map the 32bit reference to a full 64bit address). 
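To make the units concrete, here is a quick back-of-the-envelope check. It is just an illustrative sketch (the class name is made up, and the 4281460 value is simply copied from one of the "promotion failure size" log lines earlier in this thread), assuming the 8-byte heap word described above:

-------
// Illustrative snippet only, not part of the JDK or of the application.
// Promotion failure sizes in the GC log are reported in heap words.
// A heap word is 8 bytes on x64, with or without CompressedOops.
public class PromotionFailureSize {
    private static final long HEAP_WORD_BYTES = 8L;

    public static void main(String[] args) {
        // Taken from an earlier log line: "(4: promotion failure size = 4281460)"
        long failureSizeWords = 4281460L;
        long failureSizeBytes = failureSizeWords * HEAP_WORD_BYTES;
        // Prints roughly 32.7 MB for this example.
        System.out.println(failureSizeBytes / (1024.0 * 1024.0) + " MB");
    }
}
-------

The same arithmetic applies to any of the failure sizes in your logs: multiply the reported word count by 8 to get the allocation size in bytes.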
Jon > Thanks, > Taras > > On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu > wrote: >> On 01/24/12 10:15, Taras Tielkes wrote: >>> Hi Jon, >>> >>> Xmx is 5g, PermGen is 256m, new is 400m. >>> >>> The overall tenured gen usage is at the point when I would expect the >>> CMS to kick in though. >>> Does this mean I'd have to lower the CMS initiating occupancy setting >>> (currently at 68%)? >> I don't have any quick answers as to what to try next. >> >>> In addition, are the promotion failure sizes expressed in bytes? If >>> so, I'm surprised to see such odd-sized (7, for example) sizes. >> It's in words. >> >> Jon >>> Thanks, >>> Taras >>> >>> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu wrote: >>>> Taras, >>>> >>>> The pattern makes sense if the tenured (cms) gen is in fact full. >>>> Multiple GC workers try to get a chunk of space for >>>> an allocation and there is no space. >>>> >>>> Jon >>>> >>>> >>>> On 01/24/12 04:22, Taras Tielkes wrote: >>>>> Hi Jon, >>>>> >>>>> While inspecting our production logs for promotion failures, I saw the >>>>> following one today: >>>>> -------- >>>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>>>> 349623K->20008K(368640K), 0.0294350 secs] >>>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>>>> sys=0.00, real=0.03 secs] >>>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>>>> 347688K->40960K(368640K), 0.0536700 secs] >>>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>>>> sys=0.00, real=0.05 secs] >>>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>>>> (0: promotion failure size = 6) (1: promotion failure size = 6) (2: >>>>> promotion failure size = 7) (3: promotion failure size = 7) (4: >>>>> promotion failure size = 9) (5: p >>>>> romotion failure size = 9) (6: promotion failure size = 6) (7: >>>>> promotion failure size = 9) (promotion failed): >>>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>>>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>> -------- >>>>> >>>>> It's different from the others in two ways: >>>>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>>>> 2) the very small size of the promoted object >>>>> >>>>> Do such an promotion failure pattern ring a bell? It does not make sense >>>>> to me. >>>>> >>>>> Thanks, >>>>> Taras >>>>> >>>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>>>> wrote: >>>>>> Taras, >>>>>> >>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>> are confirming that eliminating large array allocations might help >>>>>>> here. Do you agree? >>>>>> I agree that eliminating the large array allocation will help but you >>>>>> are still having >>>>>> promotion failures when the allocation size is small (I think it was >>>>>> 1026). That >>>>>> says that you are filling up the old (cms) generation faster than the GC >>>>>> can >>>>>> collect it. 
The large arrays are aggrevating the problem but not >>>>>> necessarily >>>>>> the cause. >>>>>> >>>>>> If these are still your heap sizes, >>>>>> >>>>>>> -Xms5g >>>>>>> -Xmx5g >>>>>>> -Xmn400m >>>>>> Start by increasing the young gen size as may already have been >>>>>> suggested. If you have a test setup where you can experiment, >>>>>> try doubling the young gen size to start. >>>>>> >>>>>> If you have not seen this, it might be helpful. >>>>>> >>>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>>>> I'm not sure what to make of the concurrent mode >>>>>> The concurrent mode failure is a consequence of the promotion failure. >>>>>> Once the promotion failure happens the concurrent mode failure is >>>>>> inevitable. >>>>>> >>>>>> Jon >>>>>> >>>>>> >>>>>>> . >>>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>>>> Hi Jon, >>>>>>> >>>>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>>>> application yesterday. >>>>>>> The application is running on 4 (homogenous) nodes. >>>>>>> >>>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>>>> event during ParNew: >>>>>>> >>>>>>> node-002 >>>>>>> ------- >>>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>>>> sys=0.01, real=0.03 secs] >>>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>>>> sys=0.00, real=0.04 secs] >>>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>>>> promotion failure size = 4281460) (promotion failed): >>>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>>>> >>>>>>> node-003 >>>>>>> ------- >>>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>>>> sys=0.00, real=0.03 secs] >>>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>>>> sys=0.00, real=0.04 secs] >>>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>>>> promotion failure size = 1266955) (promotion failed): >>>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), 
>>>>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>>>> >>>>>>> node-004 >>>>>>> ------- >>>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>>>> sys=0.02, real=0.06 secs] >>>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>>>> sys=0.00, real=0.08 secs] >>>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>>>> promotion failure size = 2788662) (promotion failed): >>>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>>>> >>>>>>> On a fourth node, I've found a different event: promotion failure >>>>>>> during CMS, with a much smaller size: >>>>>>> >>>>>>> node-001 >>>>>>> ------- >>>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>>>> sys=0.01, real=0.06 secs] >>>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>>>> sys=0.13, real=0.26 secs] >>>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>>>> sys=0.05, real=0.12 secs] >>>>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>>>>> [ParNew 
(promotion failure size = 1026) (promotion failed): >>>>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>>>> 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>>>> sys=0.04, real=0.17 secs] >>>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>>>> secs]48434.474: [scrub symbol& string tables, 0.0088370 secs] [1 >>>>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>>>> secs] >>>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>>>> (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>>>> sys=0.00, real=8.61 secs] >>>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>>>> >>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>> are confirming that eliminating large array allocations might help >>>>>>> here. Do you agree? >>>>>>> I'm not sure what to make of the concurrent mode failure. >>>>>>> >>>>>>> Thanks in advance for any suggestions, >>>>>>> Taras >>>>>>> >>>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>>>> wrote: >>>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>>>> Hi Jon, >>>>>>>>> >>>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>>>> environment, but so far, no failures yet. >>>>>>>>> We know that the load we generate there is not representative. But >>>>>>>>> perhaps we'll catch something, given enough patience. >>>>>>>>> >>>>>>>>> The flag will also be enabled in our production environment next week >>>>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>>>> there. >>>>>>>>> >>>>>>>>> I've got two questions for now: >>>>>>>>> >>>>>>>>> 1) From googling around on the output to expect >>>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>>>> ------- >>>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) (promotion >>>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>>>> ------- >>>>>>>>> In that example line, what does the "0" stand for? >>>>>>>> It's the index of the GC worker thread that experienced the promotion >>>>>>>> failure. 
>>>>>>>> >>>>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>>>> application: >>>>>>>>> ------- >>>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>>>> sys=0.00, real=0.06 secs] >>>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>> CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>>>> 3432839K->3423755K(5201920 >>>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& string tables, >>>>>>>>> 0.0093940 >>>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>>>> real=7.61 secs] >>>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>>>> (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>>>> sys=0.00, real=10.29 secs] >>>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>>>> ------- >>>>>>>>> >>>>>>>>> In this case I don't know how to interpret the output. 
>>>>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>>>> to >>>>>>>> do recovery >>>>>>>> from the failure. >>>>>>>> >>>>>>>>> b) There's a full GC that took 14.08 secs >>>>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>>>> concurrent-sweep >>>>>>>> output but I think the "concurrent mode failure" message is part of >>>>>>>> the >>>>>>>> "Full GC" >>>>>>>> message. My guess is that the 10.29 is the time for the Full GC and >>>>>>>> the >>>>>>>> 14.08 >>>>>>>> maybe is part of the concurrent-sweep message. Really hard to be sure. >>>>>>>> >>>>>>>> Jon >>>>>>>>> How are these events, and their (real) times related to each other? >>>>>>>>> >>>>>>>>> Thanks in advance, >>>>>>>>> Taras >>>>>>>>> >>>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>>>> Masamitsu wrote: >>>>>>>>>> Taras, >>>>>>>>>> >>>>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>>>> way to identify the root of your promotion failures (or >>>>>>>>>> at least eliminating some possible causes). I think it >>>>>>>>>> would help focus the discussion if you could send >>>>>>>>>> the result of that experiment early. >>>>>>>>>> >>>>>>>>>> Jon >>>>>>>>>> >>>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>>>> experiencing occasional promotion failures. >>>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>>>> >>>>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>>>> >>>>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>>>> (b). >>>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>>>> this. >>>>>>>>>>> >>>>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>>>> 5-30 >>>>>>>>>>> MB range. >>>>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>>>> a >>>>>>>>>>> String/char[]). >>>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>>>> >>>>>>>>>>> My questions are: >>>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>>>> failure? >>>>>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>>>> a >>>>>>>>>>> feeling for the size of the objects that fail promotion. 
>>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>>>> case? >>>>>>>>>>> >>>>>>>>>>> Thanks in advance, >>>>>>>>>>> Taras >>>>>>>>>>> >>>>>>>>>>> a) Current JVM options: >>>>>>>>>>> -------------------------------- >>>>>>>>>>> -server >>>>>>>>>>> -Xms5g >>>>>>>>>>> -Xmx5g >>>>>>>>>>> -Xmn400m >>>>>>>>>>> -XX:PermSize=256m >>>>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>>>> -verbose:gc >>>>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>>>> -XX:+PrintGCDetails >>>>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>>>> -XX:+UseParNewGC >>>>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>>>> -Xloggc:gc.log >>>>>>>>>>> -------------------------------- >>>>>>>>>>> >>>>>>>>>>> b) Promotion failure during ParNew >>>>>>>>>>> -------------------------------- >>>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>>>> 3505808K->434291K >>>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>>>> -------------------------------- >>>>>>>>>>> >>>>>>>>>>> c) Promotion failure during CMS >>>>>>>>>>> -------------------------------- >>>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: 
user=0.24 >>>>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>>>> secs] >>>>>>>>>>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>>>> refs >>>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>>>> secs]703031.419: [scrub symbol& string tables, 0.0094960 >>>>>>>>>>> secs] [1 CMS >>>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>>>> (concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>>>> 10.9594300 >>>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>>>> 885552K->603017K(5201920K), 
>>>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>>>> -------------------------------- >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>>> _______________________________________________ >>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From taras.tielkes at gmail.com Thu Jan 26 12:22:14 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Thu, 26 Jan 2012 21:22:14 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F20F78D.9070905@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> <4F1F2ED7.6060308@oracle.com> <4F20F78D.9070905@oracle.com> Message-ID: Hi Jon, Thanks for clearing up the word size question. Over the past weeks, I've seen promotion failures exceeding 10M words, so arrays of over 80 megabytes in size :) Today, we deployed a production release that eliminates the huge buffer allocations - instead data is processed in a streaming fashion. We'll see how things are performing after collecting operations data for a few weeks. Thanks for your help, Taras On Thu, Jan 26, 2012 at 7:49 AM, Jon Masamitsu wrote: > > > On 1/25/2012 2:41 AM, Taras Tielkes wrote: >> Hi Jon, >> >> At the risk of asking a stupid question, what's the word size on x64 >> when using CompressedOops? > > Word size is the same with and without CompressedOops (8 bytes). ?With > CompressedOops > we can just point to words with a 32bit reference (i.e., map the 32bit > reference to a full > 64bit address). > > Jon > >> Thanks, >> Taras >> >> On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu >> ?wrote: >>> On 01/24/12 10:15, Taras Tielkes wrote: >>>> Hi Jon, >>>> >>>> Xmx is 5g, PermGen is 256m, new is 400m. >>>> >>>> The overall tenured gen usage is at the point when I would expect the >>>> CMS to kick in though. 
>>>> Does this mean I'd have to lower the CMS initiating occupancy setting >>>> (currently at 68%)? >>> I don't have any quick answers as to what to try next. >>> >>>> In addition, are the promotion failure sizes expressed in bytes? If >>>> so, I'm surprised to see such odd-sized (7, for example) sizes. >>> It's in words. >>> >>> Jon >>>> Thanks, >>>> Taras >>>> >>>> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu ? ?wrote: >>>>> Taras, >>>>> >>>>> The pattern makes sense if the tenured (cms) gen is in fact full. >>>>> Multiple ?GC workers try to get a chunk of space for >>>>> an allocation and there is no space. >>>>> >>>>> Jon >>>>> >>>>> >>>>> On 01/24/12 04:22, Taras Tielkes wrote: >>>>>> Hi Jon, >>>>>> >>>>>> While inspecting our production logs for promotion failures, I saw the >>>>>> following one today: >>>>>> -------- >>>>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>>>>> 349623K->20008K(368640K), 0.0294350 secs] >>>>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>>>>> sys=0.00, real=0.03 secs] >>>>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>>>>> 347688K->40960K(368640K), 0.0536700 secs] >>>>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>>>>> sys=0.00, real=0.05 secs] >>>>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>>>>> (0: promotion failure size = 6) ?(1: promotion failure size = 6) ?(2: >>>>>> promotion failure size = 7) ?(3: promotion failure size = 7) ?(4: >>>>>> promotion failure size = 9) ?(5: p >>>>>> romotion failure size = 9) ?(6: promotion failure size = 6) ?(7: >>>>>> promotion failure size = 9) ?(promotion failed): >>>>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>>>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>>>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>>>>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>>>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>>>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>>>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>>>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>>>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>> -------- >>>>>> >>>>>> It's different from the others in two ways: >>>>>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>>>>> 2) the very small size of the promoted object >>>>>> >>>>>> Do such an promotion failure pattern ring a bell? It does not make sense >>>>>> to me. >>>>>> >>>>>> Thanks, >>>>>> Taras >>>>>> >>>>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>>>>> ? ?wrote: >>>>>>> Taras, >>>>>>> >>>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>>> are confirming that eliminating large array allocations might help >>>>>>>> here. Do you agree? >>>>>>> I agree that eliminating the large array allocation will help but you >>>>>>> are still having >>>>>>> promotion failures when the allocation size is small (I think it was >>>>>>> 1026). ?That >>>>>>> says that you are filling up the old (cms) generation faster than the GC >>>>>>> can >>>>>>> collect it. ?The large arrays are aggrevating the problem but not >>>>>>> necessarily >>>>>>> the cause. 
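Since the failure sizes above are reported in heap words, and a heap word on x64 is 8 bytes with or without CompressedOops, the values in the logs convert to bytes by multiplying by 8: the 6-to-9-word failures in the parallel ParNew case are tiny objects, while the multi-million-word failures elsewhere in this thread correspond to tens of megabytes. A minimal sketch of that conversion follows; the class name and the hard-coded sample values (taken from the GC logs quoted in this thread) are for illustration only.
--------------------------------
// Convert "promotion failure size" values (reported in heap words) to
// bytes/megabytes on a 64-bit HotSpot JVM, where one heap word is 8 bytes
// regardless of CompressedOops.
public class PromotionFailureSize {
    private static final long BYTES_PER_WORD = 8;

    public static void main(String[] args) {
        // Sample failure sizes (in words) taken from the logs in this thread.
        long[] sizesInWords = {6L, 1026L, 1266955L, 2788662L, 4281460L};
        for (long words : sizesInWords) {
            long bytes = words * BYTES_PER_WORD;
            System.out.printf("%,d words = %,d bytes (~%.2f MB)%n",
                    words, bytes, bytes / (1024.0 * 1024.0));
        }
    }
}
--------------------------------
Sizes in the tens of megabytes point at the large byte[]/char[] XML buffers, while the single-digit sizes only fail because the old generation has essentially no free space at all, which matches Jon's reading above.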
>>>>>>> >>>>>>> If these are still your heap sizes, >>>>>>> >>>>>>>> -Xms5g >>>>>>>> -Xmx5g >>>>>>>> -Xmn400m >>>>>>> Start by increasing the young gen size as may already have been >>>>>>> suggested. ?If you have a test setup where you can experiment, >>>>>>> try doubling the young gen size to start. >>>>>>> >>>>>>> If you have not seen this, it might be helpful. >>>>>>> >>>>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>>>>> I'm not sure what to make of the concurrent mode >>>>>>> The concurrent mode failure is a consequence of the promotion failure. >>>>>>> Once the promotion failure happens the concurrent mode failure is >>>>>>> inevitable. >>>>>>> >>>>>>> Jon >>>>>>> >>>>>>> >>>>>>>> . >>>>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>>>>> Hi Jon, >>>>>>>> >>>>>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>>>>> application yesterday. >>>>>>>> The application is running on 4 (homogenous) nodes. >>>>>>>> >>>>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>>>>> event during ParNew: >>>>>>>> >>>>>>>> node-002 >>>>>>>> ------- >>>>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>>>>> sys=0.01, real=0.03 secs] >>>>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>>>>> promotion failure size = 4281460) ?(promotion failed): >>>>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>>>>> >>>>>>>> node-003 >>>>>>>> ------- >>>>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>>>>> promotion failure size = 1266955) ?(promotion failed): >>>>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>>>>>> 0.0489930 
secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>>>>> >>>>>>>> node-004 >>>>>>>> ------- >>>>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>>>>> sys=0.02, real=0.06 secs] >>>>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>>>>> sys=0.00, real=0.08 secs] >>>>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>>>>> promotion failure size = 2788662) ?(promotion failed): >>>>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>>>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>>>>> >>>>>>>> On a fourth node, I've found a different event: promotion failure >>>>>>>> during CMS, with a much smaller size: >>>>>>>> >>>>>>>> node-001 >>>>>>>> ------- >>>>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>>>>> sys=0.13, real=0.26 secs] >>>>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>>>>> sys=0.05, real=0.12 secs] >>>>>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 
48434.081: [GC 48434.081: >>>>>>>> [ParNew (promotion failure size = 1026) ?(promotion failed): >>>>>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>>>>> ? ? 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>>>>> sys=0.04, real=0.17 secs] >>>>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>>>>> secs]48434.474: [scrub symbol& ? ? ? ?string tables, 0.0088370 secs] [1 >>>>>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>>>>> secs] >>>>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>>>>> ? ? (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>>>>> sys=0.00, real=8.61 secs] >>>>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>>>>> >>>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>>> are confirming that eliminating large array allocations might help >>>>>>>> here. Do you agree? >>>>>>>> I'm not sure what to make of the concurrent mode failure. >>>>>>>> >>>>>>>> Thanks in advance for any suggestions, >>>>>>>> Taras >>>>>>>> >>>>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>>>>> ? ? ?wrote: >>>>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>>>>> Hi Jon, >>>>>>>>>> >>>>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>>>>> environment, but so far, no failures yet. >>>>>>>>>> We know that the load we generate there is not representative. But >>>>>>>>>> perhaps we'll catch something, given enough patience. >>>>>>>>>> >>>>>>>>>> The flag will also be enabled in our production environment next week >>>>>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>>>>> there. >>>>>>>>>> >>>>>>>>>> I've got two questions for now: >>>>>>>>>> >>>>>>>>>> 1) From googling around on the output to expect >>>>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>>>>> ------- >>>>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >>>>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>>>>> ------- >>>>>>>>>> In that example line, what does the "0" stand for? >>>>>>>>> It's the index of the GC worker thread ?that experienced the promotion >>>>>>>>> failure. 
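Putting the two answers together, the leading number in such an entry is the index of the ParNew GC worker thread and the size that follows is in heap words. Below is a small, hypothetical sketch of pulling those two fields out of -XX:+PrintPromotionFailure output; the class name, the regular expression and the reuse of the example line quoted above are assumptions for illustration, not part of any HotSpot tooling.
--------------------------------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the worker-thread index and the failure size (in heap words)
// from -XX:+PrintPromotionFailure log fragments of the form
// "(0: promotion failure size = 2698)".
public class PromotionFailureGrep {
    private static final Pattern FAILURE =
            Pattern.compile("\\((\\d+): promotion failure size = (\\d+)\\)");

    public static void main(String[] args) {
        String line = "592.079: [ParNew (0: promotion failure size = 2698) "
                + "(promotion failed): 135865K->134943K(138240K), 0.1433555 secs]";
        Matcher m = FAILURE.matcher(line);
        while (m.find()) {
            int worker = Integer.parseInt(m.group(1));   // ParNew worker thread index
            long sizeWords = Long.parseLong(m.group(2)); // failure size in heap words
            System.out.printf("worker %d failed to promote %d words (%d bytes on x64)%n",
                    worker, sizeWords, sizeWords * 8);
        }
    }
}
--------------------------------
Run over a whole gc.log, this kind of extraction gives a quick per-worker distribution of failure sizes and makes it easy to separate the multi-megabyte array promotions from the small ones.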
>>>>>>>>> >>>>>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>>>>> application: >>>>>>>>>> ------- >>>>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>>>>> sys=0.00, real=0.06 secs] >>>>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>> ? ? ?CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>>>>> 3432839K->3423755K(5201920 >>>>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& ? ? ? ? ?string tables, >>>>>>>>>> 0.0093940 >>>>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>>>>> real=7.61 secs] >>>>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>>>>> ? ? ?(concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>>>>> sys=0.00, real=10.29 secs] >>>>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>>>>> ------- >>>>>>>>>> >>>>>>>>>> In this case I don't know how to interpret the output. 
>>>>>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>>>>> to >>>>>>>>> do recovery >>>>>>>>> from the failure. >>>>>>>>> >>>>>>>>>> b) There's a full GC that took 14.08 secs >>>>>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>>>>> concurrent-sweep >>>>>>>>> output but ?I think the "concurrent mode failure" message is part of >>>>>>>>> the >>>>>>>>> "Full GC" >>>>>>>>> message. ?My guess is that the 10.29 is the time for the Full GC and >>>>>>>>> the >>>>>>>>> 14.08 >>>>>>>>> maybe is part of the concurrent-sweep message. ?Really hard to be sure. >>>>>>>>> >>>>>>>>> Jon >>>>>>>>>> How are these events, and their (real) times related to each other? >>>>>>>>>> >>>>>>>>>> Thanks in advance, >>>>>>>>>> Taras >>>>>>>>>> >>>>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>>>>> Masamitsu ? ? ? ? ?wrote: >>>>>>>>>>> Taras, >>>>>>>>>>> >>>>>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>>>>> way to identify the root of your promotion failures (or >>>>>>>>>>> at least eliminating some possible causes). ? ?I think it >>>>>>>>>>> would help focus the discussion if you could send >>>>>>>>>>> the result of that experiment early. >>>>>>>>>>> >>>>>>>>>>> Jon >>>>>>>>>>> >>>>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>>>>> experiencing occasional promotion failures. >>>>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>>>>> >>>>>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>>>>> >>>>>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>>>>> (b). >>>>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>>>>> this. >>>>>>>>>>>> >>>>>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>>>>> 5-30 >>>>>>>>>>>> MB range. >>>>>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>>>>> a >>>>>>>>>>>> String/char[]). >>>>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>>>>> >>>>>>>>>>>> My questions are: >>>>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>>>>> failure? >>>>>>>>>>>> 2) If not, what's the next step of investigating the cause? 
>>>>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>>>>> a >>>>>>>>>>>> feeling for the size of the objects that fail promotion. >>>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>>>>> case? >>>>>>>>>>>> >>>>>>>>>>>> Thanks in advance, >>>>>>>>>>>> Taras >>>>>>>>>>>> >>>>>>>>>>>> a) Current JVM options: >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> -server >>>>>>>>>>>> -Xms5g >>>>>>>>>>>> -Xmx5g >>>>>>>>>>>> -Xmn400m >>>>>>>>>>>> -XX:PermSize=256m >>>>>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>>>>> -verbose:gc >>>>>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>>>>> -XX:+PrintGCDetails >>>>>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>>>>> -XX:+UseParNewGC >>>>>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>>>>> -Xloggc:gc.log >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> b) Promotion failure during ParNew >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>>>>> 3505808K->434291K >>>>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> c) Promotion failure during CMS >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 
>>>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>>>>> secs] >>>>>>>>>>>> ? ? ? 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>>>>> refs >>>>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>>>>> secs]703031.419: [scrub symbol& ? ? ? ? ? ?string tables, 0.0094960 >>>>>>>>>>>> secs] [1 CMS >>>>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>>>>> ? ? ? 
(concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>>>>> 10.9594300 >>>>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>>>>> 885552K->603017K(5201920K), >>>>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>>> _______________________________________________ >>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
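The change Taras describes in the closing message of this thread, replacing the byte[]-then-String/char[] double copy of the 5-30 MB XML documents with streaming processing, could look roughly like the StAX sketch below. It is only an illustrative outline of that approach, not the application's actual code; the element name "record" and the handleText() hook are invented for the example.
--------------------------------
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Parse the XML directly from the download stream with StAX instead of
// first copying it into a byte[] and then into a String/char[].
public class StreamingXmlProcessor {

    public void process(InputStream xmlStream) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(xmlStream);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // getElementText() only buffers one element's content,
                    // never the whole multi-megabyte document.
                    handleText(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }

    private void handleText(String text) {
        // Application-specific processing of one element's text goes here.
    }
}
--------------------------------
Because nothing larger than one element's text is ever materialised, the multi-megabyte char[] allocations that showed up as huge promotion failures in the logs above are avoided altogether.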
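If promotion failures were to persist after that change, Jon's earlier suggestion in this thread was to try doubling the young generation before adjusting anything else. Relative to the options quoted above, that experiment is a one-line change; the list below is only an illustrative starting point for testing, not validated production settings, and every other flag stays as listed in the thread.
--------------------------------
-Xms5g
-Xmx5g
-Xmn800m
-XX:SurvivorRatio=8
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=68
-XX:+PrintPromotionFailure
--------------------------------
Here -Xmn800m doubles the original -Xmn400m; the remaining flags are unchanged from the configuration quoted earlier in the thread.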