From kinnari.darji at citi.com Tue Jan 3 13:36:18 2012 From: kinnari.darji at citi.com (Darji, Kinnari ) Date: Tue, 3 Jan 2012 16:36:18 -0500 Subject: ParNew garbage collection Message-ID: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> Hello GC team, I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. 2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew Desired survivor size 19628032 bytes, new threshold 4 (max 4) - age 1: 4466024 bytes, 4466024 total - age 2: 3568136 bytes, 8034160 total - age 3: 3559808 bytes, 11593968 total - age 4: 1737520 bytes, 13331488 total : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] Total time for which application threads were stopped: 1.7197830 seconds Application time: 8.4134780 seconds Thank you Kinnari -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120103/3e75a750/attachment.html From jon.masamitsu at oracle.com Wed Jan 4 09:43:34 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Wed, 04 Jan 2012 09:43:34 -0800 Subject: ParNew garbage collection In-Reply-To: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> References: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> Message-ID: <4F048FC6.30907@oracle.com> Try turning on TraceSafepointCleanupTime. I haven't used it myself. If that's not it, look in share/vm/runtime/globals.hpp for some other flag that traces safepoints. On 1/3/2012 1:36 PM, Darji, Kinnari wrote: > Hello GC team, > I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. > > 2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew > Desired survivor size 19628032 bytes, new threshold 4 (max 4) > - age 1: 4466024 bytes, 4466024 total > - age 2: 3568136 bytes, 8034160 total > - age 3: 3559808 bytes, 11593968 total > - age 4: 1737520 bytes, 13331488 total > : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] > Total time for which application threads were stopped: 1.7197830 seconds > Application time: 8.4134780 seconds > > > > Thank you > Kinnari > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120104/1d676f2c/attachment.html From ysr1729 at gmail.com Wed Jan 4 09:53:52 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 4 Jan 2012 09:53:52 -0800 Subject: ParNew garbage collection In-Reply-To: <4F048FC6.30907@oracle.com> References: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> <4F048FC6.30907@oracle.com> Message-ID: May be also +PrintSafepointStatistics (and related parms) to drill down a bit further, although TraceSafepointCleanup would probably provide all of the info on a per-safepoint basis. 
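(Editorial aside, not part of the original thread: the gap Kinnari reports, 0.04 sec of GC real time against 1.71 sec of stopped time, can be hunted for systematically by scanning the log for safepoints whose stopped time far exceeds the real time of the collection reported just before them. The sketch below assumes the log format shown above; the gc.log file name and the 0.5-second threshold are arbitrary placeholders, and the safepoint flags named in this thread are what would break a confirmed gap down further.)

-------
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans a GC log for safepoints whose reported stopped time greatly exceeds the
// real time of the preceding collection, i.e. time spent in non-GC safepoint work.
public class SafepointGapScanner {
    private static final Pattern GC_REAL =
            Pattern.compile("real=([0-9]+[.,][0-9]+) secs");
    private static final Pattern STOPPED =
            Pattern.compile("stopped: ([0-9]+[.,][0-9]+) seconds");

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "gc.log";   // placeholder file name
        double lastGcReal = 0.0;
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher gc = GC_REAL.matcher(line);
                if (gc.find()) {
                    lastGcReal = parse(gc.group(1));
                }
                Matcher stop = STOPPED.matcher(line);
                if (stop.find()) {
                    double stopped = parse(stop.group(1));
                    if (stopped - lastGcReal > 0.5) {          // arbitrary 500 ms threshold
                        System.out.printf("stopped %.2fs, GC real only %.2fs: %s%n",
                                stopped, lastGcReal, line.trim());
                    }
                }
            }
        } finally {
            in.close();
        }
    }

    private static double parse(String s) {
        return Double.parseDouble(s.replace(',', '.'));        // some logs use a comma decimal separator
    }
}
-------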
There was an old issue wrt monitor deflation that was fixed a few releases ago, so Kinnari should check the version of the JVM she's running on as well.... (There are now a couple of flags related to monitor list handling policies, I believe, but I have no experience with them and do not have the code in front of me -- make sure to cc the runtime list if that turns out to be the issue again and you are already on a very recent version of the JVM.) -- ramki On Wed, Jan 4, 2012 at 9:43 AM, Jon Masamitsu wrote: > ** > Try turning on TraceSafepointCleanupTime. I haven't used it myself. If > that's not it, look in share/vm/runtime/globals.hpp for some other flag > that traces safepoints. > > > On 1/3/2012 1:36 PM, Darji, Kinnari wrote: > > Hello GC team, > I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. > > 2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew > Desired survivor size 19628032 bytes, new threshold 4 (max 4) > - age 1: 4466024 bytes, 4466024 total > - age 2: 3568136 bytes, 8034160 total > - age 3: 3559808 bytes, 11593968 total > - age 4: 1737520 bytes, 13331488 total > : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] > Total time for which application threads were stopped: 1.7197830 seconds > Application time: 8.4134780 seconds > > > > Thank you > Kinnari > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120104/c186dd04/attachment.html From taras.tielkes at gmail.com Thu Jan 5 15:32:50 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Fri, 6 Jan 2012 00:32:50 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4EF9FCAC.3030208@oracle.com> References: <4EF9FCAC.3030208@oracle.com> Message-ID: Hi Jon, We've enabled the PrintPromotionFailure flag in our preprod environment, but so far, no failures yet. We know that the load we generate there is not representative. But perhaps we'll catch something, given enough patience. The flag will also be enabled in our production environment next week - so one way or the other, we'll get more diagnostic data soon. I'll also do some allocation profiling of the application in isolation - I know that there is abusive large byte[] and char[] allocation in there. I've got two questions for now: 1) From googling around on the output to expect (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), I see that -XX:+PrintPromotionFailure will generate output like this: ------- 592.079: [ParNew (0: promotion failure size = 2698) (promotion failed): 135865K->134943K(138240K), 0.1433555 secs] ------- In that example line, what does the "0" stand for?
2) Below is a snippet of (real) gc log from our production application: ------- 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: 345951K->40960K(368640K), 0.0676780 secs] 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 sys=0.01, real=0.06 secs] 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: 368640K->40959K(368640K), 0.0618880 secs] 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 2011-12-30T22:42:24.099+0100: 2136593.001: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] [Times: user=5.70 sys=0.23, real=5.23 secs] 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] 3432839K->3423755K(5201920 K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak refs processing, 0.0034280 secs]2136605.804: [class unloading, 0.0289480 secs]2136605.833: [scrub symbol & string tables, 0.0093940 secs] [1 CMS-remark: 3318289K(4833280K )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, real=7.61 secs] 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 secs] 3491471K->291853K(5201920K), [CMS Perm : 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 sys=0.00, real=10.29 secs] 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] ------- In this case I don't know how to interpret the output. a) There's a promotion failure that took 7.49 secs b) There's a full GC that took 14.08 secs c) There's a concurrent mode failure that took 10.29 secs How are these events, and their (real) times related to each other? Thanks in advance, Taras On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu wrote: > Taras, > > PrintPromotionFailure seems like it would go a long > way to identify the root of your promotion failures (or > at least eliminating some possible causes). ? ?I think it > would help focus the discussion if you could send > the result of that experiment early. 
> > Jon > > On 12/27/2011 5:07 AM, Taras Tielkes wrote: >> Hi, >> >> We're running an application with the CMS/ParNew collectors that is >> experiencing occasional promotion failures. >> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >> I've listed the specific JVM options used below (a). >> >> The application is deployed across a handful of machines, and the >> promotion failures are fairly uniform across those. >> >> The first kind of failure we observe is a promotion failure during >> ParNew collection, I've included a snipped from the gc log below (b). >> The second kind of failure is a concurrrent mode failure (perhaps >> triggered by the same cause), see (c) below. >> The frequency (after running for a some weeks) is approximately once >> per day. This is bearable, but obviously we'd like to improve on this. >> >> Apart from high-volume request handling (which allocates a lot of >> small objects), the application also runs a few dozen background >> threads that download and process XML documents, typically in the 5-30 >> MB range. >> A known deficiency in the existing code is that the XML content is >> copied twice before processing (once to a byte[], and later again to a >> String/char[]). >> Given that a 30 MB XML stream will result in a 60 MB >> java.lang.String/char[], my suspicion is that these big array >> allocations are causing us to run into the CMS fragmentation issue. >> >> My questions are: >> 1) Does the data from the GC logs provide sufficient evidence to >> conclude that CMS fragmentation is the cause of the promotion failure? >> 2) If not, what's the next step of investigating the cause? >> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >> feeling for the size of the objects that fail promotion. >> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >> reliable approach to diagnose CMS fragmentation. Is this indeed the >> case? 
>> >> Thanks in advance, >> Taras >> >> a) Current JVM options: >> -------------------------------- >> -server >> -Xms5g >> -Xmx5g >> -Xmn400m >> -XX:PermSize=256m >> -XX:MaxPermSize=256m >> -XX:+PrintGCTimeStamps >> -verbose:gc >> -XX:+PrintGCDateStamps >> -XX:+PrintGCDetails >> -XX:SurvivorRatio=8 >> -XX:+UseConcMarkSweepGC >> -XX:+UseParNewGC >> -XX:+DisableExplicitGC >> -XX:+UseCMSInitiatingOccupancyOnly >> -XX:+CMSClassUnloadingEnabled >> -XX:+CMSScavengeBeforeRemark >> -XX:CMSInitiatingOccupancyFraction=68 >> -Xloggc:gc.log >> -------------------------------- >> >> b) Promotion failure during ParNew >> -------------------------------- >> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >> 368640K->40959K(368640K), 0.0693460 secs] >> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >> sys=0.01, real=0.07 secs] >> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >> 368639K->31321K(368640K), 0.0511400 secs] >> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >> sys=0.00, real=0.05 secs] >> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >> 359001K->18694K(368640K), 0.0272970 secs] >> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >> sys=0.00, real=0.03 secs] >> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >> (promotion failed): 338813K->361078K(368640K), 0.1321200 >> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >> 3505808K->434291K >> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >> [Times: user=5.24 sys=0.00, real=5.02 secs] >> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >> -------------------------------- >> >> c) Promotion failure during CMS >> -------------------------------- >> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >> 357228K->40960K(368640K), 0.0525110 secs] >> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >> sys=0.00, real=0.05 secs] >> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >> 366075K->37119K(368640K), 0.0479780 secs] >> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >> sys=0.01, real=0.05 secs] >> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >> 364792K->40960K(368640K), 0.0421740 secs] >> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >> sys=0.00, real=0.04 secs] >> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >> user=0.02 sys=0.00, real=0.03 secs] >> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >> 368640K->40960K(368640K), 0.0836690 secs] >> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >> sys=0.01, real=0.08 secs] >> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >> 1.108/1.200 secs] [Times: user=6.83 
sys=0.23, real=1.20 secs] >> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >> 2011-12-14T08:29:30.938+0100: 703019.840: >> [CMS-concurrent-abortable-preclean-start] >> 2011-12-14T08:29:32.337+0100: 703021.239: >> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >> user=6.68 sys=0.27, real=1.40 secs] >> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >> ? 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >> sys=2.58, real=9.88 secs] >> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >> secs]703031.419: [scrub symbol& ?string tables, 0.0094960 secs] [1 CMS >> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >> [Times: user=13.73 sys=2.59, real=10.19 secs] >> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >> ? (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >> secs] 3739469K->433437K(5201920K), [CMS Perm : >> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >> sys=0.00, real=10.96 secs] >> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >> -------------------------------- >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Thu Jan 5 23:27:44 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Thu, 05 Jan 2012 23:27:44 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> Message-ID: <4F06A270.3010701@oracle.com> On 1/5/2012 3:32 PM, Taras Tielkes wrote: > Hi Jon, > > We've enabled the PrintPromotionFailure flag in our preprod > environment, but so far, no failures yet. > We know that the load we generate there is not representative. But > perhaps we'll catch something, given enough patience. > > The flag will also be enabled in our production environment next week > - so one way or the other, we'll get more diagnostic data soon. > I'll also do some allocation profiling of the application in isolation > - I know that there is abusive large byte[] and char[] allocation in > there. 
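(Editorial aside, not from the thread: the "abusive large byte[] and char[] allocation" mentioned above, reading a whole 5-30 MB XML document into a byte[] and then copying it again into a String/char[], is exactly the kind of oversized contiguous allocation that aggravates CMS fragmentation, since a 30 MB stream becomes a roughly 60 MB char[]. Below is a hedged sketch of a streaming alternative using the standard StAX API; the startElement/characters callbacks are hypothetical placeholders, as the real processing code is not shown in the thread.)

-------
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Processes the XML directly off the stream, so the largest allocations are
// per-element strings rather than one contiguous 30 MB byte[] plus a 60 MB char[].
public class StreamingXmlProcessor {

    public void process(InputStream xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    startElement(reader.getLocalName());
                } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                    characters(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
    }

    // Hypothetical callbacks; the actual downstream processing is not part of the thread.
    private void startElement(String name) { /* ... */ }

    private void characters(String text) { /* ... */ }
}
-------

Whether a change like this is feasible naturally depends on whether the downstream processing really needs the whole document in memory at once.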
> > I've got two questions for now: > > 1) From googling around on the output to expect > (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), > I see that -XX:+PrintPromotionFailure will generate output like this: > ------- > 592.079: [ParNew (0: promotion failure size = 2698) (promotion > failed): 135865K->134943K(138240K), 0.1433555 secs] > ------- > In that example line, what does the "0" stand for? It's the index of the GC worker thread that experienced the promotion failure. > 2) Below is a snippet of (real) gc log from our production application: > ------- > 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: > 345951K->40960K(368640K), 0.0676780 secs] > 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 > sys=0.01, real=0.06 secs] > 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: > 368640K->40959K(368640K), 0.0618880 secs] > 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 > sys=0.00, real=0.06 secs] > 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: > 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: > user=0.04 sys=0.00, real=0.04 secs] > 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] > 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: > 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] > 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] > 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: > 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] > 2011-12-30T22:42:24.099+0100: 2136593.001: > [CMS-concurrent-abortable-preclean-start] > CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: > 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] > [Times: user=5.70 sys=0.23, real=5.23 secs] > 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K > (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: > [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] > 3432839K->3423755K(5201920 > K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] > 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak > refs processing, 0.0034280 secs]2136605.804: [class unloading, > 0.0289480 secs]2136605.833: [scrub symbol& string tables, 0.0093940 > secs] [1 CMS-remark: 3318289K(4833280K > )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, > real=7.61 secs] > 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] > 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: > [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: > 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] > (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 > secs] 3491471K->291853K(5201920K), [CMS Perm : > 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 > sys=0.00, real=10.29 secs] > 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: > 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), > 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] > 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: > 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), > 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] > ------- > > In this case I don't know how to interpret the output. 
> a) There's a promotion failure that took 7.49 secs This is the time it took to attempt the minor collection (ParNew) and to do recovery from the failure. > b) There's a full GC that took 14.08 secs > c) There's a concurrent mode failure that took 10.29 secs Not sure about b) and c) because the output is mixed up with the concurrent-sweep output but I think the "concurrent mode failure" message is part of the "Full GC" message. My guess is that the 10.29 is the time for the Full GC and the 14.08 maybe is part of the concurrent-sweep message. Really hard to be sure. Jon > How are these events, and their (real) times related to each other? > > Thanks in advance, > Taras > > On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu wrote: >> Taras, >> >> PrintPromotionFailure seems like it would go a long >> way to identify the root of your promotion failures (or >> at least eliminating some possible causes). I think it >> would help focus the discussion if you could send >> the result of that experiment early. >> >> Jon >> >> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>> Hi, >>> >>> We're running an application with the CMS/ParNew collectors that is >>> experiencing occasional promotion failures. >>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>> I've listed the specific JVM options used below (a). >>> >>> The application is deployed across a handful of machines, and the >>> promotion failures are fairly uniform across those. >>> >>> The first kind of failure we observe is a promotion failure during >>> ParNew collection, I've included a snipped from the gc log below (b). >>> The second kind of failure is a concurrrent mode failure (perhaps >>> triggered by the same cause), see (c) below. >>> The frequency (after running for a some weeks) is approximately once >>> per day. This is bearable, but obviously we'd like to improve on this. >>> >>> Apart from high-volume request handling (which allocates a lot of >>> small objects), the application also runs a few dozen background >>> threads that download and process XML documents, typically in the 5-30 >>> MB range. >>> A known deficiency in the existing code is that the XML content is >>> copied twice before processing (once to a byte[], and later again to a >>> String/char[]). >>> Given that a 30 MB XML stream will result in a 60 MB >>> java.lang.String/char[], my suspicion is that these big array >>> allocations are causing us to run into the CMS fragmentation issue. >>> >>> My questions are: >>> 1) Does the data from the GC logs provide sufficient evidence to >>> conclude that CMS fragmentation is the cause of the promotion failure? >>> 2) If not, what's the next step of investigating the cause? >>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >>> feeling for the size of the objects that fail promotion. >>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>> case? 
>>> >>> Thanks in advance, >>> Taras >>> >>> a) Current JVM options: >>> -------------------------------- >>> -server >>> -Xms5g >>> -Xmx5g >>> -Xmn400m >>> -XX:PermSize=256m >>> -XX:MaxPermSize=256m >>> -XX:+PrintGCTimeStamps >>> -verbose:gc >>> -XX:+PrintGCDateStamps >>> -XX:+PrintGCDetails >>> -XX:SurvivorRatio=8 >>> -XX:+UseConcMarkSweepGC >>> -XX:+UseParNewGC >>> -XX:+DisableExplicitGC >>> -XX:+UseCMSInitiatingOccupancyOnly >>> -XX:+CMSClassUnloadingEnabled >>> -XX:+CMSScavengeBeforeRemark >>> -XX:CMSInitiatingOccupancyFraction=68 >>> -Xloggc:gc.log >>> -------------------------------- >>> >>> b) Promotion failure during ParNew >>> -------------------------------- >>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>> 368640K->40959K(368640K), 0.0693460 secs] >>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>> sys=0.01, real=0.07 secs] >>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>> 368639K->31321K(368640K), 0.0511400 secs] >>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>> sys=0.00, real=0.05 secs] >>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>> 359001K->18694K(368640K), 0.0272970 secs] >>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>> sys=0.00, real=0.03 secs] >>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>> 3505808K->434291K >>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>> -------------------------------- >>> >>> c) Promotion failure during CMS >>> -------------------------------- >>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>> 357228K->40960K(368640K), 0.0525110 secs] >>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>> sys=0.00, real=0.05 secs] >>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>> 366075K->37119K(368640K), 0.0479780 secs] >>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>> sys=0.01, real=0.05 secs] >>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>> 364792K->40960K(368640K), 0.0421740 secs] >>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>> sys=0.00, real=0.04 secs] >>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>> user=0.02 sys=0.00, real=0.03 secs] >>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>> 368640K->40960K(368640K), 0.0836690 secs] >>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>> sys=0.01, real=0.08 secs] >>> 
2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>> 2011-12-14T08:29:30.938+0100: 703019.840: >>> [CMS-concurrent-abortable-preclean-start] >>> 2011-12-14T08:29:32.337+0100: 703021.239: >>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>> user=6.68 sys=0.27, real=1.40 secs] >>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>> sys=2.58, real=9.88 secs] >>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>> secs]703031.419: [scrub symbol& string tables, 0.0094960 secs] [1 CMS >>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>> (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>> sys=0.00, real=10.96 secs] >>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>> -------------------------------- >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From java at java4.info Mon Jan 9 03:08:28 2012 From: java at java4.info (Florian Binder) Date: Mon, 09 Jan 2012 12:08:28 +0100 Subject: Very long young gc pause (ParNew with CMS) Message-ID: <4F0ACAAC.8020103@java4.info> Hi everybody, I am using CMS (with ParNew) gc and have very long (> 6 seconds) young gc pauses. As you can see in the log below the old-gen-heap consists of one large block, the new Size has 256m, it uses 13 worker threads and it has to copy 27505761 words (~210mb) directly from eden to old gen. 
I have seen that this problem occurs only after about one week of uptime. Even thought we make a full (compacting) gc every night. Since real-time > user-time I assume it might be a synchronization problem. Can this be true? Do you have any Ideas how I can speed up this gcs? Please let me know, if you need more informations. Thank you, Flo ##### java -version ##### java version "1.6.0_29" Java(TM) SE Runtime Environment (build 1.6.0_29-b11) Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) ##### The startup parameters: ##### -Xms28G -Xmx28G -XX:+UseConcMarkSweepGC \ -XX:CMSMaxAbortablePrecleanTime=10000 \ -XX:SurvivorRatio=8 \ -XX:TargetSurvivorRatio=90 \ -XX:MaxTenuringThreshold=31 \ -XX:CMSInitiatingOccupancyFraction=80 \ -XX:NewSize=256M \ -verbose:gc \ -XX:+PrintFlagsFinal \ -XX:PrintFLSStatistics=1 \ -XX:+PrintGCDetails \ -XX:+PrintGCDateStamps \ -XX:-TraceClassUnloading \ -XX:+PrintGCApplicationConcurrentTime \ -XX:+PrintGCApplicationStoppedTime \ -XX:+PrintTenuringDistribution \ -XX:+CMSClassUnloadingEnabled \ -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ -Djava.awt.headless=true ##### From the out-file (as of +PrintFlagsFinal): ##### ParallelGCThreads = 13 ##### The gc.log-excerpt: ##### Application time: 20,0617700 seconds 2011-12-22T12:02:03.289+0100: [GC Before GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 1183290963 Max Chunk Size: 1183290963 Number of Blocks: 1 Av. Block Size: 1183290963 Tree Height: 1 Before GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 0 Max Chunk Size: 0 Number of Blocks: 0 Tree Height: 0 [ParNew Desired survivor size 25480392 bytes, new threshold 1 (max 31) - age 1: 28260160 bytes, 28260160 total : 249216K->27648K(249216K), 6,1808130 secs] 20061765K->20056210K(29332480K)After GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 1155785202 Max Chunk Size: 1155785202 Number of Blocks: 1 Av. Block Size: 1155785202 Tree Height: 1 After GC: Statistics for BinaryTreeDictionary: ------------------------------------ Total Free Space: 0 Max Chunk Size: 0 Number of Blocks: 0 Tree Height: 0 , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] Total time for which application threads were stopped: 6,1818730 seconds From ysr1729 at gmail.com Mon Jan 9 10:40:43 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 9 Jan 2012 10:40:43 -0800 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: Haven't looked at any logs, but setting MaxTenuringThreshold to 31 can be bad. I'd dial that down to 8, or leave it at the default of 15. (Your GC logs which must presumably include the tenuring distribution should inform you as to a more optimal size to use. As Kirk noted, premature promotion can be bad, and so can survivor space overflow, which can lead to premature promotion and exacerbate fragmentation.) -- ramki On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. 
Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/17f1facd/attachment.html From java at java4.info Mon Jan 9 11:18:13 2012 From: java at java4.info (Florian Binder) Date: Mon, 09 Jan 2012 20:18:13 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0B3D75.9060602@java4.info> Hi Ramki, Yes, I am agreed with you. 31 is too large and I have removed the parameter (using default now). Nevertheless this is not the problem as the max used age was always 1. Since the most (more than 90%) new allocated objects in our application live for a long time (>1h) we mostly will have premature promotion. Is there a way to optimize this? 
I have seen most time, when young gc needs much time (> 6 secs) there is only one large block in the old gen. If there has been a cms-old-gen-collection and there are more than one blocks in the old generation it is mostly (not always) much faster (needs less than 200ms). Is it possible that premature promotion can not be done parallel if there is only one large block in the old gen? In the past we have had a problem with fragmentation on this server but this is gone since we increased memory for it and triggered a full gc (compacting) every night, like Tony advised us. With setting the initiating occupancy fraction to 80% we have only a few (~10) old generation collections (which are very fast) and the heap fragmentation is low. Flo Am 09.01.2012 19:40, schrieb Srinivas Ramakrishna: > Haven't looked at any logs, but setting MaxTenuringThreshold to 31 can > be bad. I'd dial that down to 8, > or leave it at the default of 15. (Your GC logs which must presumably > include the tenuring distribution should > inform you as to a more optimal size to use. As Kirk noted, premature > promotion can be bad, and so can > survivor space overflow, which can lead to premature promotion and > exacerbate fragmentation.) > > -- ramki > > On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder > wrote: > > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. 
Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 > seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/a2997a2e/attachment.html From jon.masamitsu at oracle.com Mon Jan 9 11:24:05 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Mon, 09 Jan 2012 11:24:05 -0800 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0B3ED5.6010802@oracle.com> Florian, Have you even turned on PrintReferenceGC to see if you are spending a significant amount of time doing Reference processing? If you do see significant Reference processing times , you can try turning on ParallelRefProcEnabled. Jon On 01/09/12 03:08, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time> user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. 
> > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From kirk at kodewerk.com Mon Jan 9 11:06:26 2012 From: kirk at kodewerk.com (Kirk Pepperdine) Date: Mon, 9 Jan 2012 20:06:26 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: Hi Ramki, AFAICT given the limited GC log, the calculated tenuring threshold is always 1 which mean's he always flooding survivor spaces (i.e. suffering from premature promotion). My guess is that the tuning strategy assumes cost of long lived objects dominates and so heap is configured to minimize (survivor) copy costs. But it would appear that this strategy has backfired. Look at young gen size and if you do the maths you can see that there is no chance of there not being premature promotion. WIth the 80% initiating occupancy fraction.. well, that can't lead to anything good either. WIth the VM so misconfigured it's difficult to estimate true live set size which could then be used to calculate more reasonable pool sizes. So, with all the promtion going on, I suspect that fragmentation is making it difficult to reallocate the object in tenuring... hence long pause time. 
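(Editorial aside, putting numbers from Florian's log to Kirk's "do the maths" remark: the young generation is reported as 249216K of usable space, which with SurvivorRatio=8 makes one survivor space roughly 27 MB, and TargetSurvivorRatio=90 gives the "Desired survivor size 25480392 bytes" line; the age-1 cohort alone is 28260160 bytes, so the threshold collapses to 1 and the overflow is promoted at every minor collection. A sketch of that arithmetic, with values that differ slightly from the VM's because of heap alignment:)

-------
// Back-of-the-envelope check against the numbers in Florian's log.
public class SurvivorMath {
    public static void main(String[] args) {
        long youngUsable = 249216L * 1024;             // "(249216K)" = eden + one survivor
        long survivor    = youngUsable / 9;            // SurvivorRatio=8 -> eden is 8x a survivor
        long desired     = survivor * 90 / 100;        // TargetSurvivorRatio=90
        long age1        = 28260160L;                  // "- age 1: 28260160 bytes" in the log

        System.out.println("survivor space  ~" + survivor / 1024 + " KB");
        System.out.println("desired size    ~" + desired + " bytes (log prints 25480392)");
        System.out.println("age-1 cohort fits? " + (age1 <= desired)); // false -> threshold 1
    }
}
-------

That is consistent with the tenuring distribution in the log, where the computed threshold stays at 1 despite a maximum of 31.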
Would you say with these large data strictures that it might be difficult for the CMS to parallelize the scan for roots? The abortable pre-clean aborts on time which means that it's not able to clear out much and given the apparent life-cycle, is it worth running this phase? In fact, would you not guess that the parallel collector do better in this scenario? -- Kirk ps. I'm always happy beat you to the punch.. 'cos it's very difficult to do. ;-) On 2012-01-09, at 7:40 PM, Srinivas Ramakrishna wrote: > Haven't looked at any logs, but setting MaxTenuringThreshold to 31 can be bad. I'd dial that down to 8, > or leave it at the default of 15. (Your GC logs which must presumably include the tenuring distribution should > inform you as to a more optimal size to use. As Kirk noted, premature promotion can be bad, and so can > survivor space overflow, which can lead to premature promotion and exacerbate fragmentation.) > > -- ramki > > On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. 
Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/08d12ac9/attachment.html From chkwok at digibites.nl Mon Jan 9 11:33:52 2012 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Mon, 9 Jan 2012 20:33:52 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: Just making sure the obvious case is covered: is it just me or is 6s real > 3.5s user+sys with 13 threads just plain weird? That means there was 0.5 thread actually running on the average during that collection. Do a sar -B (requires package sysstat) and see if there were any major pagefaults (or indirectly via cacti and other monitoring tools via memory usage, load average etc, or even via cat /proc/vmstat and pgmajfault), I've seen those cause these kind of times during GC. Chi Ho Kwok On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder wrote: > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. 
> > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/ffc7400e/attachment.html From java at java4.info Mon Jan 9 11:47:32 2012 From: java at java4.info (Florian Binder) Date: Mon, 09 Jan 2012 20:47:32 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0B4454.2010206@java4.info> Yes! You are right! I have a lot of page faults when gc is taking so much time. For example (sar -B): 00:00:01 pgpgin/s pgpgout/s fault/s majflt/s 00:50:01 0,01 45,18 162,29 0,00 01:00:01 0,02 46,58 170,45 0,00 01:10:02 25313,71 27030,39 27464,37 0,02 01:20:02 23456,85 25371,28 13621,92 0,01 01:30:01 22778,76 22918,60 10136,71 0,03 01:40:11 19020,44 22723,65 8617,42 0,15 01:50:01 5,52 44,22 147,26 0,05 What is this meaning and how can I avoid it? Flo Am 09.01.2012 20:33, schrieb Chi Ho Kwok: > Just making sure the obvious case is covered: is it just me or is 6s > real > 3.5s user+sys with 13 threads just plain weird? 
That means > there was 0.5 thread actually running on the average during that > collection. > > Do a sar -B (requires package sysstat) and see if there were any major > pagefaults (or indirectly via cacti and other monitoring tools via > memory usage, load average etc, or even via cat /proc/vmstat and > pgmajfault), I've seen those cause these kind of times during GC. > > > Chi Ho Kwok > > On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder > wrote: > > Hi everybody, > > I am using CMS (with ParNew) gc and have very long (> 6 seconds) young > gc pauses. > As you can see in the log below the old-gen-heap consists of one large > block, the new Size has 256m, it uses 13 worker threads and it has to > copy 27505761 words (~210mb) directly from eden to old gen. > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Do you have any Ideas how I can speed up this gcs? > > Please let me know, if you need more informations. > > Thank you, > Flo > > > ##### java -version ##### > java version "1.6.0_29" > Java(TM) SE Runtime Environment (build 1.6.0_29-b11) > Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) > > ##### The startup parameters: ##### > -Xms28G -Xmx28G > -XX:+UseConcMarkSweepGC \ > -XX:CMSMaxAbortablePrecleanTime=10000 \ > -XX:SurvivorRatio=8 \ > -XX:TargetSurvivorRatio=90 \ > -XX:MaxTenuringThreshold=31 \ > -XX:CMSInitiatingOccupancyFraction=80 \ > -XX:NewSize=256M \ > > -verbose:gc \ > -XX:+PrintFlagsFinal \ > -XX:PrintFLSStatistics=1 \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:-TraceClassUnloading \ > -XX:+PrintGCApplicationConcurrentTime \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+PrintTenuringDistribution \ > -XX:+CMSClassUnloadingEnabled \ > -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ > -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ > > -Djava.awt.headless=true > > ##### From the out-file (as of +PrintFlagsFinal): ##### > ParallelGCThreads = 13 > > ##### The gc.log-excerpt: ##### > Application time: 20,0617700 seconds > 2011-12-22T12:02:03.289+0100: [GC Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1183290963 > Max Chunk Size: 1183290963 > Number of Blocks: 1 > Av. Block Size: 1183290963 > Tree Height: 1 > Before GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > [ParNew > Desired survivor size 25480392 bytes, new threshold 1 (max 31) > - age 1: 28260160 bytes, 28260160 total > : 249216K->27648K(249216K), 6,1808130 secs] > 20061765K->20056210K(29332480K)After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 1155785202 > Max Chunk Size: 1155785202 > Number of Blocks: 1 > Av. 
Block Size: 1155785202 > Tree Height: 1 > After GC: > Statistics for BinaryTreeDictionary: > ------------------------------------ > Total Free Space: 0 > Max Chunk Size: 0 > Number of Blocks: 0 > Tree Height: 0 > , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] > Total time for which application threads were stopped: 6,1818730 > seconds > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120109/4bdedec2/attachment-0001.html From chkwok at digibites.nl Mon Jan 9 21:21:48 2012 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Tue, 10 Jan 2012 06:21:48 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0B4454.2010206@java4.info> References: <4F0ACAAC.8020103@java4.info> <4F0B4454.2010206@java4.info> Message-ID: Hi Florian, Uh, you might want to try sar -r as well, that reports memory usage (and man sar for other reporting options, and -f /var/log/sysstat/saXX where xx is the day for older data is useful as well). Page in / out means reading or writing to the swap file, usual cause here is one or more huge background task / cron jobs taking up too much memory forcing other things to swap out. You can try reducing the size of the heap and see if it helps if you're just a little bit short, but otherwise I don't think you can solve this with just VM options. Here's the relevant section from the manual: -B Report paging statistics. Some of the metrics below are > available only with post 2.5 kernels. The following values are displayed: > pgpgin/s > Total number of kilobytes the system paged in from > disk per second. Note: With old kernels (2.2.x) this value is a number of > blocks > per second (and not kilobytes). > pgpgout/s > Total number of kilobytes the system paged out to > disk per second. Note: With old kernels (2.2.x) this value is a number of > blocks > per second (and not kilobytes). > fault/s > Number of page faults (major + minor) made by the > system per second. This is not a count of page faults that generate I/O, > because > some page faults can be resolved without I/O. > majflt/s > Number of major faults the system has made per > second, those which have required loading a memory page from disk. I'm not sure what kernel you're on, but pgpgin / out being high is a bad thing. Sar seems to report that all faults are minor, but that conflicts with the first two columns. Chi Ho Kwok On Mon, Jan 9, 2012 at 8:47 PM, Florian Binder wrote: > Yes! > You are right! > I have a lot of page faults when gc is taking so much time. > > For example (sar -B): > 00:00:01 pgpgin/s pgpgout/s fault/s majflt/s > 00:50:01 0,01 45,18 162,29 0,00 > 01:00:01 0,02 46,58 170,45 0,00 > 01:10:02 25313,71 27030,39 27464,37 0,02 > 01:20:02 23456,85 25371,28 13621,92 0,01 > 01:30:01 22778,76 22918,60 10136,71 0,03 > 01:40:11 19020,44 22723,65 8617,42 0,15 > 01:50:01 5,52 44,22 147,26 0,05 > > What is this meaning and how can I avoid it? > > > Flo > > > > Am 09.01.2012 20:33, schrieb Chi Ho Kwok: > > Just making sure the obvious case is covered: is it just me or is 6s real > > 3.5s user+sys with 13 threads just plain weird? That means there was 0.5 > thread actually running on the average during that collection. 
> > Do a sar -B (requires package sysstat) and see if there were any major > pagefaults (or indirectly via cacti and other monitoring tools via memory > usage, load average etc, or even via cat /proc/vmstat and pgmajfault), I've > seen those cause these kind of times during GC. > > > Chi Ho Kwok > > On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder wrote: > >> Hi everybody, >> >> I am using CMS (with ParNew) gc and have very long (> 6 seconds) young >> gc pauses. >> As you can see in the log below the old-gen-heap consists of one large >> block, the new Size has 256m, it uses 13 worker threads and it has to >> copy 27505761 words (~210mb) directly from eden to old gen. >> I have seen that this problem occurs only after about one week of >> uptime. Even thought we make a full (compacting) gc every night. >> Since real-time > user-time I assume it might be a synchronization >> problem. Can this be true? >> >> Do you have any Ideas how I can speed up this gcs? >> >> Please let me know, if you need more informations. >> >> Thank you, >> Flo >> >> >> ##### java -version ##### >> java version "1.6.0_29" >> Java(TM) SE Runtime Environment (build 1.6.0_29-b11) >> Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) >> >> ##### The startup parameters: ##### >> -Xms28G -Xmx28G >> -XX:+UseConcMarkSweepGC \ >> -XX:CMSMaxAbortablePrecleanTime=10000 \ >> -XX:SurvivorRatio=8 \ >> -XX:TargetSurvivorRatio=90 \ >> -XX:MaxTenuringThreshold=31 \ >> -XX:CMSInitiatingOccupancyFraction=80 \ >> -XX:NewSize=256M \ >> >> -verbose:gc \ >> -XX:+PrintFlagsFinal \ >> -XX:PrintFLSStatistics=1 \ >> -XX:+PrintGCDetails \ >> -XX:+PrintGCDateStamps \ >> -XX:-TraceClassUnloading \ >> -XX:+PrintGCApplicationConcurrentTime \ >> -XX:+PrintGCApplicationStoppedTime \ >> -XX:+PrintTenuringDistribution \ >> -XX:+CMSClassUnloadingEnabled \ >> -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ >> -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ >> >> -Djava.awt.headless=true >> >> ##### From the out-file (as of +PrintFlagsFinal): ##### >> ParallelGCThreads = 13 >> >> ##### The gc.log-excerpt: ##### >> Application time: 20,0617700 seconds >> 2011-12-22T12:02:03.289+0100: [GC Before GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 1183290963 >> Max Chunk Size: 1183290963 >> Number of Blocks: 1 >> Av. Block Size: 1183290963 >> Tree Height: 1 >> Before GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 0 >> Max Chunk Size: 0 >> Number of Blocks: 0 >> Tree Height: 0 >> [ParNew >> Desired survivor size 25480392 bytes, new threshold 1 (max 31) >> - age 1: 28260160 bytes, 28260160 total >> : 249216K->27648K(249216K), 6,1808130 secs] >> 20061765K->20056210K(29332480K)After GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 1155785202 >> Max Chunk Size: 1155785202 >> Number of Blocks: 1 >> Av. 
Block Size: 1155785202 >> Tree Height: 1 >> After GC: >> Statistics for BinaryTreeDictionary: >> ------------------------------------ >> Total Free Space: 0 >> Max Chunk Size: 0 >> Number of Blocks: 0 >> Tree Height: 0 >> , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] >> Total time for which application threads were stopped: 6,1818730 seconds >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/ea863255/attachment.html From vitalyd at gmail.com Mon Jan 9 21:43:36 2012 From: vitalyd at gmail.com (Vitaly Davidovich) Date: Tue, 10 Jan 2012 00:43:36 -0500 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> <4F0B4454.2010206@java4.info> Message-ID: Apparently pgpgin/pgpgout may not be that accurate to determine swap file usage: http://help.lockergnome.com/linux/pgpgin-pgpgout-measure--ftopict506279.html May need to use vmstat and look at si/so instead. On Jan 10, 2012 12:24 AM, "Chi Ho Kwok" wrote: > Hi Florian, > > Uh, you might want to try sar -r as well, that reports memory usage (and > man sar for other reporting options, and -f /var/log/sysstat/saXX where xx > is the day for older data is useful as well). Page in / out means reading > or writing to the swap file, usual cause here is one or more huge > background task / cron jobs taking up too much memory forcing other things > to swap out. You can try reducing the size of the heap and see if it helps > if you're just a little bit short, but otherwise I don't think you can > solve this with just VM options. > > > Here's the relevant section from the manual: > > -B Report paging statistics. Some of the metrics below are >> available only with post 2.5 kernels. The following values are displayed: >> pgpgin/s >> Total number of kilobytes the system paged in from >> disk per second. Note: With old kernels (2.2.x) this value is a number of >> blocks >> per second (and not kilobytes). >> pgpgout/s >> Total number of kilobytes the system paged out to >> disk per second. Note: With old kernels (2.2.x) this value is a number of >> blocks >> per second (and not kilobytes). >> fault/s >> Number of page faults (major + minor) made by the >> system per second. This is not a count of page faults that generate I/O, >> because >> some page faults can be resolved without I/O. >> majflt/s >> Number of major faults the system has made per >> second, those which have required loading a memory page from disk. > > > I'm not sure what kernel you're on, but pgpgin / out being high is a bad > thing. Sar seems to report that all faults are minor, but that conflicts > with the first two columns. > > > Chi Ho Kwok > > On Mon, Jan 9, 2012 at 8:47 PM, Florian Binder wrote: > >> Yes! >> You are right! >> I have a lot of page faults when gc is taking so much time. >> >> For example (sar -B): >> 00:00:01 pgpgin/s pgpgout/s fault/s majflt/s >> 00:50:01 0,01 45,18 162,29 0,00 >> 01:00:01 0,02 46,58 170,45 0,00 >> 01:10:02 25313,71 27030,39 27464,37 0,02 >> 01:20:02 23456,85 25371,28 13621,92 0,01 >> 01:30:01 22778,76 22918,60 10136,71 0,03 >> 01:40:11 19020,44 22723,65 8617,42 0,15 >> 01:50:01 5,52 44,22 147,26 0,05 >> >> What is this meaning and how can I avoid it? 
>> >> >> Flo >> >> >> >> Am 09.01.2012 20:33, schrieb Chi Ho Kwok: >> >> Just making sure the obvious case is covered: is it just me or is 6s real >> > 3.5s user+sys with 13 threads just plain weird? That means there was 0.5 >> thread actually running on the average during that collection. >> >> Do a sar -B (requires package sysstat) and see if there were any major >> pagefaults (or indirectly via cacti and other monitoring tools via memory >> usage, load average etc, or even via cat /proc/vmstat and pgmajfault), I've >> seen those cause these kind of times during GC. >> >> >> Chi Ho Kwok >> >> On Mon, Jan 9, 2012 at 12:08 PM, Florian Binder wrote: >> >>> Hi everybody, >>> >>> I am using CMS (with ParNew) gc and have very long (> 6 seconds) young >>> gc pauses. >>> As you can see in the log below the old-gen-heap consists of one large >>> block, the new Size has 256m, it uses 13 worker threads and it has to >>> copy 27505761 words (~210mb) directly from eden to old gen. >>> I have seen that this problem occurs only after about one week of >>> uptime. Even thought we make a full (compacting) gc every night. >>> Since real-time > user-time I assume it might be a synchronization >>> problem. Can this be true? >>> >>> Do you have any Ideas how I can speed up this gcs? >>> >>> Please let me know, if you need more informations. >>> >>> Thank you, >>> Flo >>> >>> >>> ##### java -version ##### >>> java version "1.6.0_29" >>> Java(TM) SE Runtime Environment (build 1.6.0_29-b11) >>> Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode) >>> >>> ##### The startup parameters: ##### >>> -Xms28G -Xmx28G >>> -XX:+UseConcMarkSweepGC \ >>> -XX:CMSMaxAbortablePrecleanTime=10000 \ >>> -XX:SurvivorRatio=8 \ >>> -XX:TargetSurvivorRatio=90 \ >>> -XX:MaxTenuringThreshold=31 \ >>> -XX:CMSInitiatingOccupancyFraction=80 \ >>> -XX:NewSize=256M \ >>> >>> -verbose:gc \ >>> -XX:+PrintFlagsFinal \ >>> -XX:PrintFLSStatistics=1 \ >>> -XX:+PrintGCDetails \ >>> -XX:+PrintGCDateStamps \ >>> -XX:-TraceClassUnloading \ >>> -XX:+PrintGCApplicationConcurrentTime \ >>> -XX:+PrintGCApplicationStoppedTime \ >>> -XX:+PrintTenuringDistribution \ >>> -XX:+CMSClassUnloadingEnabled \ >>> -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 \ >>> -Dsun.rmi.dgc.client.gcInterval=9223372036854775807 \ >>> >>> -Djava.awt.headless=true >>> >>> ##### From the out-file (as of +PrintFlagsFinal): ##### >>> ParallelGCThreads = 13 >>> >>> ##### The gc.log-excerpt: ##### >>> Application time: 20,0617700 seconds >>> 2011-12-22T12:02:03.289+0100: [GC Before GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 1183290963 >>> Max Chunk Size: 1183290963 >>> Number of Blocks: 1 >>> Av. Block Size: 1183290963 >>> Tree Height: 1 >>> Before GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 0 >>> Max Chunk Size: 0 >>> Number of Blocks: 0 >>> Tree Height: 0 >>> [ParNew >>> Desired survivor size 25480392 bytes, new threshold 1 (max 31) >>> - age 1: 28260160 bytes, 28260160 total >>> : 249216K->27648K(249216K), 6,1808130 secs] >>> 20061765K->20056210K(29332480K)After GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 1155785202 >>> Max Chunk Size: 1155785202 >>> Number of Blocks: 1 >>> Av. 
Block Size: 1155785202 >>> Tree Height: 1 >>> After GC: >>> Statistics for BinaryTreeDictionary: >>> ------------------------------------ >>> Total Free Space: 0 >>> Max Chunk Size: 0 >>> Number of Blocks: 0 >>> Tree Height: 0 >>> , 6,1809440 secs] [Times: user=3,08 sys=0,51, real=6,18 secs] >>> Total time for which application threads were stopped: 6,1818730 seconds >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >> >> >> > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > >

From fancyerii at gmail.com Tue Jan 10 01:31:07 2012 From: fancyerii at gmail.com (Li Li) Date: Tue, 10 Jan 2012 17:31:07 +0800 Subject: MaxTenuringThreshold available in ParNewGC? Message-ID: Hi all, I have an application that generates many large objects and then discards them. I found that a full gc can bring heap usage down from 70% to 40%. I want to keep these objects in the young generation longer. I found -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. But I found a blog post saying that MaxTenuringThreshold is not used by ParNewGC. I use ParNewGC+CMS. I tried setting MaxTenuringThreshold=10, but it seems to make no difference.

From fancyerii at gmail.com Tue Jan 10 01:49:10 2012 From: fancyerii at gmail.com (Li Li) Date: Tue, 10 Jan 2012 17:49:10 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: By the way, is there any web page that lists all the JVM parameters and their default values? I am confused because they are spread across many documents and some are deprecated. On Tue, Jan 10, 2012 at 5:31 PM, Li Li wrote: > hi all > I have an application that generating many large objects and then > discard them. I found that full gc can free memory from 70% to 40%. > btw, is there any web page that list all JVM parameters and their default > values? > > I want to let this objects in young generation longer. I found > -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used in > ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it > seems no difference. >

From java at java4.info Tue Jan 10 02:23:26 2012 From: java at java4.info (Florian Binder) Date: Tue, 10 Jan 2012 11:23:26 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: <4F0C119E.7090600@java4.info> At http://cr.openjdk.java.net/~brutisso/7016112/webrev.02/src/share/vm/runtime/globals.hpp.html you will find the source code with most of the JVM parameters.
I know it is a webrev and not the newest file, but it contains most of the parameters with a short description ;-) Another way is to enable PrintFlagsFinal or PrintFlagsInitial, or just run: java -XX:+PrintFlagsFinal Flo

On 10.01.2012 10:49, Li Li wrote: > btw, is there any web page that list all the jvm parameters and their > default values? I am confused that they are distributed in many > documents and some are deprecated. > > On Tue, Jan 10, 2012 at 5:31 PM, Li Li > wrote: > > hi all > I have an application that generating many large objects and > then discard them. I found that full gc can free memory from 70% > to 40%. > btw, is there any web page that list all JVM parameters and their > default values? > > > I want to let this objects in young generation longer. I found > -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used > in ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, > but it seems no difference. > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From bengt.rutisson at oracle.com Tue Jan 10 05:17:18 2012 From: bengt.rutisson at oracle.com (Bengt Rutisson) Date: Tue, 10 Jan 2012 14:17:18 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <4F0C119E.7090600@java4.info> References: <4F0C119E.7090600@java4.info> Message-ID: <4F0C3A5E.5010300@oracle.com> On 2012-01-10 11:23, Florian Binder wrote: > At > http://cr.openjdk.java.net/~brutisso/7016112/webrev.02/src/share/vm/runtime/globals.hpp.html This is actually a link to one of my webrevs. It could be removed any day. A more stable way of finding the source for globals.hpp is to look in the mercurial repository for OpenJDK: http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/file/97c00e21fecb/src/share/vm/runtime/globals.hpp Bengt > > you have the source code with most jvm-parameters. > I know, it is a webrev and not the newest file, but there are the most > parameters with a short description ;-) > > An other way is to enable PrintFlagsFinal or PrintFlagsInitial or just > run: > java -XX:+PrintFlagsFinal > > Flo > > > On 10.01.2012 10:49, Li Li wrote: >> btw, is there any web page that list all the jvm parameters and their >> default values? I am confused that they are distributed in many >> documents and some are deprecated. >> >> On Tue, Jan 10, 2012 at 5:31 PM, Li Li > > wrote: >> >> hi all >> I have an application that generating many large objects and >> then discard them. I found that full gc can free memory from 70% >> to 40%. >> btw, is there any web page that list all JVM parameters and their >> default values? >> >> >> I want to let this objects in young generation longer. I found >> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >> But I found a blog that says MaxTenuringThreshold is not used >> in ParNewGC. >> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, >> but it seems no difference.
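(For example — assuming a reasonably recent HotSpot build; the exact set of flags and their defaults varies between versions — the current values can be dumped with:

java -XX:+PrintFlagsFinal -version

and a single flag's default can be picked out with a filter such as:

java -XX:+PrintFlagsFinal -version | grep MaxTenuringThreshold

The -version argument just gives the VM something to execute before it exits.)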
>> >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/8957224e/attachment.html From charlie.hunt at oracle.com Tue Jan 10 05:11:02 2012 From: charlie.hunt at oracle.com (charlie hunt) Date: Tue, 10 Jan 2012 07:11:02 -0600 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <4F0C119E.7090600@java4.info> References: <4F0C119E.7090600@java4.info> Message-ID: <4F0C38E6.80301@oracle.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/40491552/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5166 bytes Desc: S/MIME Cryptographic Signature Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120110/40491552/smime-0001.p7s From kinnari.darji at citi.com Tue Jan 10 08:24:01 2012 From: kinnari.darji at citi.com (Darji, Kinnari ) Date: Tue, 10 Jan 2012 11:24:01 -0500 Subject: ParNew garbage collection In-Reply-To: References: <21ED8E3420CDB647B88C7F80A7D64DAC06F04BC659@exnjmb89.nam.nsroot.net> <4F048FC6.30907@oracle.com> Message-ID: <21ED8E3420CDB647B88C7F80A7D64DAC06F0901CA6@exnjmb89.nam.nsroot.net> I am using jdk-1.6.0_16 version. Is there any known issue with this version? Also I have tried +PrintSafepointStatistics option in past but it prints out time which does not actual correlate to the time. It is hard to match it up the occurrence of event. I will try with TraceSafepointCleanup option. Thank you Kinnari From: hotspot-gc-use-bounces at openjdk.java.net [mailto:hotspot-gc-use-bounces at openjdk.java.net] On Behalf Of Srinivas Ramakrishna Sent: Wednesday, January 04, 2012 12:54 PM To: Jon Masamitsu Cc: hotspot-gc-use at openjdk.java.net Subject: Re: ParNew garbage collection May be also +PrintSafepointStatistics (and related parms) to drill down a bit further, although TraceSafepointCleanup would probably provide all of the info on a per-safepoint basis. There was an old issue wrt monitor deflation that was foixed a few releases ago, so Kinnari should check the version of the JVM she's running on as well.... (There are now a couple of flags related to monitor list handling policies i believe but i have no experience with them and do not have the code in front of me -- make sure to cc the runtime list if that turns out to be the issue again and you are already on a very recent version of the JVM.) -- ramki On Wed, Jan 4, 2012 at 9:43 AM, Jon Masamitsu > wrote: Try turning on TraceSafepointCleanupTime. I haven't used it myself. If that's not it, look in share/vm/runtime/globals.hpp for some other flag that traces safepoints. On 1/3/2012 1:36 PM, Darji, Kinnari wrote: Hello GC team, I have question regarding ParNew collection. As in logs below, the GC is taking only 0.04 sec but application was stopped for 1.71 sec. What could possibly cause this? Please advise. 
2012-01-03T14:37:04.975-0500: 30982.368: [GC 30982.368: [ParNew Desired survivor size 19628032 bytes, new threshold 4 (max 4) - age 1: 4466024 bytes, 4466024 total - age 2: 3568136 bytes, 8034160 total - age 3: 3559808 bytes, 11593968 total - age 4: 1737520 bytes, 13331488 total : 330991K->18683K(345024K), 0.0357400 secs] 5205809K->4894299K(26176064K), 0.0366240 secs] [Times: user=0.47 sys=0.04, real=0.04 secs] Total time for which application threads were stopped: 1.7197830 seconds Application time: 8.4134780 seconds Thank you Kinnari _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From ysr1729 at gmail.com Tue Jan 10 09:23:53 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Tue, 10 Jan 2012 09:23:53 -0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: I recommend Charlie's excellent book as well. To answer your question, yes, CMS + ParNew does use MaxTenuringThreshold (henceforth MTT), but in order to allow objects to age you also need sufficiently large survivor spaces to hold them for however long you wish, otherwise the adaptive tenuring policy will adjust the "current" tenuring threshold so as to prevent overflow. That may be what you saw. Check out the info printed by +PrintTenuringDistribution. -- ramki On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: > hi all > I have an application that generating many large objects and then > discard them. I found that full gc can free memory from 70% to 40%. > I want to let this objects in young generation longer. I found > -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used in > ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it > seems no difference. > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > >

From fancyerii at gmail.com Tue Jan 10 20:45:21 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 12:45:21 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: If the young generation is so small that it cannot hold the survivors and has to promote them to the old generation, and the JVM detects this, will it turn down the tenuring threshold? I set the tenuring threshold to 10 and found that full gc is less frequent and each full gc collects less garbage, so the parameter seems to have an effect. But I found that the load average is up and young gc takes much more time than before, and the response time has also increased. I guess that there are more objects in the young generation, so it has to do more young gcs. Although they are garbage, it is not a good idea to collect them too early; because ParNew stops the world, the response time increases.
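(As an illustration of the survivor-space point above — the sizes and values below are arbitrary examples, not recommendations: with settings along the lines of

-XX:NewSize=512M -XX:MaxNewSize=512M \
-XX:SurvivorRatio=6 \
-XX:MaxTenuringThreshold=10 \
-XX:+PrintTenuringDistribution

each survivor space would be 512M/8 = 64M, and the "Desired survivor size ... new threshold N (max 10)" lines that PrintTenuringDistribution writes to the GC log show whether the computed threshold actually stays near the configured maximum, or is clamped down to 1 because the survivor spaces overflow.)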
So I adjusted TenuringThreshold to 3 and there was no remarkable difference. Maybe I should use an object pool for my application because it uses many large temporary objects. Another question: when my application has been running for about 1-2 days, I find the response time increases. I guess it's a problem of the large young generation. In the beginning, the total memory usage is about 4-5GB and the young generation is 100-200MB, the rest being old generation. After running for days, the total memory usage is 8GB and the young generation is about 2GB (I set NewRatio so that young:old is 1:3). I am curious about the heap size adjustment. I found -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio; the default values are 40 and 70. The memory management white paper says that if the total heap free space is less than 40%, the heap is grown, and if the free space is larger than 70%, the heap is shrunk. But why do I see the young generation at 200MB while the old generation is 4GB? Is the sizing of the young generation related to the old generation? I read in http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ that the young generation should be less than 512MB; is that correct? On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: > I recommend Charlie's excellent book as well. > > To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold > (henceforth MTT), > but in order to allow objects to age you also need sufficiently large > survivor spaces to hold > them for however long you wish, otherwise the adaptive tenuring policy > will adjust the > "current" tenuring threshold so as to prevent overflow. That may be what > you saw. > Check out the info printed by +PrintTenuringThreshold. > > -- ramki > > On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: > >> hi all >> I have an application that generating many large objects and then >> discard them. I found that full gc can free memory from 70% to 40%. >> I want to let this objects in young generation longer. I found >> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >> But I found a blog that says MaxTenuringThreshold is not used in >> ParNewGC. >> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >> seems no difference. >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >

From fancyerii at gmail.com Tue Jan 10 23:47:29 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 15:47:29 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: 1. I don't understand why the tenuring threshold is calculated to be 1. 2. I don't set Xms, I just set Xmx=8g. 3. As for the memory leak, I will try to find it. On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: > Hi Li LI, > > I fear that you are off in the wrong direction. Resetting tenuring > thresholds in this case will never work because they are being calculated > to be 1. You're suggesting numbers greater than 1 and so 1 will always be > used which explains why you're not seeing a difference between runs. Having > a calculated tenuring threshold set to 1 implies that the memory pool is > too small. If the a memory pool is too small the only thing you can do to > fix that is to make it bigger.
In this case, your young generational space > (as I've indicated in previous postings) is too small. Also, the cost of a > young generational collection is dependent mostly upon the number of > surviving objects, not dead ones. Pooling temporary objects will only make > the problem worse. If I recall your flag settings, you've set netsize to a > fixed value. That setting will override the the new ratio setting. You also > set Xmx==Xms and that also override adaptive sizing. Also you are using CMS > which is inherently not size adaptable. > > Last point, and this is the biggest one. The numbers you're publishing > right now suggest that you have a memory leak. There is no way you're going > to stabilize the memory /gc behaviour with a memory leak. Things will get > progressively worse as you consume more and more heap. This is a blocking > issue to all tuning efforts. It is the first thing that must be dealt with. > > To find the leak; > Identify the leaking object useing VisualVM's memory profiler with > generational counts and collect allocation stack traces turned on. Sort the > profile by generational counts. When you've identified the leaking object, > the domain class with the highest and always increasing generational count. > take an allocation stack trace snapshot and a heap dump. The heap dump > should be loaded into a heap walker. Use the knowledge gained from > generational counts to inspect the linkages for the leaking object and then > use that information in the allocation stack traces to identify casual > execution paths for creation. After that, it's into application code to > determine the fix. > > Kind regards, > Kirk Pepperdine > > On 2012-01-11, at 5:45 AM, Li Li wrote: > > if the young generation is too small that it can't afford space for > survivors and it have to throw them to old generation. and jvm found this, > it will turn down TenuringThreshold ? > I set TenuringThreshold to 10. and found that the full gc is less > frequent and every full gc collect less garbage. it seems the parameter > have the effect. But I found the load average is up and young gc time is > much more than before. And the response time is also increased. > I guess that there are more objects in young generation. so it have to > do more young gc. although they are garbage, it's not a good idea to > collect them too early. because ParNewGC will stop the world, the response > time is increasing. > So I adjust TenuringThreshold to 3 and there are no remarkable > difference. > maybe I should use object pool for my application because it use many > large temporary objects. > Another question, when my application runs for about 1-2 days. I found > the response time increases. I guess it's the problem of large young > generation. > in the beginning, the total memory usage is about 4-5GB and young > generation is 100-200MB, the rest is old generation. > After running for days, the total memory usage is 8GB and young > generation is about 2GB(I set new Ration 1:3) > I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation > and ?XX:MaxHeapFreeRation > the default value is 40 and 70. the memory manage white paper says if > the total heap free space is less than 40%, it will increase heap. if the > free space is larger than 70%, it will decrease heap size. > But why I see the young generation is 200mb while old is 4gb. does the > adjustment of young related to old generation? 
> I read in > http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young > generation should be less than 512MB, is it correct? > > > > On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: > >> I recommend Charlie's excellent book as well. >> >> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold >> (henceforth MTT), >> but in order to allow objects to age you also need sufficiently large >> survivor spaces to hold >> them for however long you wish, otherwise the adaptive tenuring policy >> will adjust the >> "current" tenuring threshold so as to prevent overflow. That may be what >> you saw. >> Check out the info printed by +PrintTenuringThreshold. >> >> -- ramki >> >> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: >> >>> hi all >>> I have an application that generating many large objects and then >>> discard them. I found that full gc can free memory from 70% to 40%. >>> I want to let this objects in young generation longer. I found >>> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >>> But I found a blog that says MaxTenuringThreshold is not used in >>> ParNewGC. >>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >>> seems no difference. >>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> > >

From fancyerii at gmail.com Wed Jan 11 00:06:48 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 16:06:48 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: I understand the first one. As for Xmx: when it reaches the maximum of 8GB, the young generation is indeed 1.8G and Eden:s0:s1 = 8:1:1. That's correct. But when I have just restarted it and it has run for only a few minutes, the old generation is 4GB while the young generation is only 200-300MB. I don't think there is a memory leak, because it has been running for more than a month without an OOM. My application uses Lucene+Solr to provide a search service, which needs a lot of memory. On Wed, Jan 11, 2012 at 3:55 PM, Kirk Pepperdine wrote: > > On 2012-01-11, at 8:47 AM, Li Li wrote: > > 1. I don't understand why tenuring thresholds are > calculated to be 1 > > > because the number of expected survivors exceeds the size of the survivor > space > > 2. I don't set Xms, I just set Xmx=8g > > > with a new ratio of 3.. you should have 2 gigs of young gen meaning a .2 > gigs for each survivor space and 1.6 for young gen. Do you have a GC log > you can use to confirm these values? If not try visualvm and this plugin > should give you a clear view (www.java.net/projects/memorypoolview). > > > 3. as for memory leak, I will try to find it. > > On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: > >> Hi Li LI, >> >> I fear that you are off in the wrong direction. Resetting tenuring >> thresholds in this case will never work because they are being calculated >> to be 1. You're suggesting numbers greater than 1 and so 1 will always be >> used which explains why you're not seeing a difference between runs.
Having >> a calculated tenuring threshold set to 1 implies that the memory pool is >> too small. If the a memory pool is too small the only thing you can do to >> fix that is to make it bigger. In this case, your young generational space >> (as I've indicated in previous postings) is too small. Also, the cost of a >> young generational collection is dependent mostly upon the number of >> surviving objects, not dead ones. Pooling temporary objects will only make >> the problem worse. If I recall your flag settings, you've set netsize to a >> fixed value. That setting will override the the new ratio setting. You also >> set Xmx==Xms and that also override adaptive sizing. Also you are using CMS >> which is inherently not size adaptable. >> >> Last point, and this is the biggest one. The numbers you're publishing >> right now suggest that you have a memory leak. There is no way you're going >> to stabilize the memory /gc behaviour with a memory leak. Things will get >> progressively worse as you consume more and more heap. This is a blocking >> issue to all tuning efforts. It is the first thing that must be dealt with. >> >> To find the leak; >> Identify the leaking object useing VisualVM's memory profiler with >> generational counts and collect allocation stack traces turned on. Sort the >> profile by generational counts. When you've identified the leaking object, >> the domain class with the highest and always increasing generational count. >> take an allocation stack trace snapshot and a heap dump. The heap dump >> should be loaded into a heap walker. Use the knowledge gained from >> generational counts to inspect the linkages for the leaking object and then >> use that information in the allocation stack traces to identify casual >> execution paths for creation. After that, it's into application code to >> determine the fix. >> >> Kind regards, >> Kirk Pepperdine >> >> On 2012-01-11, at 5:45 AM, Li Li wrote: >> >> if the young generation is too small that it can't afford space for >> survivors and it have to throw them to old generation. and jvm found this, >> it will turn down TenuringThreshold ? >> I set TenuringThreshold to 10. and found that the full gc is less >> frequent and every full gc collect less garbage. it seems the parameter >> have the effect. But I found the load average is up and young gc time is >> much more than before. And the response time is also increased. >> I guess that there are more objects in young generation. so it have to >> do more young gc. although they are garbage, it's not a good idea to >> collect them too early. because ParNewGC will stop the world, the response >> time is increasing. >> So I adjust TenuringThreshold to 3 and there are no remarkable >> difference. >> maybe I should use object pool for my application because it use many >> large temporary objects. >> Another question, when my application runs for about 1-2 days. I found >> the response time increases. I guess it's the problem of large young >> generation. >> in the beginning, the total memory usage is about 4-5GB and young >> generation is 100-200MB, the rest is old generation. >> After running for days, the total memory usage is 8GB and young >> generation is about 2GB(I set new Ration 1:3) >> I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation >> and ?XX:MaxHeapFreeRation >> the default value is 40 and 70. the memory manage white paper says if >> the total heap free space is less than 40%, it will increase heap. 
if the >> free space is larger than 70%, it will decrease heap size. >> But why I see the young generation is 200mb while old is 4gb. does the >> adjustment of young related to old generation? >> I read in >> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young >> generation should be less than 512MB, is it correct? >> >> >> >> On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: >> >>> I recommend Charlie's excellent book as well. >>> >>> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold >>> (henceforth MTT), >>> but in order to allow objects to age you also need sufficiently large >>> survivor spaces to hold >>> them for however long you wish, otherwise the adaptive tenuring policy >>> will adjust the >>> "current" tenuring threshold so as to prevent overflow. That may be what >>> you saw. >>> Check out the info printed by +PrintTenuringThreshold. >>> >>> -- ramki >>> >>> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: >>> >>>> hi all >>>> I have an application that generating many large objects and then >>>> discard them. I found that full gc can free memory from 70% to 40%. >>>> I want to let this objects in young generation longer. I found >>>> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >>>> But I found a blog that says MaxTenuringThreshold is not used in >>>> ParNewGC. >>>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >>>> seems no difference. >>>> >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> >>>> >>> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/243309d6/attachment-0001.html From ysr1729 at gmail.com Wed Jan 11 01:00:59 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 11 Jan 2012 01:00:59 -0800 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: <4F0ACAAC.8020103@java4.info> References: <4F0ACAAC.8020103@java4.info> Message-ID: On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder wrote: > ... > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > Together with your and Chi-Ho's conclusion that this is possibly related to paging, a question to ponder is why this happens only after a week. Since your process' heap size is presumably fixed and you have seen multiple full GC's (from which i assume that your heap's pages have all been touched), have you checked to see if the size of either this process (i.e. its native size) or of another process on the machine has grown during the week so that you start swapping? I also find it interesting that you state that whenever you see this problem there's always a single block in the old gen, and that the problem seems to go away when there are more than one block in the old gen. That would seem to throw out the paging theory, and point the finger of suspicion to some kind of bottleneck in the allocation out of a large block. 
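(A rough way to check the swapping part, assuming a Linux machine with the sysstat package installed, as discussed earlier in the thread: run

vmstat 5

while the slow scavenges occur and watch the si/so columns, or look at the swapping counters with

sar -W

Non-zero swap-in/swap-out activity around the long pauses would support the paging theory; all zeroes would point back at the allocator.)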
You also state that you do a compacting collection every night, but the bad behaviour sets in only after a week. So let me ask you if you see that the slow scavenge happens to be the first scavenge after a full gc, or does the condition persist for a long time and is independent if whether a full gc has happened recently? Try turning on -XX:+PrintOldPLAB to see if it sheds any light... -- ramki -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/0d137518/attachment.html From fancyerii at gmail.com Wed Jan 11 01:24:02 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 17:24:02 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: the log is too large to post here I just post some lines here. I grep the lines that gc time is larger than 100ms. the question is: at the beginning, young generation is about 50M. but after running a while, the memory is growing to 1.8GB. 1.75GB is Eden and 0.2G is s0 and s1. e.g. [GC [ParNew: 1843200K->204800K(1843200K), 0.2584570 secs] it is clear that the eden is 1843200K(1.75G), s0 is 0.2G. before young gc, eden are all used. after gc, s1 is all used(other live object are moved to old generation) 2012-01-10T18:26:45.992+0800: [GC [ParNew: 58732K->6528K(59072K), 0.1234300 secs] 1391982K->1375194K(1707564K), 0.1234900 secs] [Times: user=1.44 sys=0.02, real=0.12 secs] 2012-01-10T18:26:47.185+0800: [GC [ParNew: 59072K->6528K(59072K), 0.1335480 secs] 1507767K->1490151K(2340184K), 0.1336020 secs] [Times: user=1.60 sys=0.01, real=0.13 secs] 2012-01-10T18:26:56.605+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0992650 secs] 1523647K->1509678K(2522312K), 0.0993220 secs] [Times: user=1.22 sys=0.01, real=0.10 secs] 2012-01-10T18:26:57.395+0800: [GC [ParNew: 52998K->6528K(59072K), 0.1948650 secs] 1556149K->1544918K(2522312K), 0.1949120 secs] [Times: user=2.46 sys=0.01, real=0.19 secs] 2012-01-10T18:27:05.072+0800: [GC [ParNew: 38463K->6528K(59072K), 0.1571700 secs] 2449032K->2447103K(2864820K), 0.1572150 secs] [Times: user=1.98 sys=0.02, real=0.16 secs] 2012-01-10T18:27:06.220+0800: [GC [ParNew: 59072K->6528K(59072K), 0.1641610 secs] 2499647K->2483866K(2864820K), 0.1642060 secs] [Times: user=2.07 sys=0.01, real=0.17 secs] 2012-01-10T22:24:08.939+0800: [GC [ParNew: 1826901K->204800K(1843200K), 0.1418510 secs] 3923985K->2352398K(7987200K), 0.1420700 secs] [Times: user=1.59 sys=0.05, real=0.14 secs] 2012-01-10T22:24:09.343+0800: [GC [ParNew: 1843200K->175652K(1843200K), 0.1994980 secs] 3990798K->2536312K(7987200K), 0.1996880 secs] [Times: user=1.98 sys=0.02, real=0.20 secs] 2012-01-10T22:24:10.049+0800: [GC [ParNew: 1814052K->151709K(1843200K), 0.1409050 secs] 4174712K->2618929K(7987200K), 0.1410940 secs] [Times: user=1.51 sys=0.00, real=0.14 secs] 2012-01-10T22:24:11.015+0800: [GC [ParNew: 1843200K->204800K(1843200K), 0.2584570 secs] 4311783K->2831783K(7987200K), 0.2586440 secs] [Times: user=2.83 sys=0.00, real=0.26 secs] 2012-01-10T22:24:11.543+0800: [GC [ParNew: 1843200K->188261K(1843200K), 0.2356920 secs] 4470183K->3028255K(7987200K), 0.2358800 secs] [Times: user=2.41 sys=0.01, real=0.24 secs] On Wed, Jan 11, 2012 at 4:24 PM, Kirk Pepperdine wrote: > > On 2012-01-11, at 9:06 AM, Li Li wrote: > > I understand the first one. > as for Xmx, when it reach the maxium 8GB, the young generation is in deed > 1.8G and Eden:s0:s1=8:1:1. That's correct. > but when I restart it for a few minutes. 
old is 4GB while young is > 200-300MB > > > Right, ratios are adaptive and if you're using CMS, will require a full GC > to occur before they will adapt. Size will start off small and then get > bigger as needed. > > I don't think there is memory leak because it has running for more than a > month without OOV. > My application is using lucene+solr to provide search service which need > large memory. > > > Well, if memory use stabilizes than you don't have a leak. But I'd need to > see a GC log to give you better advice. All I can say is that the more > switches you touch the more you've got to understand about how things work > in order to make effective changes. I generally start with minimal switch > settings and then adjust as needed. Starting with a ratio is better than > starting with a fixed value. If the ratio isn't working for you then moved > to a fixed size. But use the data in the gc log to tell you how to proceed. > > Also, if your application is swapping during GC you will increase the > duration of the collection. You need to monitor system level activity as > part of the investigation. > > Regards, > Kirk > > > On Wed, Jan 11, 2012 at 3:55 PM, Kirk Pepperdine < > kirk.pepperdine at gmail.com> wrote: > >> >> On 2012-01-11, at 8:47 AM, Li Li wrote: >> >> 1. I don't understand why tenuring thresholds are >> calculated to be 1 >> >> >> because the number of expected survivors exceeds the size of the survivor >> space >> >> 2. I don't set Xms, I just set Xmx=8g >> >> >> with a new ratio of 3.. you should have 2 gigs of young gen meaning a .2 >> gigs for each survivor space and 1.6 for young gen. Do you have a GC log >> you can use to confirm these values? If not try visualvm and this plugin >> should give you a clear view (www.java.net/projects/memorypoolview). >> >> >> 3. as for memory leak, I will try to find it. >> >> On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: >> >>> Hi Li LI, >>> >>> I fear that you are off in the wrong direction. Resetting tenuring >>> thresholds in this case will never work because they are being calculated >>> to be 1. You're suggesting numbers greater than 1 and so 1 will always be >>> used which explains why you're not seeing a difference between runs. Having >>> a calculated tenuring threshold set to 1 implies that the memory pool is >>> too small. If the a memory pool is too small the only thing you can do to >>> fix that is to make it bigger. In this case, your young generational space >>> (as I've indicated in previous postings) is too small. Also, the cost of a >>> young generational collection is dependent mostly upon the number of >>> surviving objects, not dead ones. Pooling temporary objects will only make >>> the problem worse. If I recall your flag settings, you've set netsize to a >>> fixed value. That setting will override the the new ratio setting. You also >>> set Xmx==Xms and that also override adaptive sizing. Also you are using CMS >>> which is inherently not size adaptable. >>> >>> Last point, and this is the biggest one. The numbers you're publishing >>> right now suggest that you have a memory leak. There is no way you're going >>> to stabilize the memory /gc behaviour with a memory leak. Things will get >>> progressively worse as you consume more and more heap. This is a blocking >>> issue to all tuning efforts. It is the first thing that must be dealt with. 
>>> >>> To find the leak; >>> Identify the leaking object useing VisualVM's memory profiler with >>> generational counts and collect allocation stack traces turned on. Sort the >>> profile by generational counts. When you've identified the leaking object, >>> the domain class with the highest and always increasing generational count. >>> take an allocation stack trace snapshot and a heap dump. The heap dump >>> should be loaded into a heap walker. Use the knowledge gained from >>> generational counts to inspect the linkages for the leaking object and then >>> use that information in the allocation stack traces to identify casual >>> execution paths for creation. After that, it's into application code to >>> determine the fix. >>> >>> Kind regards, >>> Kirk Pepperdine >>> >>> On 2012-01-11, at 5:45 AM, Li Li wrote: >>> >>> if the young generation is too small that it can't afford space for >>> survivors and it have to throw them to old generation. and jvm found this, >>> it will turn down TenuringThreshold ? >>> I set TenuringThreshold to 10. and found that the full gc is less >>> frequent and every full gc collect less garbage. it seems the parameter >>> have the effect. But I found the load average is up and young gc time is >>> much more than before. And the response time is also increased. >>> I guess that there are more objects in young generation. so it have >>> to do more young gc. although they are garbage, it's not a good idea to >>> collect them too early. because ParNewGC will stop the world, the response >>> time is increasing. >>> So I adjust TenuringThreshold to 3 and there are no remarkable >>> difference. >>> maybe I should use object pool for my application because it use many >>> large temporary objects. >>> Another question, when my application runs for about 1-2 days. I >>> found the response time increases. I guess it's the problem of large young >>> generation. >>> in the beginning, the total memory usage is about 4-5GB and young >>> generation is 100-200MB, the rest is old generation. >>> After running for days, the total memory usage is 8GB and young >>> generation is about 2GB(I set new Ration 1:3) >>> I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation >>> and ?XX:MaxHeapFreeRation >>> the default value is 40 and 70. the memory manage white paper says if >>> the total heap free space is less than 40%, it will increase heap. if the >>> free space is larger than 70%, it will decrease heap size. >>> But why I see the young generation is 200mb while old is 4gb. does >>> the adjustment of young related to old generation? >>> I read in >>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young >>> generation should be less than 512MB, is it correct? >>> >>> >>> >>> On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna >> > wrote: >>> >>>> I recommend Charlie's excellent book as well. >>>> >>>> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold >>>> (henceforth MTT), >>>> but in order to allow objects to age you also need sufficiently large >>>> survivor spaces to hold >>>> them for however long you wish, otherwise the adaptive tenuring policy >>>> will adjust the >>>> "current" tenuring threshold so as to prevent overflow. That may be >>>> what you saw. >>>> Check out the info printed by +PrintTenuringThreshold. 
>>>> >>>> -- ramki >>>> >>>> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: >>>> >>>>> hi all >>>>> I have an application that generating many large objects and then >>>>> discard them. I found that full gc can free memory from 70% to 40%. >>>>> I want to let this objects in young generation longer. I found >>>>> -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. >>>>> But I found a blog that says MaxTenuringThreshold is not used in >>>>> ParNewGC. >>>>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it >>>>> seems no difference. >>>>> >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> >>>>> >>>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> >>> >> >> > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/325a0052/attachment-0001.html From fancyerii at gmail.com Wed Jan 11 01:32:30 2012 From: fancyerii at gmail.com (Li Li) Date: Wed, 11 Jan 2012 17:32:30 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB What's the logic behind this? 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/3777d402/attachment.html From java at java4.info Wed Jan 11 01:45:28 2012 From: java at java4.info (Florian Binder) Date: Wed, 11 Jan 2012 10:45:28 +0100 Subject: Very long young gc pause (ParNew with CMS) In-Reply-To: References: <4F0ACAAC.8020103@java4.info> Message-ID: <4F0D5A38.1090906@java4.info> I do not know why it has worked for a week. Maybe it is because this was the xmas week ;-) In the night there are a lot of disk operations (2 TB of data is written). 
Therefore the operating system caches a lot of files and tries to free memory for this, so unused pages are moved to swap space. I assume heap fragmentation avoids swapping, since more pages are touched during the application is running. After a compacting gc there is one large (free) block which is not touched until young gc copies the objects from eden space. This will yield the operating system to move the pages of this one free block to swap and at every young gc it has to read it from swap. After a CMS collection the following young gcs are much faster because the gaps in the heap are not swapped. Yesterday, we have turned off the swap on this machine and now all young gcs take less than 200ms (instead of 6s) :-) Thanks againt to Chi Ho Kwok for giving the key hint :-) Flo Am 11.01.2012 10:00, schrieb Srinivas Ramakrishna: > > > On Mon, Jan 9, 2012 at 3:08 AM, Florian Binder > wrote: > > ... > I have seen that this problem occurs only after about one week of > uptime. Even thought we make a full (compacting) gc every night. > Since real-time > user-time I assume it might be a synchronization > problem. Can this be true? > > > Together with your and Chi-Ho's conclusion that this is possibly > related to paging, > a question to ponder is why this happens only after a week. Since your > process' > heap size is presumably fixed and you have seen multiple full GC's > (from which > i assume that your heap's pages have all been touched), have you > checked to > see if the size of either this process (i.e. its native size) or of > another process > on the machine has grown during the week so that you start swapping? > > I also find it interesting that you state that whenever you see this > problem > there's always a single block in the old gen, and that the problem > seems to go > away when there are more than one block in the old gen. That would seem > to throw out the paging theory, and point the finger of suspicion to > some kind > of bottleneck in the allocation out of a large block. You also state > that you > do a compacting collection every night, but the bad behaviour sets in only > after a week. > > So let me ask you if you see that the slow scavenge happens to be the > first > scavenge after a full gc, or does the condition persist for a long > time and > is independent if whether a full gc has happened recently? > > Try turning on -XX:+PrintOldPLAB to see if it sheds any light... > > -- ramki -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/19dc97a7/attachment.html From taras.tielkes at gmail.com Wed Jan 11 03:00:22 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Wed, 11 Jan 2012 12:00:22 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F06A270.3010701@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> Message-ID: Hi Jon, We've added the -XX:+PrintPromotionFailure flag to our production application yesterday. The application is running on 4 (homogenous) nodes. 
In the gc logs of 3 out of 4 nodes, I've found a promotion failure event during ParNew: node-002 ------- 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: 357592K->23382K(368640K), 0.0298150 secs] 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 sys=0.01, real=0.03 secs] 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: 351062K->39795K(368640K), 0.0401170 secs] 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 sys=0.00, real=0.04 secs] 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: promotion failure size = 4281460) (promotion failed): 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 secs] [Times: user=5.10 sys=0.00, real=4.84 secs] 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] node-003 ------- 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: 346950K->21342K(368640K), 0.0333090 secs] 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 sys=0.00, real=0.03 secs] 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: 345070K->32211K(368640K), 0.0369260 secs] 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: promotion failure size = 1266955) (promotion failed): 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 secs] [Times: user=5.03 sys=0.00, real=4.71 secs] 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] node-004 ------- 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: 358429K->40960K(368640K), 0.0629910 secs] 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 sys=0.02, real=0.06 secs] 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: 368640K->40960K(368640K), 0.0819780 secs] 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 sys=0.00, real=0.08 secs] 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: promotion failure size = 2788662) (promotion failed): 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: 3310044K->330922K(4833280K), 4.5104170 secs] 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), 0.0391790 secs] [Times: user=0.18 sys=0.00, 
real=0.04 secs] On a fourth node, I've found a different event: promotion failure during CMS, with a much smaller size: node-001 ------- 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: 354039K->40960K(368640K), 0.0667340 secs] 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 sys=0.01, real=0.06 secs] 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: 368640K->40960K(368640K), 0.2586390 secs] 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 sys=0.13, real=0.26 secs] 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: user=0.07 sys=0.00, real=0.07 secs] 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-abortable-preclean-start] 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: 368640K->40960K(368640K), 0.1214420 secs] 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 sys=0.05, real=0.12 secs] 2012-01-10T18:30:12.785+0100: 48434.078: [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: user=10.72 sys=0.48, real=2.70 secs] 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: [ParNew (promotion failure size = 1026) (promotion failed): 206521K->206521K(368640K), 0.1667280 secs] 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 sys=0.04, real=0.17 secs] 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 secs]48434.474: [scrub symbol & string tables, 0.0088370 secs] [1 CMS-remark: 3489675K(4833280K)] 36961 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 secs] 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 secs] 2873988K->334385K(5201920K), [CMS Perm : 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 sys=0.00, real=8.61 secs] 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] I assume that the large sizes for the promotion failures during ParNew are confirming that eliminating large array allocations might help here. Do you agree? I'm not sure what to make of the concurrent mode failure. Thanks in advance for any suggestions, Taras On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu wrote: > > > On 1/5/2012 3:32 PM, Taras Tielkes wrote: >> Hi Jon, >> >> We've enabled the PrintPromotionFailure flag in our preprod >> environment, but so far, no failures yet. 
>> We know that the load we generate there is not representative. But >> perhaps we'll catch something, given enough patience. >> >> The flag will also be enabled in our production environment next week >> - so one way or the other, we'll get more diagnostic data soon. >> I'll also do some allocation profiling of the application in isolation >> - I know that there is abusive large byte[] and char[] allocation in >> there. >> >> I've got two questions for now: >> >> 1) From googling around on the output to expect >> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >> I see that -XX:+PrintPromotionFailure will generate output like this: >> ------- >> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >> failed): 135865K->134943K(138240K), 0.1433555 secs] >> ------- >> In that example line, what does the "0" stand for? > > It's the index of the GC worker thread ?that experienced the promotion > failure. > >> 2) Below is a snippet of (real) gc log from our production application: >> ------- >> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >> 345951K->40960K(368640K), 0.0676780 secs] >> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >> sys=0.01, real=0.06 secs] >> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >> 368640K->40959K(368640K), 0.0618880 secs] >> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >> sys=0.00, real=0.06 secs] >> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >> user=0.04 sys=0.00, real=0.04 secs] >> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] >> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >> 2011-12-30T22:42:24.099+0100: 2136593.001: >> [CMS-concurrent-abortable-preclean-start] >> ? CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >> [Times: user=5.70 sys=0.23, real=5.23 secs] >> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >> 3432839K->3423755K(5201920 >> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >> refs processing, 0.0034280 secs]2136605.804: [class unloading, >> 0.0289480 secs]2136605.833: [scrub symbol& ?string tables, 0.0093940 >> secs] [1 CMS-remark: 3318289K(4833280K >> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >> real=7.61 secs] >> 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] >> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >> ? 
(concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >> secs] 3491471K->291853K(5201920K), [CMS Perm : >> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >> sys=0.00, real=10.29 secs] >> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >> ------- >> >> In this case I don't know how to interpret the output. >> a) There's a promotion failure that took 7.49 secs > This is the time it took to attempt the minor collection (ParNew) and to > do recovery > from the failure. > >> b) There's a full GC that took 14.08 secs >> c) There's a concurrent mode failure that took 10.29 secs > > Not sure about b) and c) because the output is mixed up with the > concurrent-sweep > output but ?I think the "concurrent mode failure" message is part of the > "Full GC" > message. ?My guess is that the 10.29 is the time for the Full GC and the > 14.08 > maybe is part of the concurrent-sweep message. ?Really hard to be sure. > > Jon >> How are these events, and their (real) times related to each other? >> >> Thanks in advance, >> Taras >> >> On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu ?wrote: >>> Taras, >>> >>> PrintPromotionFailure seems like it would go a long >>> way to identify the root of your promotion failures (or >>> at least eliminating some possible causes). ? ?I think it >>> would help focus the discussion if you could send >>> the result of that experiment early. >>> >>> Jon >>> >>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>> Hi, >>>> >>>> We're running an application with the CMS/ParNew collectors that is >>>> experiencing occasional promotion failures. >>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>> I've listed the specific JVM options used below (a). >>>> >>>> The application is deployed across a handful of machines, and the >>>> promotion failures are fairly uniform across those. >>>> >>>> The first kind of failure we observe is a promotion failure during >>>> ParNew collection, I've included a snipped from the gc log below (b). >>>> The second kind of failure is a concurrrent mode failure (perhaps >>>> triggered by the same cause), see (c) below. >>>> The frequency (after running for a some weeks) is approximately once >>>> per day. This is bearable, but obviously we'd like to improve on this. >>>> >>>> Apart from high-volume request handling (which allocates a lot of >>>> small objects), the application also runs a few dozen background >>>> threads that download and process XML documents, typically in the 5-30 >>>> MB range. >>>> A known deficiency in the existing code is that the XML content is >>>> copied twice before processing (once to a byte[], and later again to a >>>> String/char[]). >>>> Given that a 30 MB XML stream will result in a 60 MB >>>> java.lang.String/char[], my suspicion is that these big array >>>> allocations are causing us to run into the CMS fragmentation issue. >>>> >>>> My questions are: >>>> 1) Does the data from the GC logs provide sufficient evidence to >>>> conclude that CMS fragmentation is the cause of the promotion failure? >>>> 2) If not, what's the next step of investigating the cause? 
>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >>>> feeling for the size of the objects that fail promotion. >>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>> case? >>>> >>>> Thanks in advance, >>>> Taras >>>> >>>> a) Current JVM options: >>>> -------------------------------- >>>> -server >>>> -Xms5g >>>> -Xmx5g >>>> -Xmn400m >>>> -XX:PermSize=256m >>>> -XX:MaxPermSize=256m >>>> -XX:+PrintGCTimeStamps >>>> -verbose:gc >>>> -XX:+PrintGCDateStamps >>>> -XX:+PrintGCDetails >>>> -XX:SurvivorRatio=8 >>>> -XX:+UseConcMarkSweepGC >>>> -XX:+UseParNewGC >>>> -XX:+DisableExplicitGC >>>> -XX:+UseCMSInitiatingOccupancyOnly >>>> -XX:+CMSClassUnloadingEnabled >>>> -XX:+CMSScavengeBeforeRemark >>>> -XX:CMSInitiatingOccupancyFraction=68 >>>> -Xloggc:gc.log >>>> -------------------------------- >>>> >>>> b) Promotion failure during ParNew >>>> -------------------------------- >>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>> 368640K->40959K(368640K), 0.0693460 secs] >>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>> sys=0.01, real=0.07 secs] >>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>> 368639K->31321K(368640K), 0.0511400 secs] >>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>> sys=0.00, real=0.05 secs] >>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>> 359001K->18694K(368640K), 0.0272970 secs] >>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>> sys=0.00, real=0.03 secs] >>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>> 3505808K->434291K >>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>> -------------------------------- >>>> >>>> c) Promotion failure during CMS >>>> -------------------------------- >>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>> 357228K->40960K(368640K), 0.0525110 secs] >>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>> sys=0.00, real=0.05 secs] >>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>> 366075K->37119K(368640K), 0.0479780 secs] >>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>> sys=0.01, real=0.05 secs] >>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>> 364792K->40960K(368640K), 0.0421740 secs] >>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>> sys=0.00, real=0.04 secs] >>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] 
[Times: >>>> user=0.02 sys=0.00, real=0.03 secs] >>>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>> 368640K->40960K(368640K), 0.0836690 secs] >>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>> sys=0.01, real=0.08 secs] >>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>> [CMS-concurrent-abortable-preclean-start] >>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>> user=6.68 sys=0.27, real=1.40 secs] >>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >>>> ? ?3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>> sys=2.58, real=9.88 secs] >>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>> secs]703031.419: [scrub symbol& ? ?string tables, 0.0094960 secs] [1 CMS >>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>> ? 
?(concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>> sys=0.00, real=10.96 secs] >>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>> -------------------------------- >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From kirk.pepperdine at gmail.com Tue Jan 10 23:55:34 2012 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Wed, 11 Jan 2012 08:55:34 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: On 2012-01-11, at 8:47 AM, Li Li wrote: > 1. I don't understand why tenuring thresholds are > calculated to be 1 because the number of expected survivors exceeds the size of the survivor space > 2. I don't set Xms, I just set Xmx=8g with a new ratio of 3.. you should have 2 gigs of young gen meaning a .2 gigs for each survivor space and 1.6 for young gen. Do you have a GC log you can use to confirm these values? If not try visualvm and this plugin should give you a clear view (www.java.net/projects/memorypoolview). > 3. as for memory leak, I will try to find it. > > On Wed, Jan 11, 2012 at 3:18 PM, Kirk Pepperdine wrote: > Hi Li LI, > > I fear that you are off in the wrong direction. Resetting tenuring thresholds in this case will never work because they are being calculated to be 1. You're suggesting numbers greater than 1 and so 1 will always be used which explains why you're not seeing a difference between runs. Having a calculated tenuring threshold set to 1 implies that the memory pool is too small. If the a memory pool is too small the only thing you can do to fix that is to make it bigger. In this case, your young generational space (as I've indicated in previous postings) is too small. Also, the cost of a young generational collection is dependent mostly upon the number of surviving objects, not dead ones. Pooling temporary objects will only make the problem worse. If I recall your flag settings, you've set netsize to a fixed value. That setting will override the the new ratio setting. You also set Xmx==Xms and that also override adaptive sizing. Also you are using CMS which is inherently not size adaptable. 
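The arithmetic behind those numbers, as a rough sketch (assuming -Xmx8g, -XX:NewRatio=3 and the usual ParNew default of -XX:SurvivorRatio=8):
-------
young    = heap / (NewRatio + 1)        = 8g / 4   = 2g
survivor = young / (SurvivorRatio + 2)  = 2g / 10  = 0.2g   (two survivor spaces)
eden     = young - 2 * survivor         = 2g - 0.4g = 1.6g
-------
Whether these are the sizes actually in use is best confirmed from a GC log with -XX:+PrintGCDetails or from the memory pool view mentioned above. The heap growing and shrinking asked about earlier in the thread is a separate knob: that is governed by -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio (note the spelling), which affect the committed heap size rather than the young/old split.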
> > Last point, and this is the biggest one. The numbers you're publishing right now suggest that you have a memory leak. There is no way you're going to stabilize the memory /gc behaviour with a memory leak. Things will get progressively worse as you consume more and more heap. This is a blocking issue to all tuning efforts. It is the first thing that must be dealt with. > > To find the leak; > Identify the leaking object useing VisualVM's memory profiler with generational counts and collect allocation stack traces turned on. Sort the profile by generational counts. When you've identified the leaking object, the domain class with the highest and always increasing generational count. take an allocation stack trace snapshot and a heap dump. The heap dump should be loaded into a heap walker. Use the knowledge gained from generational counts to inspect the linkages for the leaking object and then use that information in the allocation stack traces to identify casual execution paths for creation. After that, it's into application code to determine the fix. > > Kind regards, > Kirk Pepperdine > > On 2012-01-11, at 5:45 AM, Li Li wrote: > >> if the young generation is too small that it can't afford space for survivors and it have to throw them to old generation. and jvm found this, it will turn down TenuringThreshold ? >> I set TenuringThreshold to 10. and found that the full gc is less frequent and every full gc collect less garbage. it seems the parameter have the effect. But I found the load average is up and young gc time is much more than before. And the response time is also increased. >> I guess that there are more objects in young generation. so it have to do more young gc. although they are garbage, it's not a good idea to collect them too early. because ParNewGC will stop the world, the response time is increasing. >> So I adjust TenuringThreshold to 3 and there are no remarkable difference. >> maybe I should use object pool for my application because it use many large temporary objects. >> Another question, when my application runs for about 1-2 days. I found the response time increases. I guess it's the problem of large young generation. >> in the beginning, the total memory usage is about 4-5GB and young generation is 100-200MB, the rest is old generation. >> After running for days, the total memory usage is 8GB and young generation is about 2GB(I set new Ration 1:3) >> I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation and ?XX:MaxHeapFreeRation >> the default value is 40 and 70. the memory manage white paper says if the total heap free space is less than 40%, it will increase heap. if the free space is larger than 70%, it will decrease heap size. >> But why I see the young generation is 200mb while old is 4gb. does the adjustment of young related to old generation? >> I read in http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young generation should be less than 512MB, is it correct? >> >> >> >> On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: >> I recommend Charlie's excellent book as well. >> >> To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold (henceforth MTT), >> but in order to allow objects to age you also need sufficiently large survivor spaces to hold >> them for however long you wish, otherwise the adaptive tenuring policy will adjust the >> "current" tenuring threshold so as to prevent overflow. That may be what you saw. 
>> Check out the info printed by +PrintTenuringThreshold.
>>
>> -- ramki
>>
>> On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote:
>> hi all
>> I have an application that generating many large objects and then discard them. I found that full gc can free memory from 70% to 40%.
>> I want to let this objects in young generation longer. I found -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold.
>> But I found a blog that says MaxTenuringThreshold is not used in ParNewGC.
>> And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it seems no difference.
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From kirk.pepperdine at gmail.com Wed Jan 11 01:48:36 2012
From: kirk.pepperdine at gmail.com (Kirk Pepperdine)
Date: Wed, 11 Jan 2012 10:48:36 +0100
Subject: MaxTenuringThreshold available in ParNewGC?
In-Reply-To: 
References: 
Message-ID: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com>

CMS is not adaptive. To reconfigure the heap, for many reasons, you need a full GC to occur. The response to a concurrent mode failure is always a full GC, and that full GC gave the JVM the opportunity to resize the heap. If this behaviour isn't happening when it should, or is causing other problems, it's time to either set the young gen size directly with NewSize or switch to the parallel collector with the adaptive sizing policy turned on. The logic here is: if you want to avoid long pauses, use CMS; if CMS is giving you long pauses, then the parallel collector might be a better choice.

Regards,
Kirk

On 2012-01-11, at 10:32 AM, Li Li wrote:

> after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB
> What's the logic behind this?
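If the goal is to stop the young generation from swinging in size at a full GC, pinning it explicitly is the usual approach; a hedged sketch of the flags (the 2g figure is only a placeholder to show the syntax, not a recommendation):
-------
-Xmn2g        (shorthand for -XX:NewSize=2g -XX:MaxNewSize=2g)
-------
The alternative route mentioned above would be along the lines of -XX:+UseParallelGC -XX:+UseParallelOldGC, where -XX:+UseAdaptiveSizePolicy is on by default and the collector resizes the generations itself.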
> > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From kirk at kodewerk.com Tue Jan 10 23:18:04 2012 From: kirk at kodewerk.com (Kirk Pepperdine) Date: Wed, 11 Jan 2012 08:18:04 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: Message-ID: Hi Li LI, I fear that you are off in the wrong direction. Resetting tenuring thresholds in this case will never work because they are being calculated to be 1. You're suggesting numbers greater than 1 and so 1 will always be used which explains why you're not seeing a difference between runs. Having a calculated tenuring threshold set to 1 implies that the memory pool is too small. If the a memory pool is too small the only thing you can do to fix that is to make it bigger. In this case, your young generational space (as I've indicated in previous postings) is too small. Also, the cost of a young generational collection is dependent mostly upon the number of surviving objects, not dead ones. Pooling temporary objects will only make the problem worse. If I recall your flag settings, you've set netsize to a fixed value. That setting will override the the new ratio setting. You also set Xmx==Xms and that also override adaptive sizing. Also you are using CMS which is inherently not size adaptable. Last point, and this is the biggest one. The numbers you're publishing right now suggest that you have a memory leak. There is no way you're going to stabilize the memory /gc behaviour with a memory leak. Things will get progressively worse as you consume more and more heap. This is a blocking issue to all tuning efforts. It is the first thing that must be dealt with. To find the leak; Identify the leaking object useing VisualVM's memory profiler with generational counts and collect allocation stack traces turned on. Sort the profile by generational counts. When you've identified the leaking object, the domain class with the highest and always increasing generational count. take an allocation stack trace snapshot and a heap dump. The heap dump should be loaded into a heap walker. 
Use the knowledge gained from generational counts to inspect the linkages for the leaking object and then use that information in the allocation stack traces to identify casual execution paths for creation. After that, it's into application code to determine the fix. Kind regards, Kirk Pepperdine On 2012-01-11, at 5:45 AM, Li Li wrote: > if the young generation is too small that it can't afford space for survivors and it have to throw them to old generation. and jvm found this, it will turn down TenuringThreshold ? > I set TenuringThreshold to 10. and found that the full gc is less frequent and every full gc collect less garbage. it seems the parameter have the effect. But I found the load average is up and young gc time is much more than before. And the response time is also increased. > I guess that there are more objects in young generation. so it have to do more young gc. although they are garbage, it's not a good idea to collect them too early. because ParNewGC will stop the world, the response time is increasing. > So I adjust TenuringThreshold to 3 and there are no remarkable difference. > maybe I should use object pool for my application because it use many large temporary objects. > Another question, when my application runs for about 1-2 days. I found the response time increases. I guess it's the problem of large young generation. > in the beginning, the total memory usage is about 4-5GB and young generation is 100-200MB, the rest is old generation. > After running for days, the total memory usage is 8GB and young generation is about 2GB(I set new Ration 1:3) > I am curious about the heap size adjusting. I found ?XX:MinHeapFreeRation and ?XX:MaxHeapFreeRation > the default value is 40 and 70. the memory manage white paper says if the total heap free space is less than 40%, it will increase heap. if the free space is larger than 70%, it will decrease heap size. > But why I see the young generation is 200mb while old is 4gb. does the adjustment of young related to old generation? > I read in http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ young generation should be less than 512MB, is it correct? > > > > On Wed, Jan 11, 2012 at 1:23 AM, Srinivas Ramakrishna wrote: > I recommend Charlie's excellent book as well. > > To answer yr question, yes, CMS + Parew does use MaxTenuringThreshold (henceforth MTT), > but in order to allow objects to age you also need sufficiently large survivor spaces to hold > them for however long you wish, otherwise the adaptive tenuring policy will adjust the > "current" tenuring threshold so as to prevent overflow. That may be what you saw. > Check out the info printed by +PrintTenuringThreshold. > > -- ramki > > On Tue, Jan 10, 2012 at 1:31 AM, Li Li wrote: > hi all > I have an application that generating many large objects and then discard them. I found that full gc can free memory from 70% to 40%. > I want to let this objects in young generation longer. I found -XX:MaxTenuringThreshold and -XX:PretenureSizeThreshold. > But I found a blog that says MaxTenuringThreshold is not used in ParNewGC. > And I use ParNewGC+CMS. I tried to set MaxTenuringThreshold=10, but it seems no difference. 
> > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120111/20772300/attachment-0001.html From jon.masamitsu at oracle.com Wed Jan 11 08:54:28 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Wed, 11 Jan 2012 08:54:28 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> Message-ID: <4F0DBEC4.7040907@oracle.com> Taras, > I assume that the large sizes for the promotion failures during ParNew > are confirming that eliminating large array allocations might help > here. Do you agree? I agree that eliminating the large array allocation will help but you are still having promotion failures when the allocation size is small (I think it was 1026). That says that you are filling up the old (cms) generation faster than the GC can collect it. The large arrays are aggrevating the problem but not necessarily the cause. If these are still your heap sizes, > -Xms5g > -Xmx5g > -Xmn400m Start by increasing the young gen size as may already have been suggested. If you have a test setup where you can experiment, try doubling the young gen size to start. If you have not seen this, it might be helpful. http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a > I'm not sure what to make of the concurrent mode The concurrent mode failure is a consequence of the promotion failure. Once the promotion failure happens the concurrent mode failure is inevitable. Jon > . On 1/11/2012 3:00 AM, Taras Tielkes wrote: > Hi Jon, > > We've added the -XX:+PrintPromotionFailure flag to our production > application yesterday. > The application is running on 4 (homogenous) nodes. 
> > In the gc logs of 3 out of 4 nodes, I've found a promotion failure > event during ParNew: > > node-002 > ------- > 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: > 357592K->23382K(368640K), 0.0298150 secs] > 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 > sys=0.01, real=0.03 secs] > 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: > 351062K->39795K(368640K), 0.0401170 secs] > 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 > sys=0.00, real=0.04 secs] > 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: > promotion failure size = 4281460) (promotion failed): > 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: > 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K > ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 > secs] [Times: user=5.10 sys=0.00, real=4.84 secs] > 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: > 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), > 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] > 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: > 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), > 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] > > node-003 > ------- > 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: > 346950K->21342K(368640K), 0.0333090 secs] > 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 > sys=0.00, real=0.03 secs] > 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: > 345070K->32211K(368640K), 0.0369260 secs] > 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 > sys=0.00, real=0.04 secs] > 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: > promotion failure size = 1266955) (promotion failed): > 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: > 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 > 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 > secs] [Times: user=5.03 sys=0.00, real=4.71 secs] > 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: > 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), > 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] > 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: > 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), > 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] > > node-004 > ------- > 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: > 358429K->40960K(368640K), 0.0629910 secs] > 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 > sys=0.02, real=0.06 secs] > 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: > 368640K->40960K(368640K), 0.0819780 secs] > 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 > sys=0.00, real=0.08 secs] > 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: > promotion failure size = 2788662) (promotion failed): > 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: > 3310044K->330922K(4833280K), 4.5104170 secs] > 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], > 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] > 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: > 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), > 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] > 2012-01-10T18:59:25.183+0100: 50186.494: 
[GC 50186.494: [ParNew: > 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), > 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] > > On a fourth node, I've found a different event: promotion failure > during CMS, with a much smaller size: > > node-001 > ------- > 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: > 354039K->40960K(368640K), 0.0667340 secs] > 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 > sys=0.01, real=0.06 secs] > 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: > 368640K->40960K(368640K), 0.2586390 secs] > 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 > sys=0.13, real=0.26 secs] > 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: > 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: > user=0.07 sys=0.00, real=0.07 secs] > 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] > 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: > 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] > 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] > 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: > 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] > 2012-01-10T18:30:10.089+0100: 48431.382: > [CMS-concurrent-abortable-preclean-start] > 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: > 368640K->40960K(368640K), 0.1214420 secs] > 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 > sys=0.05, real=0.12 secs] > 2012-01-10T18:30:12.785+0100: 48434.078: > [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: > user=10.72 sys=0.48, real=2.70 secs] > 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K > (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: > [ParNew (promotion failure size = 1026) (promotion failed): > 206521K->206521K(368640K), 0.1667280 secs] > 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 > sys=0.04, real=0.17 secs] > 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs > processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 > secs]48434.474: [scrub symbol& string tables, 0.0088370 secs] [1 > CMS-remark: 3489675K(4833280K)] 36961 > 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 secs] > 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] > 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: > [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: > 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] > (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 > secs] 2873988K->334385K(5201920K), [CMS Perm : > 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 > sys=0.00, real=8.61 secs] > 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: > 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), > 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] > 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: > 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), > 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] > > I assume that the large sizes for the promotion failures during ParNew > are confirming that eliminating large array allocations might help > here. Do you agree? > I'm not sure what to make of the concurrent mode failure. 
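One way to observe old-generation fragmentation directly, rather than inferring it from promotion failures, is the free-list statistics flag already mentioned earlier in the thread; a hedged sketch (it only adds diagnostic output, it does not change collector behaviour):
-------
-XX:PrintFLSStatistics=1
-------
With it, the GC log entries include the CMS generation's total free space and maximum free chunk size; a maximum chunk that keeps shrinking relative to the total free space, and eventually drops below the sizes that fail promotion, is the classic signature of fragmentation rather than simple old-gen exhaustion.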
> > Thanks in advance for any suggestions, > Taras > > On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu wrote: >> >> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>> Hi Jon, >>> >>> We've enabled the PrintPromotionFailure flag in our preprod >>> environment, but so far, no failures yet. >>> We know that the load we generate there is not representative. But >>> perhaps we'll catch something, given enough patience. >>> >>> The flag will also be enabled in our production environment next week >>> - so one way or the other, we'll get more diagnostic data soon. >>> I'll also do some allocation profiling of the application in isolation >>> - I know that there is abusive large byte[] and char[] allocation in >>> there. >>> >>> I've got two questions for now: >>> >>> 1) From googling around on the output to expect >>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>> I see that -XX:+PrintPromotionFailure will generate output like this: >>> ------- >>> 592.079: [ParNew (0: promotion failure size = 2698) (promotion >>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>> ------- >>> In that example line, what does the "0" stand for? >> It's the index of the GC worker thread that experienced the promotion >> failure. >> >>> 2) Below is a snippet of (real) gc log from our production application: >>> ------- >>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>> 345951K->40960K(368640K), 0.0676780 secs] >>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>> sys=0.01, real=0.06 secs] >>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>> 368640K->40959K(368640K), 0.0618880 secs] >>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>> sys=0.00, real=0.06 secs] >>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>> user=0.04 sys=0.00, real=0.04 secs] >>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start] >>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>> [CMS-concurrent-abortable-preclean-start] >>> CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>> 3432839K->3423755K(5201920 >>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>> 0.0289480 secs]2136605.833: [scrub symbol& string tables, 0.0093940 >>> secs] [1 CMS-remark: 3318289K(4833280K >>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>> real=7.61 secs] >>> 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start] >>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>> 
12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>> (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>> sys=0.00, real=10.29 secs] >>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>> ------- >>> >>> In this case I don't know how to interpret the output. >>> a) There's a promotion failure that took 7.49 secs >> This is the time it took to attempt the minor collection (ParNew) and to >> do recovery >> from the failure. >> >>> b) There's a full GC that took 14.08 secs >>> c) There's a concurrent mode failure that took 10.29 secs >> Not sure about b) and c) because the output is mixed up with the >> concurrent-sweep >> output but I think the "concurrent mode failure" message is part of the >> "Full GC" >> message. My guess is that the 10.29 is the time for the Full GC and the >> 14.08 >> maybe is part of the concurrent-sweep message. Really hard to be sure. >> >> Jon >>> How are these events, and their (real) times related to each other? >>> >>> Thanks in advance, >>> Taras >>> >>> On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu wrote: >>>> Taras, >>>> >>>> PrintPromotionFailure seems like it would go a long >>>> way to identify the root of your promotion failures (or >>>> at least eliminating some possible causes). I think it >>>> would help focus the discussion if you could send >>>> the result of that experiment early. >>>> >>>> Jon >>>> >>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>> Hi, >>>>> >>>>> We're running an application with the CMS/ParNew collectors that is >>>>> experiencing occasional promotion failures. >>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>> I've listed the specific JVM options used below (a). >>>>> >>>>> The application is deployed across a handful of machines, and the >>>>> promotion failures are fairly uniform across those. >>>>> >>>>> The first kind of failure we observe is a promotion failure during >>>>> ParNew collection, I've included a snipped from the gc log below (b). >>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>> triggered by the same cause), see (c) below. >>>>> The frequency (after running for a some weeks) is approximately once >>>>> per day. This is bearable, but obviously we'd like to improve on this. >>>>> >>>>> Apart from high-volume request handling (which allocates a lot of >>>>> small objects), the application also runs a few dozen background >>>>> threads that download and process XML documents, typically in the 5-30 >>>>> MB range. >>>>> A known deficiency in the existing code is that the XML content is >>>>> copied twice before processing (once to a byte[], and later again to a >>>>> String/char[]). >>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>> java.lang.String/char[], my suspicion is that these big array >>>>> allocations are causing us to run into the CMS fragmentation issue. 
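Where the documents do not have to be materialized in full, a streaming parser keeps the per-document allocations small and short-lived instead of producing one multi-megabyte byte[] plus char[] per download; a rough sketch with StAX (javax.xml.stream, part of the JDK since 6) — the element name and the handler are made up for illustration:
-------
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingXmlSketch {

    // Processes one document without ever holding the full content as a
    // single byte[] or String, so no multi-megabyte array has to find a
    // large contiguous free block in the CMS old generation.
    public static void process(InputStream in) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) { // "record" is a made-up element name
                    handleRecord(reader.getElementText()); // small, short-lived String per element
                }
            }
        } finally {
            reader.close();
        }
    }

    private static void handleRecord(String text) {
        // application-specific processing goes here
    }
}
-------
Even where streaming is not an option, parsing directly from the response stream would avoid at least one of the two full copies described above.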
>>>>> >>>>> My questions are: >>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>> conclude that CMS fragmentation is the cause of the promotion failure? >>>>> 2) If not, what's the next step of investigating the cause? >>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a >>>>> feeling for the size of the objects that fail promotion. >>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>> case? >>>>> >>>>> Thanks in advance, >>>>> Taras >>>>> >>>>> a) Current JVM options: >>>>> -------------------------------- >>>>> -server >>>>> -Xms5g >>>>> -Xmx5g >>>>> -Xmn400m >>>>> -XX:PermSize=256m >>>>> -XX:MaxPermSize=256m >>>>> -XX:+PrintGCTimeStamps >>>>> -verbose:gc >>>>> -XX:+PrintGCDateStamps >>>>> -XX:+PrintGCDetails >>>>> -XX:SurvivorRatio=8 >>>>> -XX:+UseConcMarkSweepGC >>>>> -XX:+UseParNewGC >>>>> -XX:+DisableExplicitGC >>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>> -XX:+CMSClassUnloadingEnabled >>>>> -XX:+CMSScavengeBeforeRemark >>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>> -Xloggc:gc.log >>>>> -------------------------------- >>>>> >>>>> b) Promotion failure during ParNew >>>>> -------------------------------- >>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>> sys=0.01, real=0.07 secs] >>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>> sys=0.00, real=0.05 secs] >>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>> sys=0.00, real=0.03 secs] >>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>> 3505808K->434291K >>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K), >>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K), >>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K), >>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>> -------------------------------- >>>>> >>>>> c) Promotion failure during CMS >>>>> -------------------------------- >>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>> sys=0.00, real=0.05 secs] >>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>> sys=0.01, real=0.05 secs] >>>>> 2011-12-14T08:29:29.553+0100: 
703018.454: [GC 703018.455: [ParNew: >>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>> sys=0.00, real=0.04 secs] >>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start] >>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>> sys=0.01, real=0.08 secs] >>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start] >>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>> [CMS-concurrent-abortable-preclean-start] >>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs] >>>>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>> sys=2.58, real=9.88 secs] >>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs >>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>> secs]703031.419: [scrub symbol& string tables, 0.0094960 secs] [1 CMS >>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start] >>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>> (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300 >>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>> sys=0.00, real=10.96 secs] >>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K), >>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K), >>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K), >>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>> -------------------------------- >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use 
>>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From fancyerii at gmail.com Thu Jan 12 04:09:28 2012 From: fancyerii at gmail.com (Li Li) Date: Thu, 12 Jan 2012 20:09:28 +0800 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> Message-ID: yesterday, we set the maxNewSize to 256mb. And it works as we expected. but an hours ago, there is a promotion failure and a concurrent mode failure which cost 14s! could anyone explain the gc logs for me? or any documents for the gc log format explanation? 1. Desired survivor size 3342336 bytes, new threshold 5 (max 5) it says survivor size is 3mb 2. 58282K->57602K(59072K), 0.0543930 secs] it says before young gc the memory used is 58282K, after young gc, there are 57602K live objects and the total young space is 59072K 3. (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] before old gc, 7.9GB is used. after old gc 3GB is alive. total old space is 7.9GB in which situation will occur promotion failure and concurrent mode failure? from http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ the author says when CMS is doing concurrent work and JVM is asked for more memory. if there isn't any space for new allocation. then it will occur concurrent mode failure and it will stop the world and do a serial old gc. if there exist enough space but they are fragemented, then a promotion failure will occur. am I right? 2012-01-12T18:27:32.582+0800: [GC [ParNew Desired survivor size 3342336 bytes, new threshold 1 (max 5) - age 1: 4594648 bytes, 4594648 total - age 2: 569200 bytes, 5163848 total : 58548K->5738K(59072K), 0.0159400 secs] 7958648K->7908502K(7984352K), 0.0160610 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] 2012-01-12T18:27:32.609+0800: [GC [ParNew (promotion failed) Desired survivor size 3342336 bytes, new threshold 5 (max 5) - age 1: 1666376 bytes, 1666376 total : 58282K->57602K(59072K), 0.0543930 secs][CMS2012-01-12T18:27:33.804+0800: [CMS-concurrent-preclean: 14.098/34.323 secs] [Times: user=370.28 sys=5.65, real=34.31 secs] (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] On Wed, Jan 11, 2012 at 5:48 PM, Kirk Pepperdine wrote: > CMS is not adaptive. To reconfigure heap, for many reasons, you need a > full GC to occur. The response to a concurrent mode failure is always a > full GC. That gave the JVM the opportunity to resize heap space. If this > behaviour isn't happening when it should or is cause other problems it's > time to either set the young gen size directly with NewSize or switch to > the parallel collector with the adaptive sizing policy turned on. 
Logic > here is that you want to avoid long pauses, use CMS. If CMS is giving you > long pauses, than the parallel collector might be a better choice. > > Regards, > Kirk > > On 2012-01-11, at 10:32 AM, Li Li wrote: > > > after a concurrent mode failure. the young generation changed from about > 50MB to 1.8GB > > What's the logic behind this? > > > > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), > 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: > user=0.20 sys=0.00, real=0.01 secs] > > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), > 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: > user=0.24 sys=0.00, real=0.02 secs] > > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): > 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: > [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, > real=28.24 secs] > > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 > secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], > 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] > > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), > 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: > user=0.26 sys=0.02, real=0.02 secs] > > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), > 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: > user=0.44 sys=0.04, real=0.04 secs] > > > > _______________________________________________ > > hotspot-gc-use mailing list > > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/586eb629/attachment.html From bartosz.markocki at gmail.com Thu Jan 12 05:56:38 2012 From: bartosz.markocki at gmail.com (Bartek Markocki) Date: Thu, 12 Jan 2012 14:56:38 +0100 Subject: How can we cut down those two CMS STW times? Message-ID: Hi all, We have a backend type of application which primary purpose is to cache user specific graphs of objects. The graphs are relatively small in size however the rate at which they can change (based on users' requests) is key here. Our main challenge was to figure out JVM settings that will handle the peak memory allocation at the level of 4.5GB/s. To make things a bit more challenging ;) we have a limited number of RAM on the box (as there are multiple applications co-located on the box and the box has just 64GB of RAM). After a decent amount of testing we figured out the following settings work for us: -Xms6G -Xmx6G -Xmn3G -Xss256k -XX:MaxPermSize=512m -XX:PermSize=512m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=8 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+CMSScavengeBeforeRemark -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled We run Java6 update 27 (64bit, server) on Solaris10. The above settings work for us with exception of one CMS-initial marks and one CMS-remark. By working I mean the STW is less than 1 second for any STW pause. There is one case when CMS-initial mark took 3.44 seconds. 
Here is the extract from the log showing this situation: 90516.053: [GC 90516.053: [ParNew: 2633949K->154547K(2831168K), 0.1729255 secs] 4874963K->2395755K(5976896K), 0.1734674 secs] [Times: user=3.07 sys=0.01, real=0.17 secs] 90516.846: [GC 90516.846: [ParNew: 2671155K->106975K(2831168K), 0.2183780 secs] 4906534K->2365720K(5976896K), 0.2188906 secs] [Times: user=3.62 sys=0.05, real=0.22 secs] 90517.684: [GC 90517.684: [ParNew: 2623583K->106936K(2831168K), 0.0690728 secs] 4833212K->2316857K(5976896K), 0.0695870 secs] [Times: user=1.20 sys=0.01, real=0.07 secs] 90517.976: [CMS-concurrent-sweep: 4.574/5.767 secs] [Times: user=121.01 sys=1.90, real=5.77 secs] 90517.976: [CMS-concurrent-reset-start] 90518.112: [CMS-concurrent-reset: 0.136/0.136 secs] [Times: user=2.76 sys=0.05, real=0.14 secs] 90520.117: [GC [1 CMS-initial-mark: 2209921K(3145728K)] 4768007K(5976896K), 3.4458003 secs] [Times: user=3.45 sys=0.00, real=3.45 secs] 90523.564: [CMS-concurrent-mark-start] 90523.623: [GC 90523.623: [ParNew: 2623544K->119747K(2831168K), 0.1848339 secs] 4833465K->2329818K(5976896K), 0.1853529 secs] [Times: user=3.29 sys=0.01, real=0.19 secs] 90526.087: [CMS-concurrent-mark: 2.314/2.523 secs] [Times: user=18.11 sys=0.18, real=2.52 secs] 90526.087: [CMS-concurrent-preclean-start] 90526.155: [CMS-concurrent-preclean: 0.058/0.068 secs] [Times: user=0.16 sys=0.00, real=0.07 secs] 90526.155: [CMS-concurrent-abortable-preclean-start] 90531.301: [GC 90531.301: [ParNew: 2636355K->45254K(2831168K), 0.0206247 secs] 4846426K->2255579K(5976896K), 0.0211745 secs] [Times: user=0.33 sys=0.00, real=0.02 secs] CMS: abort preclean due to time 90531.470: [CMS-concurrent-abortable-preclean: 5.271/5.315 secs] [Times: user=18.85 sys=0.26, real=5.32 secs] 90531.476: [GC[YG occupancy: 662977 K (2831168 K)]90531.476: [GC 90531.476: [ParNew: 662977K->21990K(2831168K), 0.0342782 secs] 2873302K->2232487K(5976896K), 0.0347927 secs] [Times: user=0.40 sys=0.01, real=0.04 secs] 90531.511: [Rescan (parallel) , 0.0074306 secs]90531.519: [weak refs processing, 0.0000864 secs]90531.519: [class unloading, 0.0350356 secs]90531.554: [scrub symbol & string tables, 0.0266258 secs] [1 CMS-remark: 2210497K(3145728K)] 2232487K(5976896K), 0.1197919 secs] [Times: user=0.59 sys=0.01, real=0.12 secs] 90531.597: [CMS-concurrent-sweep-start] 90532.212: [GC 90532.212: [ParNew: 2538598K->14216K(2831168K), 0.0162798 secs] 4744071K->2219741K(5976896K), 0.0167729 secs] [Times: user=0.26 sys=0.00, real=0.02 secs] 90532.865: [GC 90532.865: [ParNew: 2530824K->18587K(2831168K), 0.0192318 secs] 4732677K->2220478K(5976896K), 0.0197659 secs] [Times: user=0.31 sys=0.00, real=0.02 secs] 90533.500: [GC 90533.500: [ParNew: 2535195K->20886K(2831168K), 0.0206055 secs] 4731793K->2217494K(5976896K), 0.0211439 secs] [Times: user=0.33 sys=0.00, real=0.02 secs] Of course almost immediately one can notice that young generation was almost full during that time, so what happened should not be a surprise. After some googling I found that a similar topic was discussed on this group in 2010 ? with indication that it is caused by the 6412968 bug (CMS: Long initial mark). We tried suggested workarounds and found out that they cannot be applied in our case (limited number of available RAM) and sooner or later we hit the promotion or/and concurrent mode failure with even worse STW time. 
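(Illustrative aside, not part of Bartek's message: the initial-mark line quoted above, "[1 CMS-initial-mark: 2209921K(3145728K)] 4768007K(5976896K), 3.4458003 secs", reports old-gen used(capacity) followed by whole-heap used(capacity), so the young-gen occupancy at initial mark can be read off as the difference. A rough extractor for spotting which initial marks ran against a nearly full young gen, assuming exactly that log format:)
--------------------------------
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InitialMarkOccupancy {
    // Matches e.g. "[1 CMS-initial-mark: 2209921K(3145728K)] 4768007K(5976896K), 3.4458003 secs]"
    private static final Pattern P = Pattern.compile(
            "\\[1 CMS-initial-mark: (\\d+)K\\((\\d+)K\\)\\] (\\d+)K\\((\\d+)K\\), ([\\d.]+) secs\\]");

    public static void main(String[] args) throws Exception {
        BufferedReader r = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = r.readLine()) != null; ) {
            Matcher m = P.matcher(line);
            if (m.find()) {
                long oldUsedK = Long.parseLong(m.group(1));   // old gen used at initial mark
                long heapUsedK = Long.parseLong(m.group(3));  // whole heap used at initial mark
                double pause = Double.parseDouble(m.group(5));
                // young occupancy at initial mark = whole heap used minus old gen used
                System.out.printf("initial-mark pause=%.3fs youngUsed=%dK%n",
                        pause, heapUsedK - oldUsedK);
            }
        }
        r.close();
    }
}
--------------------------------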
Unfortunately, as in 2010, bugs.sun.com does not show the bug so I cannot check if there was any update for the bug, so here comes my first question: was there any update for the bug (what?s the status of the bug)? The next problem that we faced was related to another CMS related bug (6990419). After we applied the suggested workaround (to enable scavenge before remark) the problem was almost completely removed with one exception. There is the following CMS remark: 199848.296: [GC 199848.296: [ParNew: 2689127K->119235K(2831168K), 0.0906522 secs] 4868292K->2309941K(5976896K), 0.0912736 secs] [Times: user=1.22 sys=0.02, real=0.09 secs] 199853.617: [GC 199853.617: [ParNew: 2635843K->91628K(2831168K), 0.1040602 secs] 4826549K->2311078K(5976896K), 0.1046178 secs] [Times: user=1.15 sys=0.01, real=0.10 secs] 199853.726: [GC [1 CMS-initial-mark: 2219449K(3145728K)] 2311170K(5976896K), 0.1208219 secs] [Times: user=0.12 sys=0.00, real=0.12 secs] 199853.847: [CMS-concurrent-mark-start] 199856.405: [CMS-concurrent-mark: 2.547/2.557 secs] [Times: user=18.49 sys=0.35, real=2.56 secs] 199856.405: [CMS-concurrent-preclean-start] 199856.438: [CMS-concurrent-preclean: 0.031/0.033 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] 199856.439: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 199861.899: [CMS-concurrent-abortable-preclean: 5.443/5.460 secs] [Times: user=27.67 sys=1.14, real=5.46 secs] 199861.903: [GC[YG occupancy: 1353639 K (2831168 K)]199861.903: [Rescan (parallel) , 1.4282026 secs]199863.332: [weak refs processing, 0.0019473 secs]199863.334: [class unloading, 0.0365617 secs]199863.370: [scrub symbol & string tables, 0.0267902 secs] [1 CMS-remark: 2219449K(3145728K)] 3573089K(5976896K), 1.5099836 secs] [Times: user=12.20 sys=0.17, real=1.51 secs] 199863.414: [CMS-concurrent-sweep-start] 199863.420: [GC 199863.421: [ParNew: 1355519K->53699K(2831168K), 0.1129519 secs] 3574969K->2308972K(5976896K), 0.1138995 secs] [Times: user=1.10 sys=0.01, real=0.11 secs] 199865.857: [CMS-concurrent-sweep: 2.324/2.443 secs] [Times: user=10.15 sys=0.61, real=2.44 secs] 199865.857: [CMS-concurrent-reset-start] 199865.888: [CMS-concurrent-reset: 0.031/0.031 secs] [Times: user=0.05 sys=0.00, real=0.03 secs] 199893.779: [GC 199893.780: [ParNew: 2570307K->58285K(2831168K), 0.0397922 secs] 4620179K->2108197K(5976896K), 0.0403072 secs] [Times: user=0.68 sys=0.00, real=0.04 secs] 199906.510: [GC 199906.510: [ParNew: 2574893K->55484K(2831168K), 0.0390212 secs] 4624805K->2105432K(5976896K), 0.0395148 secs] [Times: user=0.67 sys=0.01, real=0.04 secs] There are two things to notice here: 1. The time of this rescan was 20 times longer than any other rescan time (1.4 seconds comparing to 58 ms) 2. There was no minor GC before CMS-remark even though it was explicitly requested. The question here is: is that something already covered by the 6990419 for which workaround simply does not work or something else? Thank you, Bartek From charlie.hunt at oracle.com Thu Jan 12 05:55:57 2012 From: charlie.hunt at oracle.com (charlie hunt) Date: Thu, 12 Jan 2012 07:55:57 -0600 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> Message-ID: <4F0EE66D.7090406@oracle.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/858971dd/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5166 bytes Desc: S/MIME Cryptographic Signature Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/858971dd/smime-0001.p7s From charlie.hunt at oracle.com Thu Jan 12 11:11:26 2012 From: charlie.hunt at oracle.com (charlie hunt) Date: Thu, 12 Jan 2012 13:11:26 -0600 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <1C5F8123-6AA8-4D08-A248-50F3CE4A5516@gmail.com> References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> <4F0EE66D.7090406@oracle.com> <1C5F8123-6AA8-4D08-A248-50F3CE4A5516@gmail.com> Message-ID: <4F0F305E.1010305@oracle.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/6c7ea4c1/attachment.html -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5166 bytes Desc: S/MIME Cryptographic Signature Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/6c7ea4c1/smime.p7s From kirk.pepperdine at gmail.com Thu Jan 12 11:03:41 2012 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Thu, 12 Jan 2012 20:03:41 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> Message-ID: Hi, CMS failures occur as a result of a trend over time. It's almost impossible to recommend a correction from a single incident. that said, Jon's blog entry explains CMS failure very clearly. This the record you've sent suggests that young gen is way too small.. but again, I can't say anything with a single record. Regards, Kirk On 2012-01-12, at 1:09 PM, Li Li wrote: > yesterday, we set the maxNewSize to 256mb. And it works as we expected. but an hours ago, there is a promotion failure and a concurrent mode failure which cost 14s! could anyone explain the gc logs for me? > or any documents for the gc log format explanation? > > 1. Desired survivor size 3342336 bytes, new threshold 5 (max 5) > it says survivor size is 3mb > 2. 58282K->57602K(59072K), 0.0543930 secs] > it says before young gc the memory used is 58282K, after young gc, there are 57602K live objects and the total young space is 59072K > 3. (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] > before old gc, 7.9GB is used. after old gc 3GB is alive. total old space is 7.9GB > > in which situation will occur promotion failure and concurrent mode failure? > from http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ > the author says when CMS is doing concurrent work and JVM is asked for more memory. if there isn't any space for new allocation. then it will occur concurrent mode failure and it will stop the world and do a serial old gc. > if there exist enough space but they are fragemented, then a promotion failure will occur. > am I right? 
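(Illustrative aside, not part of the messages above, on the "Desired survivor size" question: it is not the survivor-space size itself but the fill target used when the tenuring threshold is recomputed, roughly survivor capacity * TargetSurvivorRatio / 100, with TargetSurvivorRatio defaulting to 50. A back-of-the-envelope calculation under the default eden-plus-two-survivors layout; the 64 MB young size is an assumption chosen because it lands near the 3342336 bytes shown in the log, not a value reported in the thread:)
--------------------------------
public class DesiredSurvivorSize {
    public static void main(String[] args) {
        long newSizeBytes = 64L * 1024 * 1024; // assumption: ~64 MB committed young gen
        int survivorRatio = 8;                 // default -XX:SurvivorRatio
        int targetSurvivorRatio = 50;          // default -XX:TargetSurvivorRatio

        // young gen = eden + 2 survivor spaces, with eden/survivor = SurvivorRatio
        long survivorCapacity = newSizeBytes / (survivorRatio + 2);
        // roughly the "Desired survivor size" printed with the tenuring distribution
        long desiredSurvivorSize = survivorCapacity * targetSurvivorRatio / 100;

        System.out.println("survivor capacity ~ " + survivorCapacity + " bytes");
        System.out.println("desired survivor  ~ " + desiredSurvivorSize + " bytes");
    }
}
--------------------------------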
> > 2012-01-12T18:27:32.582+0800: [GC [ParNew > Desired survivor size 3342336 bytes, new threshold 1 (max 5) > - age 1: 4594648 bytes, 4594648 total > - age 2: 569200 bytes, 5163848 total > : 58548K->5738K(59072K), 0.0159400 secs] 7958648K->7908502K(7984352K), 0.0160610 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] > 2012-01-12T18:27:32.609+0800: [GC [ParNew (promotion failed) > Desired survivor size 3342336 bytes, new threshold 5 (max 5) > - age 1: 1666376 bytes, 1666376 total > : 58282K->57602K(59072K), 0.0543930 secs][CMS2012-01-12T18:27:33.804+0800: [CMS-concurrent-preclean: 14.098/34.323 secs] [Times: user=370.28 sys=5.65, real=34.31 secs] > (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] > > On Wed, Jan 11, 2012 at 5:48 PM, Kirk Pepperdine wrote: > CMS is not adaptive. To reconfigure heap, for many reasons, you need a full GC to occur. The response to a concurrent mode failure is always a full GC. That gave the JVM the opportunity to resize heap space. If this behaviour isn't happening when it should or is cause other problems it's time to either set the young gen size directly with NewSize or switch to the parallel collector with the adaptive sizing policy turned on. Logic here is that you want to avoid long pauses, use CMS. If CMS is giving you long pauses, than the parallel collector might be a better choice. > > Regards, > Kirk > > On 2012-01-11, at 10:32 AM, Li Li wrote: > > > after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB > > What's the logic behind this? > > > > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] > > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] > > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] > > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] > > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] > > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] > > > > _______________________________________________ > > hotspot-gc-use mailing list > > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/9ec8a7e7/attachment-0001.html From kirk.pepperdine at gmail.com Thu Jan 12 11:08:17 2012 From: kirk.pepperdine at gmail.com (Kirk Pepperdine) Date: Thu, 12 Jan 2012 20:08:17 +0100 Subject: MaxTenuringThreshold available in ParNewGC? In-Reply-To: <4F0EE66D.7090406@oracle.com> References: <9462441C-C11B-4CDC-83BD-D78E2A1138AB@gmail.com> <4F0EE66D.7090406@oracle.com> Message-ID: <1C5F8123-6AA8-4D08-A248-50F3CE4A5516@gmail.com> Charlie, You shameless shameless self promoter!!!!! Shame on you!!!! LiLi, please ignore Charlie's shameless self promotion and run out and buy the book. I think it will be of great help to your understanding of the problems your currently facing. Charlie, what's my commission on the sale? Regards, Kirk ps ;-) On 2012-01-12, at 2:55 PM, charlie hunt wrote: > At the risk of sounding self-promotional ....based on the questions you are asking, I think you'd find a lot of value in the Java Performance book: > http://www.amazon.com/Java-Performance-Charlie-Hunt/dp/0137142528 > > Many of the folks on the mailing list were key contributors to its content. > > Almost forget ... yes, the book offers a description of the GC log and it also offers suggestions on how you can use the "Desired survivor size", "new threshold" and tenuring distribution information to help determine how to size young generation. > > hths, > > charlie ... > > On 01/12/12 06:09 AM, Li Li wrote: >> >> yesterday, we set the maxNewSize to 256mb. And it works as we expected. but an hours ago, there is a promotion failure and a concurrent mode failure which cost 14s! could anyone explain the gc logs for me? >> or any documents for the gc log format explanation? >> >> 1. Desired survivor size 3342336 bytes, new threshold 5 (max 5) >> it says survivor size is 3mb >> 2. 58282K->57602K(59072K), 0.0543930 secs] >> it says before young gc the memory used is 58282K, after young gc, there are 57602K live objects and the total young space is 59072K >> 3. (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] >> before old gc, 7.9GB is used. after old gc 3GB is alive. total old space is 7.9GB >> >> in which situation will occur promotion failure and concurrent mode failure? >> from http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/ >> the author says when CMS is doing concurrent work and JVM is asked for more memory. if there isn't any space for new allocation. then it will occur concurrent mode failure and it will stop the world and do a serial old gc. >> if there exist enough space but they are fragemented, then a promotion failure will occur. >> am I right? 
>> >> 2012-01-12T18:27:32.582+0800: [GC [ParNew >> Desired survivor size 3342336 bytes, new threshold 1 (max 5) >> - age 1: 4594648 bytes, 4594648 total >> - age 2: 569200 bytes, 5163848 total >> : 58548K->5738K(59072K), 0.0159400 secs] 7958648K->7908502K(7984352K), 0.0160610 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] >> 2012-01-12T18:27:32.609+0800: [GC [ParNew (promotion failed) >> Desired survivor size 3342336 bytes, new threshold 5 (max 5) >> - age 1: 1666376 bytes, 1666376 total >> : 58282K->57602K(59072K), 0.0543930 secs][CMS2012-01-12T18:27:33.804+0800: [CMS-concurrent-preclean: 14.098/34.323 secs] [Times: user=370.28 sys=5.65, real=34.31 secs] >> (concurrent mode failure): 7907405K->3086848K(7929856K), 14.3005340 secs] 7961046K->3086848K(7988928K), [CMS Perm : 32296K->31852K(53932K)], 14.3552450 secs] [Times: user=14.53 sys=0.01, real=14.35 secs] >> >> On Wed, Jan 11, 2012 at 5:48 PM, Kirk Pepperdine wrote: >> CMS is not adaptive. To reconfigure heap, for many reasons, you need a full GC to occur. The response to a concurrent mode failure is always a full GC. That gave the JVM the opportunity to resize heap space. If this behaviour isn't happening when it should or is cause other problems it's time to either set the young gen size directly with NewSize or switch to the parallel collector with the adaptive sizing policy turned on. Logic here is that you want to avoid long pauses, use CMS. If CMS is giving you long pauses, than the parallel collector might be a better choice. >> >> Regards, >> Kirk >> >> On 2012-01-11, at 10:32 AM, Li Li wrote: >> >> > after a concurrent mode failure. the young generation changed from about 50MB to 1.8GB >> > What's the logic behind this? >> > >> > 2012-01-10T22:23:54.544+0800: [GC [ParNew: 55389K->6528K(59072K), 0.0175440 secs] 5886124K->5839323K(6195204K), 0.0177480 secs] [Times: user=0.20 sys=0.00, real=0.01 secs] >> > 2012-01-10T22:23:54.575+0800: [GC [ParNew: 59072K->6528K(59072K), 0.0234040 secs] 5891867K->5845823K(6201540K), 0.0236070 secs] [Times: user=0.24 sys=0.00, real=0.02 secs] >> > 2012-01-10T22:23:54.612+0800: [GC [ParNew (promotion failed): 59072K->58862K(59072K), 2.3119860 secs][CMS2012-01-10T22:23:57.153+0800: [CMS-concurrent-preclean: 10.999/28.245 secs] [Times: user=290.41 sys=4.65, real=28.24 secs] >> > (concurrent mode failure): 5841457K->2063142K(6144000K), 8.8971660 secs] 5898367K->2063142K(6203072K), [CMS Perm : 31369K->31131K(52316K)], 11.2110080 secs] [Times: user=11.73 sys=0.51, real=11.21 secs] >> > 2012-01-10T22:24:06.125+0800: [GC [ParNew: 1638400K->46121K(1843200K), 0.0225800 secs] 3701542K->2109263K(7987200K), 0.0228190 secs] [Times: user=0.26 sys=0.02, real=0.02 secs] >> > 2012-01-10T22:24:06.357+0800: [GC [ParNew: 1684521K->111262K(1843200K), 0.0381370 secs] 3747663K->2174404K(7987200K), 0.0383860 secs] [Times: user=0.44 sys=0.04, real=0.04 secs] >> > >> > _______________________________________________ >> > hotspot-gc-use mailing list >> > hotspot-gc-use at openjdk.java.net >> > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120112/80268148/attachment.html From gbowyer at fastmail.co.uk Thu Jan 19 13:37:00 2012 From: gbowyer at fastmail.co.uk (Greg Bowyer) Date: Thu, 19 Jan 2012 13:37:00 -0800 Subject: CMS Sudden long pauses with large number of weak references Message-ID: <4F188CFC.9030401@fastmail.co.uk> Hi all I have an application that is solr/lucene running on JDK 1.6 / 1.7 with CMS, 9Gb heap. The only unusual options here are DisableExplictGC and UseCompressedOops (which I do not think are the issue). Generally the application has a heap usage of ~4G with a 2-3G float of continual garbage on top. The application creates several large heap objects, some of these are sizeable arrays (~300mb), some of these are objects that have large retained graphs. One of these is a WeakHashMap that in lucene stores IndexReaders (as the key) -> "Fields", this mapping is used to avoid disk access, there are not typically a large number of readers open at any time (at worst this could be say 25). Generally we see the GC behavior being fairly solid across JVMS, however we get ever increasing amounts of concurrent-mode-failues with a marked spike of weak references, this only seems to appear when new indexes load on lucene, which is when the references are changed and when new versions of these caches are created. ---- %< ---- 4747.399: [GC 4747.399: [ParNew4747.492: [SoftReference, 0 refs, 0.0000120 secs]4747.492: [WeakReference, 5 refs, 0.0000050 secs]4747.492: [FinalReference, 0 refs, 0.0000040 secs]4747.492: [PhantomReference, 0 refs, 0.0000030 secs]4747.492: [JNI Weak Reference, 0.0000020 secs]: 133242K->11776K(153344K), 0.0925850 secs] 7199972K->7187397K(9420160K), 0.0926840 secs] [Times: user=0.72 sys=0.00, real=0.09 secs] Total time for which application threads were stopped: 0.0931920 seconds 4747.530: [GC [1 CMS-initial-mark: 7175620K(9266816K)] 7300153K(9420160K), 0.0044900 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] Total time for which application threads were stopped: 0.0398160 seconds 4747.535: [CMS-concurrent-mark-start] 4747.537: [GC 4747.537: [ParNew4747.568: [SoftReference, 0 refs, 0.0000120 secs]4747.568: [WeakReference, 6 refs, 0.0000060 secs]4747.568: [FinalReference, 0 refs, 0.0000040 secs]4747.568: [PhantomReference, 0 refs, 0.0000040 secs]4747.568: [JNI Weak Reference, 0.0000020 secs] (promotion failed): 127253K->127965K(153344K), 0.0315280 secs]4747.569: [CMS4755.725: [CMS-concurrent-mark: 8.157/8.190 secs] [Times: user=10.81 sys=3.49, real=8.19 secs] (concurrent mode failure)4757.436: [SoftReference, 0 refs, 0.0000050 secs]4757.436: [WeakReference, 6360 refs, 0.0016500 secs]4757.437: [FinalReference, 462 refs, 0.0088580 secs]4757.446: [PhantomReference, 114 refs, 0.0000220 secs]4757.446: [JNI Weak Reference, 0.0000070 secs]: 7180161K->3603572K(9266816K), 17.1070460 secs] 7302874K->3603572K(9420160K), [CMS Perm : 47059K->47044K(78380K)], 17.1387610 secs] [Times: user=19.75 sys=3.49, real=17.14 secs] Total time for which application threads were stopped: 17.1393510 seconds Total time for which application threads were stopped: 0.0245300 seconds ---- >% ---- ? Is this due to the defects fixed here http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f1391adc6681, would -XX:+ParallelRefProcEnabled help here ? 
Also is there anything that could explain the large number of suddenly processed and traced weak-references Many thanks -- Greg From rednaxelafx at gmail.com Thu Jan 19 19:40:09 2012 From: rednaxelafx at gmail.com (Krystal Mok) Date: Fri, 20 Jan 2012 11:40:09 +0800 Subject: CMS Sudden long pauses with large number of weak references In-Reply-To: <4F188CFC.9030401@fastmail.co.uk> References: <4F188CFC.9030401@fastmail.co.uk> Message-ID: Hi Greg, This might just be another case of "7112034: Parallel CMS fails to properly mark reference objects" [1], which is only fixed recently, and isn't delivered in any FCS releases yet. Could you try a recent JDK7u4 preview and see if the problem goes away? Or, try -XX:-CMSConcurrentMTEnabled, which should workaround this particular bug, but you'll get longer CMS collection cycles. - Kris [1]: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7112034 [2]: http://jdk7.java.net/download.html On Fri, Jan 20, 2012 at 5:37 AM, Greg Bowyer wrote: > Hi all > > I have an application that is solr/lucene running on JDK 1.6 / 1.7 with > CMS, 9Gb heap. The only unusual options here are DisableExplictGC and > UseCompressedOops (which I do not think are the issue). > > Generally the application has a heap usage of ~4G with a 2-3G float of > continual garbage on top. > > The application creates several large heap objects, some of these are > sizeable arrays (~300mb), some of these are objects that have large > retained graphs. > > One of these is a WeakHashMap that in lucene stores IndexReaders (as the > key) -> "Fields", this mapping is used to avoid disk access, there are > not typically a large number of readers open at any time (at worst this > could be say 25). > > Generally we see the GC behavior being fairly solid across JVMS, however > we get ever increasing amounts of concurrent-mode-failues with a marked > spike of weak references, this only seems to appear when new indexes > load on lucene, which is when the references are changed and when new > versions of these caches are created. 
> > ---- %< ---- > 4747.399: [GC 4747.399: [ParNew4747.492: [SoftReference, 0 refs, > 0.0000120 secs]4747.492: [WeakReference, 5 refs, 0.0000050 > secs]4747.492: [FinalReference, 0 refs, 0.0000040 secs]4747.492: > [PhantomReference, 0 refs, 0.0000030 secs]4747.492: [JNI Weak Reference, > 0.0000020 secs]: 133242K->11776K(153344K), 0.0925850 secs] > 7199972K->7187397K(9420160K), 0.0926840 secs] [Times: user=0.72 > sys=0.00, real=0.09 secs] > Total time for which application threads were stopped: 0.0931920 seconds > 4747.530: [GC [1 CMS-initial-mark: 7175620K(9266816K)] > 7300153K(9420160K), 0.0044900 secs] [Times: user=0.00 sys=0.00, > real=0.00 secs] > Total time for which application threads were stopped: 0.0398160 seconds > 4747.535: [CMS-concurrent-mark-start] > 4747.537: [GC 4747.537: [ParNew4747.568: [SoftReference, 0 refs, > 0.0000120 secs]4747.568: [WeakReference, 6 refs, 0.0000060 > secs]4747.568: [FinalReference, 0 refs, 0.0000040 secs]4747.568: > [PhantomReference, 0 refs, 0.0000040 secs]4747.568: [JNI Weak Reference, > 0.0000020 secs] (promotion failed): 127253K->127965K(153344K), 0.0315280 > secs]4747.569: [CMS4755.725: [CMS-concurrent-mark: 8.157/8.190 secs] > [Times: user=10.81 sys=3.49, real=8.19 secs] > (concurrent mode failure)4757.436: [SoftReference, 0 refs, 0.0000050 > secs]4757.436: [WeakReference, 6360 refs, 0.0016500 secs]4757.437: > [FinalReference, 462 refs, 0.0088580 secs]4757.446: [PhantomReference, > 114 refs, 0.0000220 secs]4757.446: [JNI Weak Reference, 0.0000070 secs]: > 7180161K->3603572K(9266816K), 17.1070460 secs] > 7302874K->3603572K(9420160K), [CMS Perm : 47059K->47044K(78380K)], > 17.1387610 secs] [Times: user=19.75 sys=3.49, real=17.14 secs] > Total time for which application threads were stopped: 17.1393510 seconds > Total time for which application threads were stopped: 0.0245300 seconds > ---- >% ---- > > ? Is this due to the defects fixed here > http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f1391adc6681, > would -XX:+ParallelRefProcEnabled help here ? > > Also is there anything that could explain the large number of suddenly > processed and traced weak-references > > Many thanks > -- Greg > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120120/ed4b5f1f/attachment.html From gbowyer at fastmail.co.uk Fri Jan 20 10:48:32 2012 From: gbowyer at fastmail.co.uk (Greg Bowyer) Date: Fri, 20 Jan 2012 10:48:32 -0800 Subject: CMS Sudden long pauses with large number of weak references In-Reply-To: References: <4F188CFC.9030401@fastmail.co.uk> Message-ID: <4F19B700.8030404@fastmail.co.uk> I didnt know this was in a forthcoming release so I had already compiled JDK7 from tip with the change (since its so small and simple) I still saw some pauses, although I think that might be down to CMS starting to late, so I have lowered the Occupancy Fraction and started testing. So far with the fix for 7112034 , ParallelRefProcessing and an Occupancy Fraction of 70 I am seeing fairly good results with the worst pause time so far of ~200 ms (this is far better than multi-second pauses I had before) I will let a few more tests run and then try jdk7u4 and let you know if this helps me thank you for confirming my thoughts on this. 
-- Greg On 19/01/12 19:40, Krystal Mok wrote: > Hi Greg, > > This might just be another case of "7112034: Parallel CMS fails to > properly mark reference objects" [1], which is only fixed recently, > and isn't delivered in any FCS releases yet. > > Could you try a recent JDK7u4 preview and see if the problem goes > away? Or, try -XX:-CMSConcurrentMTEnabled, which should workaround > this particular bug, but you'll get longer CMS collection cycles. > > - Kris > > [1]: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7112034 > [2]: http://jdk7.java.net/download.html > > On Fri, Jan 20, 2012 at 5:37 AM, Greg Bowyer > wrote: > > Hi all > > I have an application that is solr/lucene running on JDK 1.6 / 1.7 > with > CMS, 9Gb heap. The only unusual options here are DisableExplictGC and > UseCompressedOops (which I do not think are the issue). > > Generally the application has a heap usage of ~4G with a 2-3G float of > continual garbage on top. > > The application creates several large heap objects, some of these are > sizeable arrays (~300mb), some of these are objects that have large > retained graphs. > > One of these is a WeakHashMap that in lucene stores IndexReaders > (as the > key) -> "Fields", this mapping is used to avoid disk access, there are > not typically a large number of readers open at any time (at worst > this > could be say 25). > > Generally we see the GC behavior being fairly solid across JVMS, > however > we get ever increasing amounts of concurrent-mode-failues with a > marked > spike of weak references, this only seems to appear when new indexes > load on lucene, which is when the references are changed and when new > versions of these caches are created. > > ---- %< ---- > 4747.399: [GC 4747.399: [ParNew4747.492: [SoftReference, 0 refs, > 0.0000120 secs]4747.492: [WeakReference, 5 refs, 0.0000050 > secs]4747.492: [FinalReference, 0 refs, 0.0000040 secs]4747.492: > [PhantomReference, 0 refs, 0.0000030 secs]4747.492: [JNI Weak > Reference, > 0.0000020 secs]: 133242K->11776K(153344K), 0.0925850 secs] > 7199972K->7187397K(9420160K), 0.0926840 secs] [Times: user=0.72 > sys=0.00, real=0.09 secs] > Total time for which application threads were stopped: 0.0931920 > seconds > 4747.530: [GC [1 CMS-initial-mark: 7175620K(9266816K)] > 7300153K(9420160K), 0.0044900 secs] [Times: user=0.00 sys=0.00, > real=0.00 secs] > Total time for which application threads were stopped: 0.0398160 > seconds > 4747.535: [CMS-concurrent-mark-start] > 4747.537: [GC 4747.537: [ParNew4747.568: [SoftReference, 0 refs, > 0.0000120 secs]4747.568: [WeakReference, 6 refs, 0.0000060 > secs]4747.568: [FinalReference, 0 refs, 0.0000040 secs]4747.568: > [PhantomReference, 0 refs, 0.0000040 secs]4747.568: [JNI Weak > Reference, > 0.0000020 secs] (promotion failed): 127253K->127965K(153344K), > 0.0315280 > secs]4747.569: [CMS4755.725: [CMS-concurrent-mark: 8.157/8.190 secs] > [Times: user=10.81 sys=3.49, real=8.19 secs] > (concurrent mode failure)4757.436: [SoftReference, 0 refs, 0.0000050 > secs]4757.436: [WeakReference, 6360 refs, 0.0016500 secs]4757.437: > [FinalReference, 462 refs, 0.0088580 secs]4757.446: [PhantomReference, > 114 refs, 0.0000220 secs]4757.446: [JNI Weak Reference, 0.0000070 > secs]: > 7180161K->3603572K(9266816K), 17.1070460 secs] > 7302874K->3603572K(9420160K), [CMS Perm : 47059K->47044K(78380K)], > 17.1387610 secs] [Times: user=19.75 sys=3.49, real=17.14 secs] > Total time for which application threads were stopped: 17.1393510 > seconds > Total time for which application threads were 
stopped: 0.0245300 > seconds > ---- >% ---- > > ? Is this due to the defects fixed here > http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f1391adc6681, > would -XX:+ParallelRefProcEnabled help here ? > > Also is there anything that could explain the large number of suddenly > processed and traced weak-references > > Many thanks > -- Greg > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120120/283a2161/attachment-0001.html From gbowyer at fastmail.co.uk Mon Jan 23 14:39:52 2012 From: gbowyer at fastmail.co.uk (Greg Bowyer) Date: Mon, 23 Jan 2012 14:39:52 -0800 Subject: Odd pause with ParNew Message-ID: <4F1DE1B8.1070909@fastmail.co.uk> Hi all, working through my GC issues I have found that the fix for CMS weak references is making my GC far more predictable However I occasionally find that sometimes I see a ParNew collection of 1 second, outside of the number of VM operations for the safepoint there is nothing that would seem to be an issue ? does anyone know why this ParNew claims a 1 second wait ? This is for a JDK compiled from the tip of jdk7/jdk7 repo in openjdk with the recent fix for CMS parallel marking. ---- %< ---- Total time for which application threads were stopped: 0.0005490 seconds Application time: 0.7253910 seconds 38668.349: [GC 38668.349: [ParNew38669.323: [SoftReference, 0 refs, 0.0000160 secs]38669.323: [WeakReference, 12 refs, 0.0000080 secs]38669.323: [FinalReference, 0 refs, 0.0000040 secs]38669.323: [PhantomReference, 0 refs, 0.0000060 secs]38669.323: [JNI Weak Reference, 0.0000080 secs] Desired survivor size 34865152 bytes, new threshold 1 (max 6) - age 1: 69728352 bytes, 69728352 total : 610922K->68096K(613440K), 0.9741030 secs] 10005726K->9689043K(12514816K), 0.9742160 secs] [Times: user=7.26 sys=0.01, real=0.97 secs] Total time for which application threads were stopped: 1.0050700 seconds ---- >% ---- This matches the following safepoint (I guess): ---- %< ---- 38668.316: GenCollectForAllocation [ 55 7 10 ] [ 0 0 30 0 974 ] 7 ---- >% ---- Which is hiding for reference in --- %< ---- vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count 38643.172: GenCollectForAllocation [ 55 9 12 ] [ 0 0 148 0 130 ] 9 38646.203: GenCollectForAllocation [ 55 12 12 ] [ 2 0 22 0 139 ] 12 38648.293: GenCollectForAllocation [ 55 13 13 ] [ 1 0 1 0 144 ] 13 38649.852: CMS_Final_Remark [ 55 14 14 ] [ 0 0 1 0 168 ] 14 38652.328: FindDeadlocks [ 55 16 16 ] [ 0 0 59 0 0 ] 16 38652.387: FindDeadlocks [ 55 4 4 ] [ 0 0 0 0 0 ] 4 38652.762: GenCollectForAllocation [ 55 9 9 ] [ 0 0 24 0 132 ] 9 38655.586: GenCollectForAllocation [ 55 10 10 ] [ 0 1 9 0 294 ] 10 38656.961: GenCollectForAllocation [ 55 6 6 ] [ 0 0 0 0 215 ] 5 38658.125: GenCollectForAllocation [ 55 3 4 ] [ 0 0 1 0 91 ] 2 38658.223: CMS_Initial_Mark [ 55 2 4 ] [ 0 0 0 0 6 ] 1 38658.926: GenCollectForAllocation [ 55 6 6 ] [ 0 0 1 0 102 ] 6 38661.621: GenCollectForAllocation [ 55 5 5 ] [ 0 0 1 0 72 ] 5 38663.527: GenCollectForAllocation [ 55 7 7 ] [ 0 0 0 0 56 ] 7 38665.180: GenCollectForAllocation [ 55 5 5 ] [ 0 0 1 0 659 ] 4 38667.230: GenCollectForAllocation [ 55 11 11 ] [ 0 0 0 0 88 ] 11 38667.566: FindDeadlocks [ 55 13 13 ] [ 0 0 0 0 0 ] 13 38667.566: FindDeadlocks [ 55 5 5 ] [ 0 0 0 0 0 ] 5 
38667.582: RevokeBias [ 55 10 11 ] [ 0 0 8 0 0 ] 9 > 38668.316: GenCollectForAllocation [ 55 7 10 ] [ 0 0 30 0 974 ] 7 38670.551: GenCollectForAllocation [ 55 10 10 ] [ 0 0 1 0 391 ] 10 38671.875: GenCollectForAllocation [ 55 6 6 ] [ 0 0 25 0 409 ] 6 38674.230: GenCollectForAllocation [ 55 11 11 ] [ 1 0 1 0 415 ] 10 38676.121: GenCollectForAllocation [ 55 1 1 ] [ 0 0 0 0 558 ] 0 38677.691: GenCollectForAllocation [ 55 7 9 ] [ 0 0 1 0 388 ] 6 38679.488: GenCollectForAllocation [ 55 6 9 ] [ 0 0 10 0 297 ] 6 38680.367: CMS_Final_Remark [ 55 12 12 ] [ 0 0 0 0 310 ] 12 38682.016: GenCollectForAllocation [ 55 10 11 ] [ 0 0 0 0 295 ] 10 38683.473: FindDeadlocks [ 55 6 8 ] [ 1 0 9 0 0 ] 4 38683.484: FindDeadlocks [ 55 7 7 ] [ 0 0 0 0 0 ] 7 vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count 38683.867: GenCollectForAllocation [ 55 7 8 ] [ 1 0 2 0 198 ] 7 38685.070: GenCollectForAllocation [ 55 8 8 ] [ 0 0 0 0 430 ] 8 38687.312: GenCollectForAllocation [ 55 6 8 ] [ 0 0 1 0 287 ] 6 38690.094: GenCollectForAllocation [ 55 8 10 ] [ 0 0 16 0 49 ] 7 38692.910: CMS_Initial_Mark [ 55 3 3 ] [ 0 0 1 0 129 ] 2 38694.043: no vm operation [ 55 7 9 ] [ 0 0 644 0 0 ] 5 38696.605: GenCollectForAllocation [ 55 9 9 ] [ 0 0 272 0 90 ] 9 38698.535: FindDeadlocks [ 55 8 13 ] [ 0 0 40 0 0 ] 8 38698.578: FindDeadlocks [ 55 8 8 ] [ 0 0 3 0 0 ] 8 38702.559: GenCollectForAllocation [ 55 6 6 ] [ 0 0 1 0 55 ] 6 38709.008: GenCollectForAllocation [ 55 1 1 ] [ 0 0 0 0 48 ] 0 38712.719: CMS_Final_Remark [ 55 3 3 ] [ 0 0 0 0 51 ] 2 38713.625: FindDeadlocks [ 55 2 2 ] [ 0 0 0 0 0 ] 0 38713.625: FindDeadlocks [ 55 2 2 ] [ 0 0 0 0 0 ] 2 38720.492: CMS_Initial_Mark [ 55 4 4 ] [ 0 0 1 0 88 ] 4 38721.059: GenCollectForAllocation [ 55 4 4 ] [ 0 0 1 0 58 ] 3 38725.684: GenCollectForAllocation [ 55 3 4 ] [ 0 0 0 0 57 ] 2 38725.742: CMS_Final_Remark [ 55 2 4 ] [ 0 0 0 0 254 ] 2 38729.410: FindDeadlocks [ 55 5 5 ] [ 0 0 0 0 0 ] 5 38729.410: FindDeadlocks [ 55 5 5 ] [ 0 0 0 0 0 ] 5 38730.527: GenCollectForAllocation [ 55 11 10 ] [ 1 0 16 0 43 ] 8 38734.977: GenCollectForAllocation [ 55 2 2 ] [ 0 0 0 0 53 ] 0 ---- >% ---- Any thoughts ? Many thanks -- Greg From ysr1729 at gmail.com Mon Jan 23 17:45:34 2012 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 23 Jan 2012 17:45:34 -0800 Subject: Odd pause with ParNew In-Reply-To: <4F1DE1B8.1070909@fastmail.co.uk> References: <4F1DE1B8.1070909@fastmail.co.uk> Message-ID: Hi Greg -- one of the first things I check in such cases is to see if that particular scavenge happened to copy much more data than the rest. (Since scavenge times are directly proportional to copy/allocation costs.) The second thing I check is whether the scavenge immediately followed a compacting collection (for which there is a known allocation pathology with an existing CR). For the first above, there isn't sufficient surrounding info/context, but the "new threshold (max 6)" indicates that there may have been a spurt in promotion (which is explained by the sudden lowering of tenuring threshold), so you may find upon further analysis that there was indeed a sudden jump in surviving objects (and thence a concommitant increase perhaps in objects copied). >From the safepoint stats you present it doesn't look like we'd have to look beyond GC to find an explanation (i.e. safeppointing or non-GC processing do not seem to be implicated here). HTHS. 
-- ramki On Mon, Jan 23, 2012 at 2:39 PM, Greg Bowyer wrote: > Hi all, working through my GC issues I have found that the fix for CMS > weak references is making my GC far more predictable > > However I occasionally find that sometimes I see a ParNew collection of > 1 second, outside of the number of VM operations for the safepoint there > is nothing that would seem to be an issue ? does anyone know why this > ParNew claims a 1 second wait ? > > This is for a JDK compiled from the tip of jdk7/jdk7 repo in openjdk > with the recent fix for CMS parallel marking. > > ---- %< ---- > Total time for which application threads were stopped: 0.0005490 seconds > Application time: 0.7253910 seconds > 38668.349: [GC 38668.349: [ParNew38669.323: [SoftReference, 0 refs, > 0.0000160 secs]38669.323: [WeakReference, 12 refs, 0.0000080 > secs]38669.323: [FinalReference, 0 refs, 0.0000040 secs]38669.323: > [PhantomReference, 0 refs, 0.0000060 secs]38669.323: [JNI Weak > Reference, 0.0000080 secs] > Desired survivor size 34865152 bytes, new threshold 1 (max 6) > - age 1: 69728352 bytes, 69728352 total > : 610922K->68096K(613440K), 0.9741030 secs] > 10005726K->9689043K(12514816K), 0.9742160 secs] [Times: user=7.26 > sys=0.01, real=0.97 secs] > Total time for which application threads were stopped: 1.0050700 seconds > ---- >% ---- > > This matches the following safepoint (I guess): > ---- %< ---- > 38668.316: GenCollectForAllocation [ 55 > 7 10 ] [ 0 0 30 0 974 ] 7 > ---- >% ---- > > Which is hiding for reference in > > --- %< ---- > vmop [threads: total initially_running > wait_to_block] [time: spin block sync cleanup vmop] page_trap_count > 38643.172: GenCollectForAllocation [ 55 > 9 12 ] [ 0 0 148 0 130 ] 9 > 38646.203: GenCollectForAllocation [ 55 > 12 12 ] [ 2 0 22 0 139 ] 12 > 38648.293: GenCollectForAllocation [ 55 > 13 13 ] [ 1 0 1 0 144 ] 13 > 38649.852: CMS_Final_Remark [ 55 > 14 14 ] [ 0 0 1 0 168 ] 14 > 38652.328: FindDeadlocks [ 55 > 16 16 ] [ 0 0 59 0 0 ] 16 > 38652.387: FindDeadlocks [ 55 > 4 4 ] [ 0 0 0 0 0 ] 4 > 38652.762: GenCollectForAllocation [ 55 > 9 9 ] [ 0 0 24 0 132 ] 9 > 38655.586: GenCollectForAllocation [ 55 > 10 10 ] [ 0 1 9 0 294 ] 10 > 38656.961: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 0 0 215 ] 5 > 38658.125: GenCollectForAllocation [ 55 > 3 4 ] [ 0 0 1 0 91 ] 2 > 38658.223: CMS_Initial_Mark [ 55 > 2 4 ] [ 0 0 0 0 6 ] 1 > 38658.926: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 1 0 102 ] 6 > 38661.621: GenCollectForAllocation [ 55 > 5 5 ] [ 0 0 1 0 72 ] 5 > 38663.527: GenCollectForAllocation [ 55 > 7 7 ] [ 0 0 0 0 56 ] 7 > 38665.180: GenCollectForAllocation [ 55 > 5 5 ] [ 0 0 1 0 659 ] 4 > 38667.230: GenCollectForAllocation [ 55 > 11 11 ] [ 0 0 0 0 88 ] 11 > 38667.566: FindDeadlocks [ 55 > 13 13 ] [ 0 0 0 0 0 ] 13 > 38667.566: FindDeadlocks [ 55 > 5 5 ] [ 0 0 0 0 0 ] 5 > 38667.582: RevokeBias [ 55 > 10 11 ] [ 0 0 8 0 0 ] 9 > > 38668.316: GenCollectForAllocation [ 55 > 7 10 ] [ 0 0 30 0 974 ] 7 > 38670.551: GenCollectForAllocation [ 55 > 10 10 ] [ 0 0 1 0 391 ] 10 > 38671.875: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 25 0 409 ] 6 > 38674.230: GenCollectForAllocation [ 55 > 11 11 ] [ 1 0 1 0 415 ] 10 > 38676.121: GenCollectForAllocation [ 55 > 1 1 ] [ 0 0 0 0 558 ] 0 > 38677.691: GenCollectForAllocation [ 55 > 7 9 ] [ 0 0 1 0 388 ] 6 > 38679.488: GenCollectForAllocation [ 55 > 6 9 ] [ 0 0 10 0 297 ] 6 > 38680.367: CMS_Final_Remark [ 55 > 12 12 ] [ 0 0 0 0 310 ] 12 > 38682.016: GenCollectForAllocation [ 55 > 10 11 ] [ 0 0 0 0 295 ] 10 > 38683.473: FindDeadlocks [ 55 > 6 8 ] [ 1 0 
9 0 0 ] 4 > 38683.484: FindDeadlocks [ 55 > 7 7 ] [ 0 0 0 0 0 ] 7 > vmop [threads: total initially_running > wait_to_block] [time: spin block sync cleanup vmop] page_trap_count > 38683.867: GenCollectForAllocation [ 55 > 7 8 ] [ 1 0 2 0 198 ] 7 > 38685.070: GenCollectForAllocation [ 55 > 8 8 ] [ 0 0 0 0 430 ] 8 > 38687.312: GenCollectForAllocation [ 55 > 6 8 ] [ 0 0 1 0 287 ] 6 > 38690.094: GenCollectForAllocation [ 55 > 8 10 ] [ 0 0 16 0 49 ] 7 > 38692.910: CMS_Initial_Mark [ 55 > 3 3 ] [ 0 0 1 0 129 ] 2 > 38694.043: no vm operation [ 55 > 7 9 ] [ 0 0 644 0 0 ] 5 > 38696.605: GenCollectForAllocation [ 55 > 9 9 ] [ 0 0 272 0 90 ] 9 > 38698.535: FindDeadlocks [ 55 > 8 13 ] [ 0 0 40 0 0 ] 8 > 38698.578: FindDeadlocks [ 55 > 8 8 ] [ 0 0 3 0 0 ] 8 > 38702.559: GenCollectForAllocation [ 55 > 6 6 ] [ 0 0 1 0 55 ] 6 > 38709.008: GenCollectForAllocation [ 55 > 1 1 ] [ 0 0 0 0 48 ] 0 > 38712.719: CMS_Final_Remark [ 55 > 3 3 ] [ 0 0 0 0 51 ] 2 > 38713.625: FindDeadlocks [ 55 > 2 2 ] [ 0 0 0 0 0 ] 0 > 38713.625: FindDeadlocks [ 55 > 2 2 ] [ 0 0 0 0 0 ] 2 > 38720.492: CMS_Initial_Mark [ 55 > 4 4 ] [ 0 0 1 0 88 ] 4 > 38721.059: GenCollectForAllocation [ 55 > 4 4 ] [ 0 0 1 0 58 ] 3 > 38725.684: GenCollectForAllocation [ 55 > 3 4 ] [ 0 0 0 0 57 ] 2 > 38725.742: CMS_Final_Remark [ 55 > 2 4 ] [ 0 0 0 0 254 ] 2 > 38729.410: FindDeadlocks [ 55 > 5 5 ] [ 0 0 0 0 0 ] 5 > 38729.410: FindDeadlocks [ 55 > 5 5 ] [ 0 0 0 0 0 ] 5 > 38730.527: GenCollectForAllocation [ 55 > 11 10 ] [ 1 0 16 0 43 ] 8 > 38734.977: GenCollectForAllocation [ 55 > 2 2 ] [ 0 0 0 0 53 ] 0 > > ---- >% ---- > > Any thoughts ? Many thanks > > -- Greg > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120123/08d1c232/attachment-0001.html From taras.tielkes at gmail.com Tue Jan 24 10:15:39 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Tue, 24 Jan 2012 19:15:39 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F1ECE7B.3040502@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> Message-ID: Hi Jon, Xmx is 5g, PermGen is 256m, new is 400m. The overall tenured gen usage is at the point when I would expect the CMS to kick in though. Does this mean I'd have to lower the CMS initiating occupancy setting (currently at 68%)? In addition, are the promotion failure sizes expressed in bytes? If so, I'm surprised to see such odd-sized (7, for example) sizes. Thanks, Taras On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu wrote: > > Taras, > > The pattern makes sense if the tenured (cms) gen is in fact full. > Multiple ?GC workers try to get a chunk of space for > an allocation and there is no space. 
> > Jon > > > On 01/24/12 04:22, Taras Tielkes wrote: >> >> Hi Jon, >> >> While inspecting our production logs for promotion failures, I saw the >> following one today: >> -------- >> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >> 349623K->20008K(368640K), 0.0294350 secs] >> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >> sys=0.00, real=0.03 secs] >> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >> 347688K->40960K(368640K), 0.0536700 secs] >> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >> sys=0.00, real=0.05 secs] >> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >> (0: promotion failure size = 6) ?(1: promotion failure size = 6) ?(2: >> promotion failure size = 7) ?(3: promotion failure size = 7) ?(4: >> promotion failure size = 9) ?(5: p >> romotion failure size = 9) ?(6: promotion failure size = 6) ?(7: >> promotion failure size = 9) ?(promotion failed): >> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >> [Times: user=10.17 sys=1.10, real=9.11 secs] >> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >> -------- >> >> It's different from the others in two ways: >> 1) a "parallel" promotion failure in all 8 ParNew threads? >> 2) the very small size of the promoted object >> >> Do such an promotion failure pattern ring a bell? It does not make sense >> to me. >> >> Thanks, >> Taras >> >> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >> ?wrote: >>> >>> Taras, >>> >>>> I assume that the large sizes for the promotion failures during ParNew >>>> are confirming that eliminating large array allocations might help >>>> here. Do you agree? >>> >>> I agree that eliminating the large array allocation will help but you >>> are still having >>> promotion failures when the allocation size is small (I think it was >>> 1026). ?That >>> says that you are filling up the old (cms) generation faster than the GC >>> can >>> collect it. ?The large arrays are aggrevating the problem but not >>> necessarily >>> the cause. >>> >>> If these are still your heap sizes, >>> >>>> -Xms5g >>>> -Xmx5g >>>> -Xmn400m >>> >>> Start by increasing the young gen size as may already have been >>> suggested. ?If you have a test setup where you can experiment, >>> try doubling the young gen size to start. >>> >>> If you have not seen this, it might be helpful. >>> >>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>> >>>> I'm not sure what to make of the concurrent mode >>> >>> The concurrent mode failure is a consequence of the promotion failure. >>> Once the promotion failure happens the concurrent mode failure is >>> inevitable. >>> >>> Jon >>> >>> >>>> . >>> >>> >>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>> >>>> Hi Jon, >>>> >>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>> application yesterday. >>>> The application is running on 4 (homogenous) nodes. 
>>>> >>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>> event during ParNew: >>>> >>>> node-002 >>>> ------- >>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>> 357592K->23382K(368640K), 0.0298150 secs] >>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>> sys=0.01, real=0.03 secs] >>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>> 351062K->39795K(368640K), 0.0401170 secs] >>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>> sys=0.00, real=0.04 secs] >>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>> promotion failure size = 4281460) ?(promotion failed): >>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>> >>>> node-003 >>>> ------- >>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>> 346950K->21342K(368640K), 0.0333090 secs] >>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>> sys=0.00, real=0.03 secs] >>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>> 345070K->32211K(368640K), 0.0369260 secs] >>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>> sys=0.00, real=0.04 secs] >>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>> promotion failure size = 1266955) ?(promotion failed): >>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>> >>>> node-004 >>>> ------- >>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>> 358429K->40960K(368640K), 0.0629910 secs] >>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>> sys=0.02, real=0.06 secs] >>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>> 368640K->40960K(368640K), 0.0819780 secs] >>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>> sys=0.00, real=0.08 secs] >>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>> promotion failure size = 2788662) ?(promotion failed): >>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>> 2012-01-10T18:59:20.539+0100: 
50181.850: [GC 50181.850: [ParNew: >>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>> >>>> On a fourth node, I've found a different event: promotion failure >>>> during CMS, with a much smaller size: >>>> >>>> node-001 >>>> ------- >>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>> 354039K->40960K(368640K), 0.0667340 secs] >>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>> sys=0.01, real=0.06 secs] >>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>> 368640K->40960K(368640K), 0.2586390 secs] >>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>> sys=0.13, real=0.26 secs] >>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>> user=0.07 sys=0.00, real=0.07 secs] >>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>> [CMS-concurrent-abortable-preclean-start] >>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>> 368640K->40960K(368640K), 0.1214420 secs] >>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>> sys=0.05, real=0.12 secs] >>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>> user=10.72 sys=0.48, real=2.70 secs] >>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>> [ParNew (promotion failure size = 1026) ?(promotion failed): >>>> 206521K->206521K(368640K), 0.1667280 secs] >>>> ? 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>> sys=0.04, real=0.17 secs] >>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>> secs]48434.474: [scrub symbol& ? ?string tables, 0.0088370 secs] [1 >>>> CMS-remark: 3489675K(4833280K)] 36961 >>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>> secs] >>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>> ? 
(concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>> sys=0.00, real=8.61 secs] >>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>> >>>> I assume that the large sizes for the promotion failures during ParNew >>>> are confirming that eliminating large array allocations might help >>>> here. Do you agree? >>>> I'm not sure what to make of the concurrent mode failure. >>>> >>>> Thanks in advance for any suggestions, >>>> Taras >>>> >>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>> ? ?wrote: >>>>> >>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>> >>>>>> Hi Jon, >>>>>> >>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>> environment, but so far, no failures yet. >>>>>> We know that the load we generate there is not representative. But >>>>>> perhaps we'll catch something, given enough patience. >>>>>> >>>>>> The flag will also be enabled in our production environment next week >>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>> I'll also do some allocation profiling of the application in isolation >>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>> there. >>>>>> >>>>>> I've got two questions for now: >>>>>> >>>>>> 1) From googling around on the output to expect >>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>> ------- >>>>>> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>> ------- >>>>>> In that example line, what does the "0" stand for? >>>>> >>>>> It's the index of the GC worker thread ?that experienced the promotion >>>>> failure. 
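To put rough numbers on those sizes: they are heap words rather than bytes (as confirmed later in this thread), and a heap word on a 64-bit HotSpot JVM is 8 bytes. So "0: promotion failure size = 2698" means worker thread 0 failed to promote an object of roughly 21 KB, while the 4281460-word failure on node-002 above is roughly 33 MB and the 2788662-word failure on node-004 roughly 21 MB - the same order of magnitude as the char[]/byte[] copies of the 5-30 MB XML documents suspected earlier in this thread.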
>>>>> >>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>> application: >>>>>> ------- >>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>> sys=0.01, real=0.06 secs] >>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>> sys=0.00, real=0.06 secs] >>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>> [CMS-concurrent-preclean-start] >>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>> ? ?CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>> 3432839K->3423755K(5201920 >>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>> 0.0289480 secs]2136605.833: [scrub symbol& ? ? ?string tables, >>>>>> 0.0093940 >>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>> real=7.61 secs] >>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>> [CMS-concurrent-sweep-start] >>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>> ? ?(concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>> sys=0.00, real=10.29 secs] >>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>> ------- >>>>>> >>>>>> In this case I don't know how to interpret the output. >>>>>> a) There's a promotion failure that took 7.49 secs >>>>> >>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>> to >>>>> do recovery >>>>> from the failure. 
>>>>> >>>>>> b) There's a full GC that took 14.08 secs >>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>> >>>>> Not sure about b) and c) because the output is mixed up with the >>>>> concurrent-sweep >>>>> output but ?I think the "concurrent mode failure" message is part of >>>>> the >>>>> "Full GC" >>>>> message. ?My guess is that the 10.29 is the time for the Full GC and >>>>> the >>>>> 14.08 >>>>> maybe is part of the concurrent-sweep message. ?Really hard to be sure. >>>>> >>>>> Jon >>>>>> >>>>>> How are these events, and their (real) times related to each other? >>>>>> >>>>>> Thanks in advance, >>>>>> Taras >>>>>> >>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>> Masamitsu ? ? ?wrote: >>>>>>> >>>>>>> Taras, >>>>>>> >>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>> way to identify the root of your promotion failures (or >>>>>>> at least eliminating some possible causes). ? ?I think it >>>>>>> would help focus the discussion if you could send >>>>>>> the result of that experiment early. >>>>>>> >>>>>>> Jon >>>>>>> >>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>> experiencing occasional promotion failures. >>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>> >>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>> promotion failures are fairly uniform across those. >>>>>>>> >>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>> (b). >>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>> triggered by the same cause), see (c) below. >>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>> this. >>>>>>>> >>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>> small objects), the application also runs a few dozen background >>>>>>>> threads that download and process XML documents, typically in the >>>>>>>> 5-30 >>>>>>>> MB range. >>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>> a >>>>>>>> String/char[]). >>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>> >>>>>>>> My questions are: >>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>> failure? >>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>> a >>>>>>>> feeling for the size of the objects that fail promotion. >>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>> case? 
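For reference, a sketch of the flag set commonly used to watch CMS free-list fragmentation directly - treat it as an illustration to try in a test environment rather than a tuned recommendation:

--------------------------------
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
--------------------------------

PrintFLSStatistics=1 dumps the CMS free-list totals around each collection; a maximum free chunk size that keeps shrinking while the total free space stays large is the classic signature of fragmentation, whereas a small total free space points at plain occupancy pressure instead.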
>>>>>>>> >>>>>>>> Thanks in advance, >>>>>>>> Taras >>>>>>>> >>>>>>>> a) Current JVM options: >>>>>>>> -------------------------------- >>>>>>>> -server >>>>>>>> -Xms5g >>>>>>>> -Xmx5g >>>>>>>> -Xmn400m >>>>>>>> -XX:PermSize=256m >>>>>>>> -XX:MaxPermSize=256m >>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>> -verbose:gc >>>>>>>> -XX:+PrintGCDateStamps >>>>>>>> -XX:+PrintGCDetails >>>>>>>> -XX:SurvivorRatio=8 >>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>> -XX:+UseParNewGC >>>>>>>> -XX:+DisableExplicitGC >>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>> -Xloggc:gc.log >>>>>>>> -------------------------------- >>>>>>>> >>>>>>>> b) Promotion failure during ParNew >>>>>>>> -------------------------------- >>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>> 3505808K->434291K >>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>> 761971K->514584K(5201920K), >>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>> 842264K->625681K(5201920K), >>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>> 953361K->684121K(5201920K), >>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>> -------------------------------- >>>>>>>> >>>>>>>> c) Promotion failure during CMS >>>>>>>> -------------------------------- >>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>> 3293984K(4833280K)] 
3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>> [CMS-concurrent-mark-start] >>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>> secs] >>>>>>>> ? ? 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>> refs >>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>> secs]703031.419: [scrub symbol& ? ? ? ?string tables, 0.0094960 >>>>>>>> secs] [1 CMS >>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>> ? ? 
(concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>> 10.9594300 >>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>> 761117K->517836K(5201920K), >>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>> 845516K->557872K(5201920K), >>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>> 885552K->603017K(5201920K), >>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>> -------------------------------- >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Tue Jan 24 14:21:11 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Tue, 24 Jan 2012 14:21:11 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> Message-ID: <4F1F2ED7.6060308@oracle.com> On 01/24/12 10:15, Taras Tielkes wrote: > Hi Jon, > > Xmx is 5g, PermGen is 256m, new is 400m. > > The overall tenured gen usage is at the point when I would expect the > CMS to kick in though. > Does this mean I'd have to lower the CMS initiating occupancy setting > (currently at 68%)? I don't have any quick answers as to what to try next. > > In addition, are the promotion failure sizes expressed in bytes? If > so, I'm surprised to see such odd-sized (7, for example) sizes. It's in words. Jon > > Thanks, > Taras > > On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu wrote: >> Taras, >> >> The pattern makes sense if the tenured (cms) gen is in fact full. >> Multiple GC workers try to get a chunk of space for >> an allocation and there is no space. 
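For reference, the occupancy knob being asked about above is the existing pair below; whether lowering it actually helps this workload is exactly the open question, so the value shown is only an illustration:

--------------------------------
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=60
--------------------------------

A lower fraction starts the background CMS cycle earlier, trading more frequent concurrent collections for more free space in the old generation at the moment ParNew needs to promote.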
>> >> Jon >> >> >> On 01/24/12 04:22, Taras Tielkes wrote: >>> Hi Jon, >>> >>> While inspecting our production logs for promotion failures, I saw the >>> following one today: >>> -------- >>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>> 349623K->20008K(368640K), 0.0294350 secs] >>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>> sys=0.00, real=0.03 secs] >>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>> 347688K->40960K(368640K), 0.0536700 secs] >>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>> sys=0.00, real=0.05 secs] >>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>> (0: promotion failure size = 6) (1: promotion failure size = 6) (2: >>> promotion failure size = 7) (3: promotion failure size = 7) (4: >>> promotion failure size = 9) (5: p >>> romotion failure size = 9) (6: promotion failure size = 6) (7: >>> promotion failure size = 9) (promotion failed): >>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>> -------- >>> >>> It's different from the others in two ways: >>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>> 2) the very small size of the promoted object >>> >>> Do such an promotion failure pattern ring a bell? It does not make sense >>> to me. >>> >>> Thanks, >>> Taras >>> >>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>> wrote: >>>> Taras, >>>> >>>>> I assume that the large sizes for the promotion failures during ParNew >>>>> are confirming that eliminating large array allocations might help >>>>> here. Do you agree? >>>> I agree that eliminating the large array allocation will help but you >>>> are still having >>>> promotion failures when the allocation size is small (I think it was >>>> 1026). That >>>> says that you are filling up the old (cms) generation faster than the GC >>>> can >>>> collect it. The large arrays are aggrevating the problem but not >>>> necessarily >>>> the cause. >>>> >>>> If these are still your heap sizes, >>>> >>>>> -Xms5g >>>>> -Xmx5g >>>>> -Xmn400m >>>> Start by increasing the young gen size as may already have been >>>> suggested. If you have a test setup where you can experiment, >>>> try doubling the young gen size to start. >>>> >>>> If you have not seen this, it might be helpful. >>>> >>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>> I'm not sure what to make of the concurrent mode >>>> The concurrent mode failure is a consequence of the promotion failure. >>>> Once the promotion failure happens the concurrent mode failure is >>>> inevitable. >>>> >>>> Jon >>>> >>>> >>>>> . >>>> >>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>> Hi Jon, >>>>> >>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>> application yesterday. >>>>> The application is running on 4 (homogenous) nodes. 
>>>>> >>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>> event during ParNew: >>>>> >>>>> node-002 >>>>> ------- >>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>> sys=0.01, real=0.03 secs] >>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>> sys=0.00, real=0.04 secs] >>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>> promotion failure size = 4281460) (promotion failed): >>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>> >>>>> node-003 >>>>> ------- >>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>> sys=0.00, real=0.03 secs] >>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>> sys=0.00, real=0.04 secs] >>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>> promotion failure size = 1266955) (promotion failed): >>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>> >>>>> node-004 >>>>> ------- >>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>> sys=0.02, real=0.06 secs] >>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>> sys=0.00, real=0.08 secs] >>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>> promotion failure size = 2788662) (promotion failed): >>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>> 4.7132220 secs] [Times: user=4.99 
sys=0.01, real=4.72 secs] >>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>> >>>>> On a fourth node, I've found a different event: promotion failure >>>>> during CMS, with a much smaller size: >>>>> >>>>> node-001 >>>>> ------- >>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>> sys=0.01, real=0.06 secs] >>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>> sys=0.13, real=0.26 secs] >>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>> [CMS-concurrent-abortable-preclean-start] >>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>> sys=0.05, real=0.12 secs] >>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>>> [ParNew (promotion failure size = 1026) (promotion failed): >>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>> 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>> sys=0.04, real=0.17 secs] >>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>> secs]48434.474: [scrub symbol& string tables, 0.0088370 secs] [1 >>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>> secs] >>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>> (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>> sys=0.00, real=8.61 secs] >>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>> 0.0411550 secs] [Times: user=0.25 
sys=0.00, real=0.04 secs] >>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>> >>>>> I assume that the large sizes for the promotion failures during ParNew >>>>> are confirming that eliminating large array allocations might help >>>>> here. Do you agree? >>>>> I'm not sure what to make of the concurrent mode failure. >>>>> >>>>> Thanks in advance for any suggestions, >>>>> Taras >>>>> >>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>> wrote: >>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>> Hi Jon, >>>>>>> >>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>> environment, but so far, no failures yet. >>>>>>> We know that the load we generate there is not representative. But >>>>>>> perhaps we'll catch something, given enough patience. >>>>>>> >>>>>>> The flag will also be enabled in our production environment next week >>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>> there. >>>>>>> >>>>>>> I've got two questions for now: >>>>>>> >>>>>>> 1) From googling around on the output to expect >>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>> ------- >>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) (promotion >>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>> ------- >>>>>>> In that example line, what does the "0" stand for? >>>>>> It's the index of the GC worker thread that experienced the promotion >>>>>> failure. 
>>>>>> >>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>> application: >>>>>>> ------- >>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>> sys=0.01, real=0.06 secs] >>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>> sys=0.00, real=0.06 secs] >>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>> [CMS-concurrent-preclean-start] >>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>> CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>> 3432839K->3423755K(5201920 >>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& string tables, >>>>>>> 0.0093940 >>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>> real=7.61 secs] >>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>> [CMS-concurrent-sweep-start] >>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>> (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>> sys=0.00, real=10.29 secs] >>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>> ------- >>>>>>> >>>>>>> In this case I don't know how to interpret the output. >>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>> to >>>>>> do recovery >>>>>> from the failure. 
>>>>>> >>>>>>> b) There's a full GC that took 14.08 secs >>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>> concurrent-sweep >>>>>> output but I think the "concurrent mode failure" message is part of >>>>>> the >>>>>> "Full GC" >>>>>> message. My guess is that the 10.29 is the time for the Full GC and >>>>>> the >>>>>> 14.08 >>>>>> maybe is part of the concurrent-sweep message. Really hard to be sure. >>>>>> >>>>>> Jon >>>>>>> How are these events, and their (real) times related to each other? >>>>>>> >>>>>>> Thanks in advance, >>>>>>> Taras >>>>>>> >>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>> Masamitsu wrote: >>>>>>>> Taras, >>>>>>>> >>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>> way to identify the root of your promotion failures (or >>>>>>>> at least eliminating some possible causes). I think it >>>>>>>> would help focus the discussion if you could send >>>>>>>> the result of that experiment early. >>>>>>>> >>>>>>>> Jon >>>>>>>> >>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>> experiencing occasional promotion failures. >>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>> >>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>> >>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>> (b). >>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>> this. >>>>>>>>> >>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>> 5-30 >>>>>>>>> MB range. >>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>> a >>>>>>>>> String/char[]). >>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>> >>>>>>>>> My questions are: >>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>> failure? >>>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>> a >>>>>>>>> feeling for the size of the objects that fail promotion. >>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>> case? 
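On the byte[]-then-String/char[] double copy of the downloaded XML described above: the application's parsing code is not shown in this thread, so the following is only a sketch of a streaming alternative (StAX used purely as an example parser) that avoids materializing a 30-60 MB array per document in the first place:

--------------------------------
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;

public class StreamingXmlSketch {
    // Parse directly from the download stream instead of buffering the whole
    // document into a byte[] and then copying it again into a String/char[].
    public static void parse(InputStream in) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamReader.START_ELEMENT) {
                    // handle reader.getLocalName() incrementally here
                }
            }
        } finally {
            reader.close();
        }
    }
}
--------------------------------

Keeping the per-document allocations small and short-lived should remove the multi-megabyte promotions that the CMS old generation currently has to find contiguous free space for.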
>>>>>>>>> >>>>>>>>> Thanks in advance, >>>>>>>>> Taras >>>>>>>>> >>>>>>>>> a) Current JVM options: >>>>>>>>> -------------------------------- >>>>>>>>> -server >>>>>>>>> -Xms5g >>>>>>>>> -Xmx5g >>>>>>>>> -Xmn400m >>>>>>>>> -XX:PermSize=256m >>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>> -verbose:gc >>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>> -XX:+PrintGCDetails >>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>> -XX:+UseParNewGC >>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>> -Xloggc:gc.log >>>>>>>>> -------------------------------- >>>>>>>>> >>>>>>>>> b) Promotion failure during ParNew >>>>>>>>> -------------------------------- >>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>> 3505808K->434291K >>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>> -------------------------------- >>>>>>>>> >>>>>>>>> c) Promotion failure during CMS >>>>>>>>> -------------------------------- >>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>> 
2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>> secs] >>>>>>>>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>> refs >>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>> secs]703031.419: [scrub symbol& string tables, 0.0094960 >>>>>>>>> secs] [1 CMS >>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>> (concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>> 10.9594300 >>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>> 885552K->603017K(5201920K), >>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>> -------------------------------- >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing 
list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From karmazilla at gmail.com Wed Jan 25 01:37:28 2012 From: karmazilla at gmail.com (Christian Vest Hansen) Date: Wed, 25 Jan 2012 10:37:28 +0100 Subject: JVM Crash during GC In-Reply-To: <4DD29C64.8060501@oracle.com> References: <4DD29C64.8060501@oracle.com> Message-ID: Hi, How do you enable heap verification? Is there a place where I can read more about what it does? On Tue, May 17, 2011 at 18:03, Y. S. Ramakrishna wrote: > Hi Shane, that's 6u18 which is about 18 months old. Could you try > the latest, 6u25, and see if the problem reproduces? > > The crash is somewhat generic in that we crash when scanning > cards during a scavenge, presumably running across a bad pointer. > > If you need to stick with that JVM, you can try turning off > compressed oops explicitly, and/or enable heap verification > to see if it catches anything sooner. > > If the problem reproduces with the latest bits, we'd definitely > be interested in a formal bug report with a test case. > > -- ramki > > On 05/17/11 08:46, Shane Cox wrote: > > Has anyone seen a JVM crash similar to this one? Wondering if this is a > > new or existing problem. Any insights would be appreciated. 
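On the heap-verification question above: in HotSpot builds of this era it is normally switched on with diagnostic flags along the lines of the sketch below (not verified against 6u18 specifically), and compressed oops can be disabled explicitly with -XX:-UseCompressedOops:

--------------------------------
-XX:+UnlockDiagnosticVMOptions -XX:+VerifyBeforeGC -XX:+VerifyAfterGC
--------------------------------

Roughly speaking, verification walks the heap before and/or after each collection and fails fast on a dangling or corrupt pointer, so a crash like the one quoted below tends to be caught closer to its origin - at the cost of pauses far too long for production use.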
> > > > Thanks, > > Shane > > > > > > # A fatal error has been detected by the Java Runtime Environment: > > # > > # SIGSEGV (0xb) at pc=0x00002b0f1733cdc9, pid=14532, tid=1093286208 > > # > > # SIGSEGV (0xb) at pc=0x00002b0f1733cdc9, pid=14532, tid=1093286208 > > # > > # JRE version: 6.0_18-b07 > > # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode > > linux-amd64 ) > > # Problematic frame: > > # V [libjvm.so+0x3b1dc9] > > > > > > Current thread (0x0000000056588800): GCTaskThread [stack: > > 0x0000000000000000,0x0000000000000000] [id=14539] > > > > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), > > si_addr=0x0000000000000025;; > > > > > > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, > > C=native code) > > V [libjvm.so+0x3b1dc9];; void > > ParScanClosure::do_oop_work(oopDesc**, bool, bool)+0x79 > > V [libjvm.so+0x5e7f03];; > > ParRootScanWithBarrierTwoGensClosure::do_oop(oopDesc**)+0x13 > > V [libjvm.so+0x3ab18c];; instanceKlass::oop_oop_iterate_nv_m(oopDesc*, > > FilteringClosure*, MemRegion)+0x16c > > V [libjvm.so+0x297aff];; > > FreeListSpace_DCTOC::walk_mem_region_with_cl_par(MemRegion, HeapWord*, > > HeapWord*, FilteringClosure*)+0x13f > > V [libjvm.so+0x297995];; > > FreeListSpace_DCTOC::walk_mem_region_with_cl(MemRegion, HeapWord*, > > HeapWord*, FilteringClosure*)+0x35 > > V [libjvm.so+0x66014f];; Filtering_DCTOC::walk_mem_region(MemRegion, > > HeapWord*, HeapWord*)+0x5f > > V [libjvm.so+0x65fee9];; > > DirtyCardToOopClosure::do_MemRegion(MemRegion)+0xf9 > > V [libjvm.so+0x24153d];; > > ClearNoncleanCardWrapper::do_MemRegion(MemRegion)+0xdd > > V [libjvm.so+0x23ffea];; > > CardTableModRefBS::non_clean_card_iterate_work(MemRegion, > > MemRegionClosure*, bool)+0x1ca > > V [libjvm.so+0x5e504b];; CardTableModRefBS::process_stride(Space*, > > MemRegion, int, int, DirtyCardToOopClosure*, MemRegionClosure*, bool, > > signed char**, unsigned long, unsigned long)+0x13b > > V [libjvm.so+0x5e4e98];; > > CardTableModRefBS::par_non_clean_card_iterate_work(Space*, MemRegion, > > DirtyCardToOopClosure*, MemRegionClosure*, bool, int)+0xc8 > > V [libjvm.so+0x23fdfb];; > > CardTableModRefBS::non_clean_card_iterate(Space*, MemRegion, > > DirtyCardToOopClosure*, MemRegionClosure*, bool)+0x5b > > V [libjvm.so+0x240b9a];; > > CardTableRS::younger_refs_in_space_iterate(Space*, > OopsInGenClosure*)+0x8a > > V [libjvm.so+0x379378];; > > Generation::younger_refs_in_space_iterate(Space*, OopsInGenClosure*)+0x18 > > V [libjvm.so+0x2c5c5f];; > > > ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*)+0x4f > > V [libjvm.so+0x240a8a];; > > CardTableRS::younger_refs_iterate(Generation*, OopsInGenClosure*)+0x2a > > V [libjvm.so+0x36bfcd];; > > GenCollectedHeap::gen_process_strong_roots(int, bool, bool, > > SharedHeap::ScanningOption, OopsInGenClosure*, OopsInGenClosure*)+0x9d > > V [libjvm.so+0x5e82c9];; ParNewGenTask::work(int)+0xc9 > > V [libjvm.so+0x722e0d];; GangWorker::loop()+0xad > > V [libjvm.so+0x722d24];; GangWorker::run()+0x24 > > V [libjvm.so+0x5da2af];; java_start(Thread*)+0x13f > > > > VM Arguments: > > jvm_args: -verbose:gc -XX:+PrintGCDetails -XX:+PrintHeapAtGC > > -XX:+PrintGCDateStamps -XX:+UseParNewGC -Xmx4000m -Xms4000m -Xss256k > > -XX:PermSize=256M -XX:MaxPermSize=256M -XX:+UseConcMarkSweepGC > > -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing > > -XX:+CMSPermGenSweepingEnabled -XX:+ExplicitGCInvokesConcurrent > > > > OS:Red Hat Enterprise Linux Server release 5.3 (Tikanga) > > > > uname:Linux 2.6.18-128.el5 #1 SMP Wed Jan 
21 08:45:05 EST 2009 x86_64 > > > > vm_info: Java HotSpot(TM) 64-Bit Server VM (16.0-b13) for linux-amd64 > > JRE (1.6.0_18-b07), built on Dec 17 2009 13:42:22 by "java_re" with gcc > > 3.2.2 (SuSE Linux) > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > hotspot-gc-use mailing list > > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -- Venlig hilsen / Kind regards, Christian Vest Hansen. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120125/fb2235b4/attachment.html From taras.tielkes at gmail.com Wed Jan 25 02:41:54 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Wed, 25 Jan 2012 11:41:54 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F1F2ED7.6060308@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> <4F1F2ED7.6060308@oracle.com> Message-ID: Hi Jon, At the risk of asking a stupid question, what's the word size on x64 when using CompressedOops? Thanks, Taras On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu wrote: > > > On 01/24/12 10:15, Taras Tielkes wrote: >> Hi Jon, >> >> Xmx is 5g, PermGen is 256m, new is 400m. >> >> The overall tenured gen usage is at the point when I would expect the >> CMS to kick in though. >> Does this mean I'd have to lower the CMS initiating occupancy setting >> (currently at 68%)? > > I don't have any quick answers as to what to try next. > >> >> In addition, are the promotion failure sizes expressed in bytes? If >> so, I'm surprised to see such odd-sized (7, for example) sizes. > > It's in words. > > Jon >> >> Thanks, >> Taras >> >> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu ?wrote: >>> Taras, >>> >>> The pattern makes sense if the tenured (cms) gen is in fact full. >>> Multiple ?GC workers try to get a chunk of space for >>> an allocation and there is no space. 
>>> >>> Jon >>> >>> >>> On 01/24/12 04:22, Taras Tielkes wrote: >>>> Hi Jon, >>>> >>>> While inspecting our production logs for promotion failures, I saw the >>>> following one today: >>>> -------- >>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>>> 349623K->20008K(368640K), 0.0294350 secs] >>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>>> sys=0.00, real=0.03 secs] >>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>>> 347688K->40960K(368640K), 0.0536700 secs] >>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>>> sys=0.00, real=0.05 secs] >>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>>> (0: promotion failure size = 6) ?(1: promotion failure size = 6) ?(2: >>>> promotion failure size = 7) ?(3: promotion failure size = 7) ?(4: >>>> promotion failure size = 9) ?(5: p >>>> romotion failure size = 9) ?(6: promotion failure size = 6) ?(7: >>>> promotion failure size = 9) ?(promotion failed): >>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>> -------- >>>> >>>> It's different from the others in two ways: >>>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>>> 2) the very small size of the promoted object >>>> >>>> Do such an promotion failure pattern ring a bell? It does not make sense >>>> to me. >>>> >>>> Thanks, >>>> Taras >>>> >>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>>> ? wrote: >>>>> Taras, >>>>> >>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>> are confirming that eliminating large array allocations might help >>>>>> here. Do you agree? >>>>> I agree that eliminating the large array allocation will help but you >>>>> are still having >>>>> promotion failures when the allocation size is small (I think it was >>>>> 1026). ?That >>>>> says that you are filling up the old (cms) generation faster than the GC >>>>> can >>>>> collect it. ?The large arrays are aggrevating the problem but not >>>>> necessarily >>>>> the cause. >>>>> >>>>> If these are still your heap sizes, >>>>> >>>>>> -Xms5g >>>>>> -Xmx5g >>>>>> -Xmn400m >>>>> Start by increasing the young gen size as may already have been >>>>> suggested. ?If you have a test setup where you can experiment, >>>>> try doubling the young gen size to start. >>>>> >>>>> If you have not seen this, it might be helpful. >>>>> >>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>>> I'm not sure what to make of the concurrent mode >>>>> The concurrent mode failure is a consequence of the promotion failure. >>>>> Once the promotion failure happens the concurrent mode failure is >>>>> inevitable. >>>>> >>>>> Jon >>>>> >>>>> >>>>>> . >>>>> >>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>>> Hi Jon, >>>>>> >>>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>>> application yesterday. 
>>>>>> The application is running on 4 (homogenous) nodes. >>>>>> >>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>>> event during ParNew: >>>>>> >>>>>> node-002 >>>>>> ------- >>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>>> sys=0.01, real=0.03 secs] >>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>>> sys=0.00, real=0.04 secs] >>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>>> promotion failure size = 4281460) ?(promotion failed): >>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>>> >>>>>> node-003 >>>>>> ------- >>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>>> sys=0.00, real=0.03 secs] >>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>>> sys=0.00, real=0.04 secs] >>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>>> promotion failure size = 1266955) ?(promotion failed): >>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>>> >>>>>> node-004 >>>>>> ------- >>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>>> sys=0.02, real=0.06 secs] >>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>>> sys=0.00, real=0.08 secs] >>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>>> promotion failure size = 2788662) ?(promotion failed): >>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>>> 3310044K->330922K(4833280K), 4.5104170 
secs] >>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>>> >>>>>> On a fourth node, I've found a different event: promotion failure >>>>>> during CMS, with a much smaller size: >>>>>> >>>>>> node-001 >>>>>> ------- >>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>>> sys=0.01, real=0.06 secs] >>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>>> sys=0.13, real=0.26 secs] >>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>>> sys=0.05, real=0.12 secs] >>>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>>>> [ParNew (promotion failure size = 1026) ?(promotion failed): >>>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>>> ? ?3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>>> sys=0.04, real=0.17 secs] >>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>>> secs]48434.474: [scrub symbol& ? ? ?string tables, 0.0088370 secs] [1 >>>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>>> secs] >>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>>> ? 
?(concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>>> sys=0.00, real=8.61 secs] >>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>>> >>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>> are confirming that eliminating large array allocations might help >>>>>> here. Do you agree? >>>>>> I'm not sure what to make of the concurrent mode failure. >>>>>> >>>>>> Thanks in advance for any suggestions, >>>>>> Taras >>>>>> >>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>>> ? ? wrote: >>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>>> Hi Jon, >>>>>>>> >>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>>> environment, but so far, no failures yet. >>>>>>>> We know that the load we generate there is not representative. But >>>>>>>> perhaps we'll catch something, given enough patience. >>>>>>>> >>>>>>>> The flag will also be enabled in our production environment next week >>>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>>> there. >>>>>>>> >>>>>>>> I've got two questions for now: >>>>>>>> >>>>>>>> 1) From googling around on the output to expect >>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>>> ------- >>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>>> ------- >>>>>>>> In that example line, what does the "0" stand for? >>>>>>> It's the index of the GC worker thread ?that experienced the promotion >>>>>>> failure. 
>>>>>>> >>>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>>> application: >>>>>>>> ------- >>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>>> sys=0.00, real=0.06 secs] >>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>> ? ? CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>>> 3432839K->3423755K(5201920 >>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& ? ? ? ?string tables, >>>>>>>> 0.0093940 >>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>>> real=7.61 secs] >>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>>> ? ? (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>>> sys=0.00, real=10.29 secs] >>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>>> ------- >>>>>>>> >>>>>>>> In this case I don't know how to interpret the output. 
>>>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>>> to >>>>>>> do recovery >>>>>>> from the failure. >>>>>>> >>>>>>>> b) There's a full GC that took 14.08 secs >>>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>>> concurrent-sweep >>>>>>> output but ?I think the "concurrent mode failure" message is part of >>>>>>> the >>>>>>> "Full GC" >>>>>>> message. ?My guess is that the 10.29 is the time for the Full GC and >>>>>>> the >>>>>>> 14.08 >>>>>>> maybe is part of the concurrent-sweep message. ?Really hard to be sure. >>>>>>> >>>>>>> Jon >>>>>>>> How are these events, and their (real) times related to each other? >>>>>>>> >>>>>>>> Thanks in advance, >>>>>>>> Taras >>>>>>>> >>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>>> Masamitsu ? ? ? ?wrote: >>>>>>>>> Taras, >>>>>>>>> >>>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>>> way to identify the root of your promotion failures (or >>>>>>>>> at least eliminating some possible causes). ? ?I think it >>>>>>>>> would help focus the discussion if you could send >>>>>>>>> the result of that experiment early. >>>>>>>>> >>>>>>>>> Jon >>>>>>>>> >>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>>> experiencing occasional promotion failures. >>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>>> >>>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>>> >>>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>>> (b). >>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>>> this. >>>>>>>>>> >>>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>>> 5-30 >>>>>>>>>> MB range. >>>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>>> a >>>>>>>>>> String/char[]). >>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>>> >>>>>>>>>> My questions are: >>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>>> failure? >>>>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>>> a >>>>>>>>>> feeling for the size of the objects that fail promotion. 
>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>>> case? >>>>>>>>>> >>>>>>>>>> Thanks in advance, >>>>>>>>>> Taras >>>>>>>>>> >>>>>>>>>> a) Current JVM options: >>>>>>>>>> -------------------------------- >>>>>>>>>> -server >>>>>>>>>> -Xms5g >>>>>>>>>> -Xmx5g >>>>>>>>>> -Xmn400m >>>>>>>>>> -XX:PermSize=256m >>>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>>> -verbose:gc >>>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>>> -XX:+PrintGCDetails >>>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>>> -XX:+UseParNewGC >>>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>>> -Xloggc:gc.log >>>>>>>>>> -------------------------------- >>>>>>>>>> >>>>>>>>>> b) Promotion failure during ParNew >>>>>>>>>> -------------------------------- >>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>>> 3505808K->434291K >>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>>> -------------------------------- >>>>>>>>>> >>>>>>>>>> c) Promotion failure during CMS >>>>>>>>>> -------------------------------- >>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>>> 
2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>>> secs] >>>>>>>>>> ? ? ?3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>>> refs >>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>>> secs]703031.419: [scrub symbol& ? ? ? ? ?string tables, 0.0094960 >>>>>>>>>> secs] [1 CMS >>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>>> ? ? 
?(concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>>> 10.9594300 >>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>>> 885552K->603017K(5201920K), >>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>>> -------------------------------- >>>>>>>>>> _______________________________________________ >>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>> _______________________________________________ >>>>> hotspot-gc-use mailing list >>>>> hotspot-gc-use at openjdk.java.net >>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Wed Jan 25 22:49:49 2012 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Wed, 25 Jan 2012 22:49:49 -0800 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> <4F1F2ED7.6060308@oracle.com> Message-ID: <4F20F78D.9070905@oracle.com> On 1/25/2012 2:41 AM, Taras Tielkes wrote: > Hi Jon, > > At the risk of asking a stupid question, what's the word size on x64 > when using CompressedOops? Word size is the same with and without CompressedOops (8 bytes). With CompressedOops we can just point to words with a 32bit reference (i.e., map the 32bit reference to a full 64bit address). 
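To make the units concrete, here is a quick back-of-the-envelope check. It is just an illustrative sketch (the class name is made up, and the 4281460 value is simply copied from one of the "promotion failure size" log lines earlier in this thread), assuming the 8-byte heap word described above:

-------
// Illustrative snippet only, not part of the JDK or of the application.
// Promotion failure sizes in the GC log are reported in heap words.
// A heap word is 8 bytes on x64, with or without CompressedOops.
public class PromotionFailureSize {
    private static final long HEAP_WORD_BYTES = 8L;

    public static void main(String[] args) {
        // Taken from an earlier log line: "(4: promotion failure size = 4281460)"
        long failureSizeWords = 4281460L;
        long failureSizeBytes = failureSizeWords * HEAP_WORD_BYTES;
        // Prints roughly 32.7 MB for this example.
        System.out.println(failureSizeBytes / (1024.0 * 1024.0) + " MB");
    }
}
-------

The same arithmetic applies to any of the failure sizes in your logs: multiply the reported word count by 8 to get the allocation size in bytes.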
Jon > Thanks, > Taras > > On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu > wrote: >> On 01/24/12 10:15, Taras Tielkes wrote: >>> Hi Jon, >>> >>> Xmx is 5g, PermGen is 256m, new is 400m. >>> >>> The overall tenured gen usage is at the point when I would expect the >>> CMS to kick in though. >>> Does this mean I'd have to lower the CMS initiating occupancy setting >>> (currently at 68%)? >> I don't have any quick answers as to what to try next. >> >>> In addition, are the promotion failure sizes expressed in bytes? If >>> so, I'm surprised to see such odd-sized (7, for example) sizes. >> It's in words. >> >> Jon >>> Thanks, >>> Taras >>> >>> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu wrote: >>>> Taras, >>>> >>>> The pattern makes sense if the tenured (cms) gen is in fact full. >>>> Multiple GC workers try to get a chunk of space for >>>> an allocation and there is no space. >>>> >>>> Jon >>>> >>>> >>>> On 01/24/12 04:22, Taras Tielkes wrote: >>>>> Hi Jon, >>>>> >>>>> While inspecting our production logs for promotion failures, I saw the >>>>> following one today: >>>>> -------- >>>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>>>> 349623K->20008K(368640K), 0.0294350 secs] >>>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>>>> sys=0.00, real=0.03 secs] >>>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>>>> 347688K->40960K(368640K), 0.0536700 secs] >>>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>>>> sys=0.00, real=0.05 secs] >>>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>>>> (0: promotion failure size = 6) (1: promotion failure size = 6) (2: >>>>> promotion failure size = 7) (3: promotion failure size = 7) (4: >>>>> promotion failure size = 9) (5: p >>>>> romotion failure size = 9) (6: promotion failure size = 6) (7: >>>>> promotion failure size = 9) (promotion failed): >>>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>>>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>> -------- >>>>> >>>>> It's different from the others in two ways: >>>>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>>>> 2) the very small size of the promoted object >>>>> >>>>> Do such an promotion failure pattern ring a bell? It does not make sense >>>>> to me. >>>>> >>>>> Thanks, >>>>> Taras >>>>> >>>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>>>> wrote: >>>>>> Taras, >>>>>> >>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>> are confirming that eliminating large array allocations might help >>>>>>> here. Do you agree? >>>>>> I agree that eliminating the large array allocation will help but you >>>>>> are still having >>>>>> promotion failures when the allocation size is small (I think it was >>>>>> 1026). That >>>>>> says that you are filling up the old (cms) generation faster than the GC >>>>>> can >>>>>> collect it. 
The large arrays are aggrevating the problem but not >>>>>> necessarily >>>>>> the cause. >>>>>> >>>>>> If these are still your heap sizes, >>>>>> >>>>>>> -Xms5g >>>>>>> -Xmx5g >>>>>>> -Xmn400m >>>>>> Start by increasing the young gen size as may already have been >>>>>> suggested. If you have a test setup where you can experiment, >>>>>> try doubling the young gen size to start. >>>>>> >>>>>> If you have not seen this, it might be helpful. >>>>>> >>>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>>>> I'm not sure what to make of the concurrent mode >>>>>> The concurrent mode failure is a consequence of the promotion failure. >>>>>> Once the promotion failure happens the concurrent mode failure is >>>>>> inevitable. >>>>>> >>>>>> Jon >>>>>> >>>>>> >>>>>>> . >>>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>>>> Hi Jon, >>>>>>> >>>>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>>>> application yesterday. >>>>>>> The application is running on 4 (homogenous) nodes. >>>>>>> >>>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>>>> event during ParNew: >>>>>>> >>>>>>> node-002 >>>>>>> ------- >>>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>>>> sys=0.01, real=0.03 secs] >>>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>>>> sys=0.00, real=0.04 secs] >>>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>>>> promotion failure size = 4281460) (promotion failed): >>>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>>>> >>>>>>> node-003 >>>>>>> ------- >>>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>>>> sys=0.00, real=0.03 secs] >>>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>>>> sys=0.00, real=0.04 secs] >>>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>>>> promotion failure size = 1266955) (promotion failed): >>>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), 
>>>>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>>>> >>>>>>> node-004 >>>>>>> ------- >>>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>>>> sys=0.02, real=0.06 secs] >>>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>>>> sys=0.00, real=0.08 secs] >>>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>>>> promotion failure size = 2788662) (promotion failed): >>>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>>>> >>>>>>> On a fourth node, I've found a different event: promotion failure >>>>>>> during CMS, with a much smaller size: >>>>>>> >>>>>>> node-001 >>>>>>> ------- >>>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>>>> sys=0.01, real=0.06 secs] >>>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>>>> sys=0.13, real=0.26 secs] >>>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>>>> sys=0.05, real=0.12 secs] >>>>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081: >>>>>>> [ParNew 
(promotion failure size = 1026) (promotion failed): >>>>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>>>> 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>>>> sys=0.04, real=0.17 secs] >>>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>>>> secs]48434.474: [scrub symbol& string tables, 0.0088370 secs] [1 >>>>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>>>> secs] >>>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>>>> (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>>>> sys=0.00, real=8.61 secs] >>>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>>>> >>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>> are confirming that eliminating large array allocations might help >>>>>>> here. Do you agree? >>>>>>> I'm not sure what to make of the concurrent mode failure. >>>>>>> >>>>>>> Thanks in advance for any suggestions, >>>>>>> Taras >>>>>>> >>>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>>>> wrote: >>>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>>>> Hi Jon, >>>>>>>>> >>>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>>>> environment, but so far, no failures yet. >>>>>>>>> We know that the load we generate there is not representative. But >>>>>>>>> perhaps we'll catch something, given enough patience. >>>>>>>>> >>>>>>>>> The flag will also be enabled in our production environment next week >>>>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>>>> there. >>>>>>>>> >>>>>>>>> I've got two questions for now: >>>>>>>>> >>>>>>>>> 1) From googling around on the output to expect >>>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>>>> ------- >>>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) (promotion >>>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>>>> ------- >>>>>>>>> In that example line, what does the "0" stand for? >>>>>>>> It's the index of the GC worker thread that experienced the promotion >>>>>>>> failure. 
>>>>>>>> >>>>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>>>> application: >>>>>>>>> ------- >>>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>>>> sys=0.00, real=0.06 secs] >>>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>> CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>>>> 3432839K->3423755K(5201920 >>>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& string tables, >>>>>>>>> 0.0093940 >>>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>>>> real=7.61 secs] >>>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>>>> (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>>>> sys=0.00, real=10.29 secs] >>>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>>>> ------- >>>>>>>>> >>>>>>>>> In this case I don't know how to interpret the output. 
>>>>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>>>> to >>>>>>>> do recovery >>>>>>>> from the failure. >>>>>>>> >>>>>>>>> b) There's a full GC that took 14.08 secs >>>>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>>>> concurrent-sweep >>>>>>>> output but I think the "concurrent mode failure" message is part of >>>>>>>> the >>>>>>>> "Full GC" >>>>>>>> message. My guess is that the 10.29 is the time for the Full GC and >>>>>>>> the >>>>>>>> 14.08 >>>>>>>> maybe is part of the concurrent-sweep message. Really hard to be sure. >>>>>>>> >>>>>>>> Jon >>>>>>>>> How are these events, and their (real) times related to each other? >>>>>>>>> >>>>>>>>> Thanks in advance, >>>>>>>>> Taras >>>>>>>>> >>>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>>>> Masamitsu wrote: >>>>>>>>>> Taras, >>>>>>>>>> >>>>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>>>> way to identify the root of your promotion failures (or >>>>>>>>>> at least eliminating some possible causes). I think it >>>>>>>>>> would help focus the discussion if you could send >>>>>>>>>> the result of that experiment early. >>>>>>>>>> >>>>>>>>>> Jon >>>>>>>>>> >>>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>>>> experiencing occasional promotion failures. >>>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>>>> >>>>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>>>> >>>>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>>>> (b). >>>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>>>> this. >>>>>>>>>>> >>>>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>>>> 5-30 >>>>>>>>>>> MB range. >>>>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>>>> a >>>>>>>>>>> String/char[]). >>>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>>>> >>>>>>>>>>> My questions are: >>>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>>>> failure? >>>>>>>>>>> 2) If not, what's the next step of investigating the cause? >>>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>>>> a >>>>>>>>>>> feeling for the size of the objects that fail promotion. 
>>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>>>> case? >>>>>>>>>>> >>>>>>>>>>> Thanks in advance, >>>>>>>>>>> Taras >>>>>>>>>>> >>>>>>>>>>> a) Current JVM options: >>>>>>>>>>> -------------------------------- >>>>>>>>>>> -server >>>>>>>>>>> -Xms5g >>>>>>>>>>> -Xmx5g >>>>>>>>>>> -Xmn400m >>>>>>>>>>> -XX:PermSize=256m >>>>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>>>> -verbose:gc >>>>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>>>> -XX:+PrintGCDetails >>>>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>>>> -XX:+UseParNewGC >>>>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>>>> -Xloggc:gc.log >>>>>>>>>>> -------------------------------- >>>>>>>>>>> >>>>>>>>>>> b) Promotion failure during ParNew >>>>>>>>>>> -------------------------------- >>>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>>>> 3505808K->434291K >>>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>>>> -------------------------------- >>>>>>>>>>> >>>>>>>>>>> c) Promotion failure during CMS >>>>>>>>>>> -------------------------------- >>>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 >>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: 
user=0.24 >>>>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>>>> secs] >>>>>>>>>>> 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>>>> refs >>>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>>>> secs]703031.419: [scrub symbol& string tables, 0.0094960 >>>>>>>>>>> secs] [1 CMS >>>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>>>> (concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>>>> 10.9594300 >>>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>>>> 885552K->603017K(5201920K), 
>>>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>>>> -------------------------------- >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>>> _______________________________________________ >>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>> _______________________________________________ >>>>>> hotspot-gc-use mailing list >>>>>> hotspot-gc-use at openjdk.java.net >>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From taras.tielkes at gmail.com Thu Jan 26 12:22:14 2012 From: taras.tielkes at gmail.com (Taras Tielkes) Date: Thu, 26 Jan 2012 21:22:14 +0100 Subject: Promotion failures: indication of CMS fragmentation? In-Reply-To: <4F20F78D.9070905@oracle.com> References: <4EF9FCAC.3030208@oracle.com> <4F06A270.3010701@oracle.com> <4F0DBEC4.7040907@oracle.com> <4F1ECE7B.3040502@oracle.com> <4F1F2ED7.6060308@oracle.com> <4F20F78D.9070905@oracle.com> Message-ID: Hi Jon, Thanks for clearing up the word size question. Over the past weeks, I've seen promotion failures exceeding 10M words, so arrays of over 80 megabytes in size :) Today, we deployed a production release that eliminates the huge buffer allocations - instead data is processed in a streaming fashion. We'll see how things are performing after collecting operations data for a few weeks. Thanks for your help, Taras On Thu, Jan 26, 2012 at 7:49 AM, Jon Masamitsu wrote: > > > On 1/25/2012 2:41 AM, Taras Tielkes wrote: >> Hi Jon, >> >> At the risk of asking a stupid question, what's the word size on x64 >> when using CompressedOops? > > Word size is the same with and without CompressedOops (8 bytes). ?With > CompressedOops > we can just point to words with a 32bit reference (i.e., map the 32bit > reference to a full > 64bit address). > > Jon > >> Thanks, >> Taras >> >> On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu >> ?wrote: >>> On 01/24/12 10:15, Taras Tielkes wrote: >>>> Hi Jon, >>>> >>>> Xmx is 5g, PermGen is 256m, new is 400m. >>>> >>>> The overall tenured gen usage is at the point when I would expect the >>>> CMS to kick in though. 
>>>> Does this mean I'd have to lower the CMS initiating occupancy setting >>>> (currently at 68%)? >>> I don't have any quick answers as to what to try next. >>> >>>> In addition, are the promotion failure sizes expressed in bytes? If >>>> so, I'm surprised to see such odd-sized (7, for example) sizes. >>> It's in words. >>> >>> Jon >>>> Thanks, >>>> Taras >>>> >>>> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu ? ?wrote: >>>>> Taras, >>>>> >>>>> The pattern makes sense if the tenured (cms) gen is in fact full. >>>>> Multiple ?GC workers try to get a chunk of space for >>>>> an allocation and there is no space. >>>>> >>>>> Jon >>>>> >>>>> >>>>> On 01/24/12 04:22, Taras Tielkes wrote: >>>>>> Hi Jon, >>>>>> >>>>>> While inspecting our production logs for promotion failures, I saw the >>>>>> following one today: >>>>>> -------- >>>>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew: >>>>>> 349623K->20008K(368640K), 0.0294350 secs] >>>>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21 >>>>>> sys=0.00, real=0.03 secs] >>>>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew: >>>>>> 347688K->40960K(368640K), 0.0536700 secs] >>>>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36 >>>>>> sys=0.00, real=0.05 secs] >>>>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew >>>>>> (0: promotion failure size = 6) ?(1: promotion failure size = 6) ?(2: >>>>>> promotion failure size = 7) ?(3: promotion failure size = 7) ?(4: >>>>>> promotion failure size = 9) ?(5: p >>>>>> romotion failure size = 9) ?(6: promotion failure size = 6) ?(7: >>>>>> promotion failure size = 9) ?(promotion failed): >>>>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS: >>>>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K( >>>>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs] >>>>>> [Times: user=10.17 sys=1.10, real=9.11 secs] >>>>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew: >>>>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K), >>>>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs] >>>>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew: >>>>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K), >>>>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>> -------- >>>>>> >>>>>> It's different from the others in two ways: >>>>>> 1) a "parallel" promotion failure in all 8 ParNew threads? >>>>>> 2) the very small size of the promoted object >>>>>> >>>>>> Do such an promotion failure pattern ring a bell? It does not make sense >>>>>> to me. >>>>>> >>>>>> Thanks, >>>>>> Taras >>>>>> >>>>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu >>>>>> ? ?wrote: >>>>>>> Taras, >>>>>>> >>>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>>> are confirming that eliminating large array allocations might help >>>>>>>> here. Do you agree? >>>>>>> I agree that eliminating the large array allocation will help but you >>>>>>> are still having >>>>>>> promotion failures when the allocation size is small (I think it was >>>>>>> 1026). ?That >>>>>>> says that you are filling up the old (cms) generation faster than the GC >>>>>>> can >>>>>>> collect it. ?The large arrays are aggrevating the problem but not >>>>>>> necessarily >>>>>>> the cause. 
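Since the failure sizes above are reported in heap words, and a heap word on x64 is 8 bytes with or without CompressedOops, the values in the logs convert to bytes by multiplying by 8: the 6-to-9-word failures in the parallel ParNew case are tiny objects, while the multi-million-word failures elsewhere in this thread correspond to tens of megabytes. A minimal sketch of that conversion follows; the class name and the hard-coded sample values (taken from the GC logs quoted in this thread) are for illustration only.
--------------------------------
// Convert "promotion failure size" values (reported in heap words) to
// bytes/megabytes on a 64-bit HotSpot JVM, where one heap word is 8 bytes
// regardless of CompressedOops.
public class PromotionFailureSize {
    private static final long BYTES_PER_WORD = 8;

    public static void main(String[] args) {
        // Sample failure sizes (in words) taken from the logs in this thread.
        long[] sizesInWords = {6L, 1026L, 1266955L, 2788662L, 4281460L};
        for (long words : sizesInWords) {
            long bytes = words * BYTES_PER_WORD;
            System.out.printf("%,d words = %,d bytes (~%.2f MB)%n",
                    words, bytes, bytes / (1024.0 * 1024.0));
        }
    }
}
--------------------------------
Sizes in the tens of megabytes point at the large byte[]/char[] XML buffers, while the single-digit sizes only fail because the old generation has essentially no free space at all, which matches Jon's reading above.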
>>>>>>> >>>>>>> If these are still your heap sizes, >>>>>>> >>>>>>>> -Xms5g >>>>>>>> -Xmx5g >>>>>>>> -Xmn400m >>>>>>> Start by increasing the young gen size as may already have been >>>>>>> suggested. ?If you have a test setup where you can experiment, >>>>>>> try doubling the young gen size to start. >>>>>>> >>>>>>> If you have not seen this, it might be helpful. >>>>>>> >>>>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a >>>>>>>> I'm not sure what to make of the concurrent mode >>>>>>> The concurrent mode failure is a consequence of the promotion failure. >>>>>>> Once the promotion failure happens the concurrent mode failure is >>>>>>> inevitable. >>>>>>> >>>>>>> Jon >>>>>>> >>>>>>> >>>>>>>> . >>>>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote: >>>>>>>> Hi Jon, >>>>>>>> >>>>>>>> We've added the -XX:+PrintPromotionFailure flag to our production >>>>>>>> application yesterday. >>>>>>>> The application is running on 4 (homogenous) nodes. >>>>>>>> >>>>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure >>>>>>>> event during ParNew: >>>>>>>> >>>>>>>> node-002 >>>>>>>> ------- >>>>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew: >>>>>>>> 357592K->23382K(368640K), 0.0298150 secs] >>>>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22 >>>>>>>> sys=0.01, real=0.03 secs] >>>>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew: >>>>>>>> 351062K->39795K(368640K), 0.0401170 secs] >>>>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28 >>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4: >>>>>>>> promotion failure size = 4281460) ?(promotion failed): >>>>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS: >>>>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K >>>>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590 >>>>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs] >>>>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K), >>>>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs] >>>>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K), >>>>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs] >>>>>>>> >>>>>>>> node-003 >>>>>>>> ------- >>>>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew: >>>>>>>> 346950K->21342K(368640K), 0.0333090 secs] >>>>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23 >>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew: >>>>>>>> 345070K->32211K(368640K), 0.0369260 secs] >>>>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25 >>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0: >>>>>>>> promotion failure size = 1266955) ?(promotion failed): >>>>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS: >>>>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3 >>>>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640 >>>>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs] >>>>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K), >>>>>>>> 0.0489930 
secs] [Times: user=0.26 sys=0.00, real=0.05 secs] >>>>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew: >>>>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K), >>>>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs] >>>>>>>> >>>>>>>> node-004 >>>>>>>> ------- >>>>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew: >>>>>>>> 358429K->40960K(368640K), 0.0629910 secs] >>>>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40 >>>>>>>> sys=0.02, real=0.06 secs] >>>>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.0819780 secs] >>>>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40 >>>>>>>> sys=0.00, real=0.08 secs] >>>>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6: >>>>>>>> promotion failure size = 2788662) ?(promotion failed): >>>>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS: >>>>>>>> 3310044K->330922K(4833280K), 4.5104170 secs] >>>>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)], >>>>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs] >>>>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew: >>>>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K), >>>>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs] >>>>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew: >>>>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K), >>>>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs] >>>>>>>> >>>>>>>> On a fourth node, I've found a different event: promotion failure >>>>>>>> during CMS, with a much smaller size: >>>>>>>> >>>>>>>> node-001 >>>>>>>> ------- >>>>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew: >>>>>>>> 354039K->40960K(368640K), 0.0667340 secs] >>>>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37 >>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.2586390 secs] >>>>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73 >>>>>>>> sys=0.13, real=0.26 secs] >>>>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark: >>>>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times: >>>>>>>> user=0.07 sys=0.00, real=0.07 secs] >>>>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start] >>>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark: >>>>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs] >>>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start] >>>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean: >>>>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs] >>>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: >>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew: >>>>>>>> 368640K->40960K(368640K), 0.1214420 secs] >>>>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66 >>>>>>>> sys=0.05, real=0.12 secs] >>>>>>>> 2012-01-10T18:30:12.785+0100: 48434.078: >>>>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times: >>>>>>>> user=10.72 sys=0.48, real=2.70 secs] >>>>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K >>>>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 
48434.081: [GC 48434.081: >>>>>>>> [ParNew (promotion failure size = 1026) ?(promotion failed): >>>>>>>> 206521K->206521K(368640K), 0.1667280 secs] >>>>>>>> ? ? 3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48 >>>>>>>> sys=0.04, real=0.17 secs] >>>>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs >>>>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750 >>>>>>>> secs]48434.474: [scrub symbol& ? ? ? ?string tables, 0.0088370 secs] [1 >>>>>>>> CMS-remark: 3489675K(4833280K)] 36961 >>>>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41 >>>>>>>> secs] >>>>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start] >>>>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720: >>>>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep: >>>>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs] >>>>>>>> ? ? (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050 >>>>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm : >>>>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61 >>>>>>>> sys=0.00, real=8.61 secs] >>>>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew: >>>>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K), >>>>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs] >>>>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew: >>>>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K), >>>>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs] >>>>>>>> >>>>>>>> I assume that the large sizes for the promotion failures during ParNew >>>>>>>> are confirming that eliminating large array allocations might help >>>>>>>> here. Do you agree? >>>>>>>> I'm not sure what to make of the concurrent mode failure. >>>>>>>> >>>>>>>> Thanks in advance for any suggestions, >>>>>>>> Taras >>>>>>>> >>>>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu >>>>>>>> ? ? ?wrote: >>>>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote: >>>>>>>>>> Hi Jon, >>>>>>>>>> >>>>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod >>>>>>>>>> environment, but so far, no failures yet. >>>>>>>>>> We know that the load we generate there is not representative. But >>>>>>>>>> perhaps we'll catch something, given enough patience. >>>>>>>>>> >>>>>>>>>> The flag will also be enabled in our production environment next week >>>>>>>>>> - so one way or the other, we'll get more diagnostic data soon. >>>>>>>>>> I'll also do some allocation profiling of the application in isolation >>>>>>>>>> - I know that there is abusive large byte[] and char[] allocation in >>>>>>>>>> there. >>>>>>>>>> >>>>>>>>>> I've got two questions for now: >>>>>>>>>> >>>>>>>>>> 1) From googling around on the output to expect >>>>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html), >>>>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this: >>>>>>>>>> ------- >>>>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698) ?(promotion >>>>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs] >>>>>>>>>> ------- >>>>>>>>>> In that example line, what does the "0" stand for? >>>>>>>>> It's the index of the GC worker thread ?that experienced the promotion >>>>>>>>> failure. 
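Putting the two answers together, the leading number in such an entry is the index of the ParNew GC worker thread and the size that follows is in heap words. Below is a small, hypothetical sketch of pulling those two fields out of -XX:+PrintPromotionFailure output; the class name, the regular expression and the reuse of the example line quoted above are assumptions for illustration, not part of any HotSpot tooling.
--------------------------------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the worker-thread index and the failure size (in heap words)
// from -XX:+PrintPromotionFailure log fragments of the form
// "(0: promotion failure size = 2698)".
public class PromotionFailureGrep {
    private static final Pattern FAILURE =
            Pattern.compile("\\((\\d+): promotion failure size = (\\d+)\\)");

    public static void main(String[] args) {
        String line = "592.079: [ParNew (0: promotion failure size = 2698) "
                + "(promotion failed): 135865K->134943K(138240K), 0.1433555 secs]";
        Matcher m = FAILURE.matcher(line);
        while (m.find()) {
            int worker = Integer.parseInt(m.group(1));   // ParNew worker thread index
            long sizeWords = Long.parseLong(m.group(2)); // failure size in heap words
            System.out.printf("worker %d failed to promote %d words (%d bytes on x64)%n",
                    worker, sizeWords, sizeWords * 8);
        }
    }
}
--------------------------------
Run over a whole gc.log, this kind of extraction gives a quick per-worker distribution of failure sizes and makes it easy to separate the multi-megabyte array promotions from the small ones.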
>>>>>>>>> >>>>>>>>>> 2) Below is a snippet of (real) gc log from our production >>>>>>>>>> application: >>>>>>>>>> ------- >>>>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew: >>>>>>>>>> 345951K->40960K(368640K), 0.0676780 secs] >>>>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36 >>>>>>>>>> sys=0.01, real=0.06 secs] >>>>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew: >>>>>>>>>> 368640K->40959K(368640K), 0.0618880 secs] >>>>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31 >>>>>>>>>> sys=0.00, real=0.06 secs] >>>>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark: >>>>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times: >>>>>>>>>> user=0.04 sys=0.00, real=0.04 secs] >>>>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start] >>>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark: >>>>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs] >>>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: >>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean: >>>>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] >>>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001: >>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>> ? ? ?CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100: >>>>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs] >>>>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs] >>>>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K >>>>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242: >>>>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs] >>>>>>>>>> 3432839K->3423755K(5201920 >>>>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs] >>>>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak >>>>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading, >>>>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol& ? ? ? ? ?string tables, >>>>>>>>>> 0.0093940 >>>>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K >>>>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10, >>>>>>>>>> real=7.61 secs] >>>>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850: >>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908: >>>>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep: >>>>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs] >>>>>>>>>> ? ? ?(concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040 >>>>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm : >>>>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29 >>>>>>>>>> sys=0.00, real=10.29 secs] >>>>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew: >>>>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K), >>>>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] >>>>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew: >>>>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K), >>>>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs] >>>>>>>>>> ------- >>>>>>>>>> >>>>>>>>>> In this case I don't know how to interpret the output. 
>>>>>>>>>> a) There's a promotion failure that took 7.49 secs >>>>>>>>> This is the time it took to attempt the minor collection (ParNew) and >>>>>>>>> to >>>>>>>>> do recovery >>>>>>>>> from the failure. >>>>>>>>> >>>>>>>>>> b) There's a full GC that took 14.08 secs >>>>>>>>>> c) There's a concurrent mode failure that took 10.29 secs >>>>>>>>> Not sure about b) and c) because the output is mixed up with the >>>>>>>>> concurrent-sweep >>>>>>>>> output but ?I think the "concurrent mode failure" message is part of >>>>>>>>> the >>>>>>>>> "Full GC" >>>>>>>>> message. ?My guess is that the 10.29 is the time for the Full GC and >>>>>>>>> the >>>>>>>>> 14.08 >>>>>>>>> maybe is part of the concurrent-sweep message. ?Really hard to be sure. >>>>>>>>> >>>>>>>>> Jon >>>>>>>>>> How are these events, and their (real) times related to each other? >>>>>>>>>> >>>>>>>>>> Thanks in advance, >>>>>>>>>> Taras >>>>>>>>>> >>>>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon >>>>>>>>>> Masamitsu ? ? ? ? ?wrote: >>>>>>>>>>> Taras, >>>>>>>>>>> >>>>>>>>>>> PrintPromotionFailure seems like it would go a long >>>>>>>>>>> way to identify the root of your promotion failures (or >>>>>>>>>>> at least eliminating some possible causes). ? ?I think it >>>>>>>>>>> would help focus the discussion if you could send >>>>>>>>>>> the result of that experiment early. >>>>>>>>>>> >>>>>>>>>>> Jon >>>>>>>>>>> >>>>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote: >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> We're running an application with the CMS/ParNew collectors that is >>>>>>>>>>>> experiencing occasional promotion failures. >>>>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode. >>>>>>>>>>>> I've listed the specific JVM options used below (a). >>>>>>>>>>>> >>>>>>>>>>>> The application is deployed across a handful of machines, and the >>>>>>>>>>>> promotion failures are fairly uniform across those. >>>>>>>>>>>> >>>>>>>>>>>> The first kind of failure we observe is a promotion failure during >>>>>>>>>>>> ParNew collection, I've included a snipped from the gc log below >>>>>>>>>>>> (b). >>>>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps >>>>>>>>>>>> triggered by the same cause), see (c) below. >>>>>>>>>>>> The frequency (after running for a some weeks) is approximately once >>>>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on >>>>>>>>>>>> this. >>>>>>>>>>>> >>>>>>>>>>>> Apart from high-volume request handling (which allocates a lot of >>>>>>>>>>>> small objects), the application also runs a few dozen background >>>>>>>>>>>> threads that download and process XML documents, typically in the >>>>>>>>>>>> 5-30 >>>>>>>>>>>> MB range. >>>>>>>>>>>> A known deficiency in the existing code is that the XML content is >>>>>>>>>>>> copied twice before processing (once to a byte[], and later again to >>>>>>>>>>>> a >>>>>>>>>>>> String/char[]). >>>>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB >>>>>>>>>>>> java.lang.String/char[], my suspicion is that these big array >>>>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue. >>>>>>>>>>>> >>>>>>>>>>>> My questions are: >>>>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to >>>>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion >>>>>>>>>>>> failure? >>>>>>>>>>>> 2) If not, what's the next step of investigating the cause? 
>>>>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get >>>>>>>>>>>> a >>>>>>>>>>>> feeling for the size of the objects that fail promotion. >>>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only >>>>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the >>>>>>>>>>>> case? >>>>>>>>>>>> >>>>>>>>>>>> Thanks in advance, >>>>>>>>>>>> Taras >>>>>>>>>>>> >>>>>>>>>>>> a) Current JVM options: >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> -server >>>>>>>>>>>> -Xms5g >>>>>>>>>>>> -Xmx5g >>>>>>>>>>>> -Xmn400m >>>>>>>>>>>> -XX:PermSize=256m >>>>>>>>>>>> -XX:MaxPermSize=256m >>>>>>>>>>>> -XX:+PrintGCTimeStamps >>>>>>>>>>>> -verbose:gc >>>>>>>>>>>> -XX:+PrintGCDateStamps >>>>>>>>>>>> -XX:+PrintGCDetails >>>>>>>>>>>> -XX:SurvivorRatio=8 >>>>>>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>>>>>> -XX:+UseParNewGC >>>>>>>>>>>> -XX:+DisableExplicitGC >>>>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly >>>>>>>>>>>> -XX:+CMSClassUnloadingEnabled >>>>>>>>>>>> -XX:+CMSScavengeBeforeRemark >>>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68 >>>>>>>>>>>> -Xloggc:gc.log >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> b) Promotion failure during ParNew >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew: >>>>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs] >>>>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39 >>>>>>>>>>>> sys=0.01, real=0.07 secs] >>>>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew: >>>>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs] >>>>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28 >>>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew: >>>>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs] >>>>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19 >>>>>>>>>>>> sys=0.00, real=0.03 secs] >>>>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew >>>>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200 >>>>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs] >>>>>>>>>>>> 3505808K->434291K >>>>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs] >>>>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs] >>>>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew: >>>>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs] >>>>>>>>>>>> 761971K->514584K(5201920K), >>>>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs] >>>>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs] >>>>>>>>>>>> 842264K->625681K(5201920K), >>>>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs] >>>>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs] >>>>>>>>>>>> 953361K->684121K(5201920K), >>>>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs] >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> c) Promotion failure during CMS >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew: >>>>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs] >>>>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37 
>>>>>>>>>>>> sys=0.00, real=0.05 secs] >>>>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew: >>>>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs] >>>>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24 >>>>>>>>>>>> sys=0.01, real=0.05 secs] >>>>>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew: >>>>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs] >>>>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30 >>>>>>>>>>>> sys=0.00, real=0.04 secs] >>>>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark: >>>>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times: >>>>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs] >>>>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529: >>>>>>>>>>>> [CMS-concurrent-mark-start] >>>>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs] >>>>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50 >>>>>>>>>>>> sys=0.01, real=0.08 secs] >>>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark: >>>>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs] >>>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: >>>>>>>>>>>> [CMS-concurrent-preclean-start] >>>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean: >>>>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs] >>>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: >>>>>>>>>>>> [CMS-concurrent-abortable-preclean-start] >>>>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239: >>>>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times: >>>>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs] >>>>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K >>>>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244: >>>>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 >>>>>>>>>>>> secs] >>>>>>>>>>>> ? ? ? 3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00 >>>>>>>>>>>> sys=2.58, real=9.88 secs] >>>>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak >>>>>>>>>>>> refs >>>>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610 >>>>>>>>>>>> secs]703031.419: [scrub symbol& ? ? ? ? ? ?string tables, 0.0094960 >>>>>>>>>>>> secs] [1 CMS >>>>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs] >>>>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs] >>>>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436: >>>>>>>>>>>> [CMS-concurrent-sweep-start] >>>>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493: >>>>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep: >>>>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs] >>>>>>>>>>>> ? ? ? 
(concurrent mode failure): 3370829K->433437K(4833280K), >>>>>>>>>>>> 10.9594300 >>>>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm : >>>>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95 >>>>>>>>>>>> sys=0.00, real=10.96 secs] >>>>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew: >>>>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs] >>>>>>>>>>>> 761117K->517836K(5201920K), >>>>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs] >>>>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs] >>>>>>>>>>>> 845516K->557872K(5201920K), >>>>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs] >>>>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew: >>>>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs] >>>>>>>>>>>> 885552K->603017K(5201920K), >>>>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs] >>>>>>>>>>>> -------------------------------- >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>>> _______________________________________________ >>>>>>>>>> hotspot-gc-use mailing list >>>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>>> _______________________________________________ >>>>>>>>> hotspot-gc-use mailing list >>>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>>> _______________________________________________ >>>>>>>> hotspot-gc-use mailing list >>>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>>>>> _______________________________________________ >>>>>>> hotspot-gc-use mailing list >>>>>>> hotspot-gc-use at openjdk.java.net >>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> _______________________________________________ >>>> hotspot-gc-use mailing list >>>> hotspot-gc-use at openjdk.java.net >>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
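The change Taras describes in the closing message of this thread, replacing the byte[]-then-String/char[] double copy of the 5-30 MB XML documents with streaming processing, could look roughly like the StAX sketch below. It is only an illustrative outline of that approach, not the application's actual code; the element name "record" and the handleText() hook are invented for the example.
--------------------------------
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Parse the XML directly from the download stream with StAX instead of
// first copying it into a byte[] and then into a String/char[].
public class StreamingXmlProcessor {

    public void process(InputStream xmlStream) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(xmlStream);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // getElementText() only buffers one element's content,
                    // never the whole multi-megabyte document.
                    handleText(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }

    private void handleText(String text) {
        // Application-specific processing of one element's text goes here.
    }
}
--------------------------------
Because nothing larger than one element's text is ever materialised, the multi-megabyte char[] allocations that showed up as huge promotion failures in the logs above are avoided altogether.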
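If promotion failures were to persist after that change, Jon's earlier suggestion in this thread was to try doubling the young generation before adjusting anything else. Relative to the options quoted above, that experiment is a one-line change; the list below is only an illustrative starting point for testing, not validated production settings, and every other flag stays as listed in the thread.
--------------------------------
-Xms5g
-Xmx5g
-Xmn800m
-XX:SurvivorRatio=8
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=68
-XX:+PrintPromotionFailure
--------------------------------
Here -Xmn800m doubles the original -Xmn400m; the remaining flags are unchanged from the configuration quoted earlier in the thread.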