Promotion failures: indication of CMS fragmentation?

Fri Jan 6 07:27:44 UTC 2012

On 1/5/2012 3:32 PM, Taras Tielkes wrote:
> Hi Jon,
>
> We've enabled the PrintPromotionFailure flag in our preprod
> environment, but so far, no failures yet.
> We know that the load we generate there is not representative. But
> perhaps we'll catch something, given enough patience.
>
> The flag will also be enabled in our production environment next week
> - so one way or the other, we'll get more diagnostic data soon.
> I'll also do some allocation profiling of the application in isolation
> - I know that there is abusive large byte[] and char[] allocation in
> there.
>
> I've got two questions for now:
>
> 1) From googling around on the output to expect
> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html),
> I see that -XX:+PrintPromotionFailure will generate output like this:
> -------
> 592.079: [ParNew (0: promotion failure size = 2698)  (promotion
> failed): 135865K->134943K(138240K), 0.1433555 secs]
> -------
> In that example line, what does the "0" stand for?

It's the index of the GC worker thread  that experienced the promotion 
failure.

> 2) Below is a snippet of (real) gc log from our production application:
> -------
> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew:
> 345951K->40960K(368640K), 0.0676780 secs]
> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36
> sys=0.01, real=0.06 secs]
> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew:
> 368640K->40959K(368640K), 0.0618880 secs]
> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31
> sys=0.00, real=0.06 secs]
> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark:
> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times:
> user=0.04 sys=0.00, real=0.04 secs]
> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start]
> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark:
> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs]
> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-preclean-start]
> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean:
> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
> 2011-12-30T22:42:24.099+0100: 2136593.001:
> [CMS-concurrent-abortable-preclean-start]
>   CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100:
> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs]
> [Times: user=5.70 sys=0.23, real=5.23 secs]
> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K
> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242:
> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs]
> 3432839K->3423755K(5201920
> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs]
> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak
> refs processing, 0.0034280 secs]2136605.804: [class unloading,
> 0.0289480 secs]2136605.833: [scrub symbol&  string tables, 0.0093940
> secs] [1 CMS-remark: 3318289K(4833280K
> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10,
> real=7.61 secs]
> 2011-12-30T22:42:36.949+0100: 2136605.850: [CMS-concurrent-sweep-start]
> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908:
> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep:
> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs]
>   (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040
> secs] 3491471K->291853K(5201920K), [CMS Perm :
> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29
> sys=0.00, real=10.29 secs]
> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew:
> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K),
> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew:
> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K),
> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs]
> -------
>
> In this case I don't know how to interpret the output.
> a) There's a promotion failure that took 7.49 secs
This is the time it took to attempt the minor collection (ParNew) and to 
do recovery
from the failure.

> b) There's a full GC that took 14.08 secs
> c) There's a concurrent mode failure that took 10.29 secs

Not sure about b) and c) because the output is mixed up with the 
concurrent-sweep
output but  I think the "concurrent mode failure" message is part of the 
"Full GC"
message.  My guess is that the 10.29 is the time for the Full GC and the 
14.08
maybe is part of the concurrent-sweep message.  Really hard to be sure.

Jon
> How are these events, and their (real) times related to each other?
>
> Thanks in advance,
> Taras
>
> On Tue, Dec 27, 2011 at 6:13 PM, Jon Masamitsu<jon.masamitsu at oracle.com>  wrote:
>> Taras,
>>
>> PrintPromotionFailure seems like it would go a long
>> way to identify the root of your promotion failures (or
>> at least eliminating some possible causes).    I think it
>> would help focus the discussion if you could send
>> the result of that experiment early.
>>
>> Jon
>>
>> On 12/27/2011 5:07 AM, Taras Tielkes wrote:
>>> Hi,
>>>
>>> We're running an application with the CMS/ParNew collectors that is
>>> experiencing occasional promotion failures.
>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode.
>>> I've listed the specific JVM options used below (a).
>>>
>>> The application is deployed across a handful of machines, and the
>>> promotion failures are fairly uniform across those.
>>>
>>> The first kind of failure we observe is a promotion failure during
>>> ParNew collection, I've included a snipped from the gc log below (b).
>>> The second kind of failure is a concurrrent mode failure (perhaps
>>> triggered by the same cause), see (c) below.
>>> The frequency (after running for a some weeks) is approximately once
>>> per day. This is bearable, but obviously we'd like to improve on this.
>>>
>>> Apart from high-volume request handling (which allocates a lot of
>>> small objects), the application also runs a few dozen background
>>> threads that download and process XML documents, typically in the 5-30
>>> MB range.
>>> A known deficiency in the existing code is that the XML content is
>>> copied twice before processing (once to a byte[], and later again to a
>>> String/char[]).
>>> Given that a 30 MB XML stream will result in a 60 MB
>>> java.lang.String/char[], my suspicion is that these big array
>>> allocations are causing us to run into the CMS fragmentation issue.
>>>
>>> My questions are:
>>> 1) Does the data from the GC logs provide sufficient evidence to
>>> conclude that CMS fragmentation is the cause of the promotion failure?
>>> 2) If not, what's the next step of investigating the cause?
>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get a
>>> feeling for the size of the objects that fail promotion.
>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only
>>> reliable approach to diagnose CMS fragmentation. Is this indeed the
>>> case?
>>>
>>> Thanks in advance,
>>> Taras
>>>
>>> a) Current JVM options:
>>> --------------------------------
>>> -server
>>> -Xms5g
>>> -Xmx5g
>>> -Xmn400m
>>> -XX:PermSize=256m
>>> -XX:MaxPermSize=256m
>>> -XX:+PrintGCTimeStamps
>>> -verbose:gc
>>> -XX:+PrintGCDateStamps
>>> -XX:+PrintGCDetails
>>> -XX:SurvivorRatio=8
>>> -XX:+UseConcMarkSweepGC
>>> -XX:+UseParNewGC
>>> -XX:+DisableExplicitGC
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+CMSClassUnloadingEnabled
>>> -XX:+CMSScavengeBeforeRemark
>>> -XX:CMSInitiatingOccupancyFraction=68
>>> -Xloggc:gc.log
>>> --------------------------------
>>>
>>> b) Promotion failure during ParNew
>>> --------------------------------
>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew:
>>> 368640K->40959K(368640K), 0.0693460 secs]
>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39
>>> sys=0.01, real=0.07 secs]
>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew:
>>> 368639K->31321K(368640K), 0.0511400 secs]
>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28
>>> sys=0.00, real=0.05 secs]
>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew:
>>> 359001K->18694K(368640K), 0.0272970 secs]
>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19
>>> sys=0.00, real=0.03 secs]
>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew
>>> (promotion failed): 338813K->361078K(368640K), 0.1321200
>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs]
>>> 3505808K->434291K
>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs]
>>> [Times: user=5.24 sys=0.00, real=5.02 secs]
>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew:
>>> 327680K->40960K(368640K), 0.0949460 secs] 761971K->514584K(5201920K),
>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs]
>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew:
>>> 368640K->40960K(368640K), 0.1299190 secs] 842264K->625681K(5201920K),
>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs]
>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew:
>>> 368640K->40960K(368640K), 0.0870940 secs] 953361K->684121K(5201920K),
>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs]
>>> --------------------------------
>>>
>>> c) Promotion failure during CMS
>>> --------------------------------
>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew:
>>> 357228K->40960K(368640K), 0.0525110 secs]
>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37
>>> sys=0.00, real=0.05 secs]
>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew:
>>> 366075K->37119K(368640K), 0.0479780 secs]
>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24
>>> sys=0.01, real=0.05 secs]
>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew:
>>> 364792K->40960K(368640K), 0.0421740 secs]
>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30
>>> sys=0.00, real=0.04 secs]
>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark:
>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times:
>>> user=0.02 sys=0.00, real=0.03 secs]
>>> 2011-12-14T08:29:29.628+0100: 703018.529: [CMS-concurrent-mark-start]
>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew:
>>> 368640K->40960K(368640K), 0.0836690 secs]
>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50
>>> sys=0.01, real=0.08 secs]
>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark:
>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs]
>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-preclean-start]
>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean:
>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs]
>>> 2011-12-14T08:29:30.938+0100: 703019.840:
>>> [CMS-concurrent-abortable-preclean-start]
>>> 2011-12-14T08:29:32.337+0100: 703021.239:
>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times:
>>> user=6.68 sys=0.27, real=1.40 secs]
>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K
>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244:
>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020 secs]
>>>    3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00
>>> sys=2.58, real=9.88 secs]
>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak refs
>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610
>>> secs]703031.419: [scrub symbol&    string tables, 0.0094960 secs] [1 CMS
>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs]
>>> [Times: user=13.73 sys=2.59, real=10.19 secs]
>>> 2011-12-14T08:29:42.535+0100: 703031.436: [CMS-concurrent-sweep-start]
>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493:
>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep:
>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs]
>>>    (concurrent mode failure): 3370829K->433437K(4833280K), 10.9594300
>>> secs] 3739469K->433437K(5201920K), [CMS Perm :
>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95
>>> sys=0.00, real=10.96 secs]
>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew:
>>> 327680K->40960K(368640K), 0.0799960 secs] 761117K->517836K(5201920K),
>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs]
>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew:
>>> 368640K->40960K(368640K), 0.0784460 secs] 845516K->557872K(5201920K),
>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew:
>>> 368640K->40960K(368640K), 0.0784040 secs] 885552K->603017K(5201920K),
>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs]
>>> --------------------------------
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use