Promotion failures: indication of CMS fragmentation?

Tue Jan 24 10:15:39 PST 2012

Hi Jon,

Xmx is 5g, PermGen is 256m, new is 400m.

The overall tenured gen usage is at the point when I would expect the
CMS to kick in though.
Does this mean I'd have to lower the CMS initiating occupancy setting
(currently at 68%)?

In addition, are the promotion failure sizes expressed in bytes? If
so, I'm surprised to see such odd-sized (7, for example) sizes.

Thanks,
Taras

On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>
> Taras,
>
> The pattern makes sense if the tenured (cms) gen is in fact full.
> Multiple  GC workers try to get a chunk of space for
> an allocation and there is no space.
>
> Jon
>
>
> On 01/24/12 04:22, Taras Tielkes wrote:
>>
>> Hi Jon,
>>
>> While inspecting our production logs for promotion failures, I saw the
>> following one today:
>> --------
>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew:
>> 349623K->20008K(368640K), 0.0294350 secs]
>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21
>> sys=0.00, real=0.03 secs]
>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew:
>> 347688K->40960K(368640K), 0.0536700 secs]
>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36
>> sys=0.00, real=0.05 secs]
>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew
>> (0: promotion failure size = 6)  (1: promotion failure size = 6)  (2:
>> promotion failure size = 7)  (3: promotion failure size = 7)  (4:
>> promotion failure size = 9)  (5: p
>> romotion failure size = 9)  (6: promotion failure size = 6)  (7:
>> promotion failure size = 9)  (promotion failed):
>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS:
>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K(
>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs]
>> [Times: user=10.17 sys=1.10, real=9.11 secs]
>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew:
>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K),
>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs]
>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew:
>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K),
>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>> --------
>>
>> It's different from the others in two ways:
>> 1) a "parallel" promotion failure in all 8 ParNew threads?
>> 2) the very small size of the promoted object
>>
>> Do such an promotion failure pattern ring a bell? It does not make sense
>> to me.
>>
>> Thanks,
>> Taras
>>
>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu<jon.masamitsu at oracle.com>
>>  wrote:
>>>
>>> Taras,
>>>
>>>> I assume that the large sizes for the promotion failures during ParNew
>>>> are confirming that eliminating large array allocations might help
>>>> here. Do you agree?
>>>
>>> I agree that eliminating the large array allocation will help but you
>>> are still having
>>> promotion failures when the allocation size is small (I think it was
>>> 1026).  That
>>> says that you are filling up the old (cms) generation faster than the GC
>>> can
>>> collect it.  The large arrays are aggrevating the problem but not
>>> necessarily
>>> the cause.
>>>
>>> If these are still your heap sizes,
>>>
>>>> -Xms5g
>>>> -Xmx5g
>>>> -Xmn400m
>>>
>>> Start by increasing the young gen size as may already have been
>>> suggested.  If you have a test setup where you can experiment,
>>> try doubling the young gen size to start.
>>>
>>> If you have not seen this, it might be helpful.
>>>
>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a
>>>>
>>>> I'm not sure what to make of the concurrent mode
>>>
>>> The concurrent mode failure is a consequence of the promotion failure.
>>> Once the promotion failure happens the concurrent mode failure is
>>> inevitable.
>>>
>>> Jon
>>>
>>>
>>>> .
>>>
>>>
>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote:
>>>>
>>>> Hi Jon,
>>>>
>>>> We've added the -XX:+PrintPromotionFailure flag to our production
>>>> application yesterday.
>>>> The application is running on 4 (homogenous) nodes.
>>>>
>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure
>>>> event during ParNew:
>>>>
>>>> node-002
>>>> -------
>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew:
>>>> 357592K->23382K(368640K), 0.0298150 secs]
>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22
>>>> sys=0.01, real=0.03 secs]
>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew:
>>>> 351062K->39795K(368640K), 0.0401170 secs]
>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28
>>>> sys=0.00, real=0.04 secs]
>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4:
>>>> promotion failure size = 4281460)  (promotion failed):
>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS:
>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K
>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590
>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs]
>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew:
>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K),
>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs]
>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew:
>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K),
>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs]
>>>>
>>>> node-003
>>>> -------
>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew:
>>>> 346950K->21342K(368640K), 0.0333090 secs]
>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23
>>>> sys=0.00, real=0.03 secs]
>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew:
>>>> 345070K->32211K(368640K), 0.0369260 secs]
>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25
>>>> sys=0.00, real=0.04 secs]
>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0:
>>>> promotion failure size = 1266955)  (promotion failed):
>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS:
>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3
>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640
>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs]
>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew:
>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K),
>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs]
>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew:
>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K),
>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs]
>>>>
>>>> node-004
>>>> -------
>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew:
>>>> 358429K->40960K(368640K), 0.0629910 secs]
>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40
>>>> sys=0.02, real=0.06 secs]
>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew:
>>>> 368640K->40960K(368640K), 0.0819780 secs]
>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40
>>>> sys=0.00, real=0.08 secs]
>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6:
>>>> promotion failure size = 2788662)  (promotion failed):
>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS:
>>>> 3310044K->330922K(4833280K), 4.5104170 secs]
>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)],
>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs]
>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew:
>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K),
>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew:
>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K),
>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs]
>>>>
>>>> On a fourth node, I've found a different event: promotion failure
>>>> during CMS, with a much smaller size:
>>>>
>>>> node-001
>>>> -------
>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew:
>>>> 354039K->40960K(368640K), 0.0667340 secs]
>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37
>>>> sys=0.01, real=0.06 secs]
>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew:
>>>> 368640K->40960K(368640K), 0.2586390 secs]
>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73
>>>> sys=0.13, real=0.26 secs]
>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark:
>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times:
>>>> user=0.07 sys=0.00, real=0.07 secs]
>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start]
>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark:
>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs]
>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start]
>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean:
>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs]
>>>> 2012-01-10T18:30:10.089+0100: 48431.382:
>>>> [CMS-concurrent-abortable-preclean-start]
>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew:
>>>> 368640K->40960K(368640K), 0.1214420 secs]
>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66
>>>> sys=0.05, real=0.12 secs]
>>>> 2012-01-10T18:30:12.785+0100: 48434.078:
>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times:
>>>> user=10.72 sys=0.48, real=2.70 secs]
>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K
>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081:
>>>> [ParNew (promotion failure size = 1026)  (promotion failed):
>>>> 206521K->206521K(368640K), 0.1667280 secs]
>>>>   3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48
>>>> sys=0.04, real=0.17 secs]
>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs
>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750
>>>> secs]48434.474: [scrub symbol&    string tables, 0.0088370 secs] [1
>>>> CMS-remark: 3489675K(4833280K)] 36961
>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41
>>>> secs]
>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start]
>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720:
>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep:
>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs]
>>>>   (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050
>>>> secs] 2873988K->334385K(5201920K), [CMS Perm :
>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61
>>>> sys=0.00, real=8.61 secs]
>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew:
>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K),
>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs]
>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew:
>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K),
>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
>>>>
>>>> I assume that the large sizes for the promotion failures during ParNew
>>>> are confirming that eliminating large array allocations might help
>>>> here. Do you agree?
>>>> I'm not sure what to make of the concurrent mode failure.
>>>>
>>>> Thanks in advance for any suggestions,
>>>> Taras
>>>>
>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu<jon.masamitsu at oracle.com>
>>>>    wrote:
>>>>>
>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote:
>>>>>>
>>>>>> Hi Jon,
>>>>>>
>>>>>> We've enabled the PrintPromotionFailure flag in our preprod
>>>>>> environment, but so far, no failures yet.
>>>>>> We know that the load we generate there is not representative. But
>>>>>> perhaps we'll catch something, given enough patience.
>>>>>>
>>>>>> The flag will also be enabled in our production environment next week
>>>>>> - so one way or the other, we'll get more diagnostic data soon.
>>>>>> I'll also do some allocation profiling of the application in isolation
>>>>>> - I know that there is abusive large byte[] and char[] allocation in
>>>>>> there.
>>>>>>
>>>>>> I've got two questions for now:
>>>>>>
>>>>>> 1) From googling around on the output to expect
>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html),
>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this:
>>>>>> -------
>>>>>> 592.079: [ParNew (0: promotion failure size = 2698)  (promotion
>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs]
>>>>>> -------
>>>>>> In that example line, what does the "0" stand for?
>>>>>
>>>>> It's the index of the GC worker thread  that experienced the promotion
>>>>> failure.
>>>>>
>>>>>> 2) Below is a snippet of (real) gc log from our production
>>>>>> application:
>>>>>> -------
>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew:
>>>>>> 345951K->40960K(368640K), 0.0676780 secs]
>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36
>>>>>> sys=0.01, real=0.06 secs]
>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew:
>>>>>> 368640K->40959K(368640K), 0.0618880 secs]
>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31
>>>>>> sys=0.00, real=0.06 secs]
>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark:
>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times:
>>>>>> user=0.04 sys=0.00, real=0.04 secs]
>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start]
>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark:
>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs]
>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978:
>>>>>> [CMS-concurrent-preclean-start]
>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean:
>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001:
>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>    CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100:
>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs]
>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs]
>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K
>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242:
>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs]
>>>>>> 3432839K->3423755K(5201920
>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs]
>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak
>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading,
>>>>>> 0.0289480 secs]2136605.833: [scrub symbol&      string tables,
>>>>>> 0.0093940
>>>>>> secs] [1 CMS-remark: 3318289K(4833280K
>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10,
>>>>>> real=7.61 secs]
>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850:
>>>>>> [CMS-concurrent-sweep-start]
>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908:
>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep:
>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs]
>>>>>>    (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040
>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm :
>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29
>>>>>> sys=0.00, real=10.29 secs]
>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew:
>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K),
>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew:
>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K),
>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs]
>>>>>> -------
>>>>>>
>>>>>> In this case I don't know how to interpret the output.
>>>>>> a) There's a promotion failure that took 7.49 secs
>>>>>
>>>>> This is the time it took to attempt the minor collection (ParNew) and
>>>>> to
>>>>> do recovery
>>>>> from the failure.
>>>>>
>>>>>> b) There's a full GC that took 14.08 secs
>>>>>> c) There's a concurrent mode failure that took 10.29 secs
>>>>>
>>>>> Not sure about b) and c) because the output is mixed up with the
>>>>> concurrent-sweep
>>>>> output but  I think the "concurrent mode failure" message is part of
>>>>> the
>>>>> "Full GC"
>>>>> message.  My guess is that the 10.29 is the time for the Full GC and
>>>>> the
>>>>> 14.08
>>>>> maybe is part of the concurrent-sweep message.  Really hard to be sure.
>>>>>
>>>>> Jon
>>>>>>
>>>>>> How are these events, and their (real) times related to each other?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Taras
>>>>>>
>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon
>>>>>> Masamitsu<jon.masamitsu at oracle.com>      wrote:
>>>>>>>
>>>>>>> Taras,
>>>>>>>
>>>>>>> PrintPromotionFailure seems like it would go a long
>>>>>>> way to identify the root of your promotion failures (or
>>>>>>> at least eliminating some possible causes).    I think it
>>>>>>> would help focus the discussion if you could send
>>>>>>> the result of that experiment early.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We're running an application with the CMS/ParNew collectors that is
>>>>>>>> experiencing occasional promotion failures.
>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode.
>>>>>>>> I've listed the specific JVM options used below (a).
>>>>>>>>
>>>>>>>> The application is deployed across a handful of machines, and the
>>>>>>>> promotion failures are fairly uniform across those.
>>>>>>>>
>>>>>>>> The first kind of failure we observe is a promotion failure during
>>>>>>>> ParNew collection, I've included a snipped from the gc log below
>>>>>>>> (b).
>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps
>>>>>>>> triggered by the same cause), see (c) below.
>>>>>>>> The frequency (after running for a some weeks) is approximately once
>>>>>>>> per day. This is bearable, but obviously we'd like to improve on
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Apart from high-volume request handling (which allocates a lot of
>>>>>>>> small objects), the application also runs a few dozen background
>>>>>>>> threads that download and process XML documents, typically in the
>>>>>>>> 5-30
>>>>>>>> MB range.
>>>>>>>> A known deficiency in the existing code is that the XML content is
>>>>>>>> copied twice before processing (once to a byte[], and later again to
>>>>>>>> a
>>>>>>>> String/char[]).
>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB
>>>>>>>> java.lang.String/char[], my suspicion is that these big array
>>>>>>>> allocations are causing us to run into the CMS fragmentation issue.
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to
>>>>>>>> conclude that CMS fragmentation is the cause of the promotion
>>>>>>>> failure?
>>>>>>>> 2) If not, what's the next step of investigating the cause?
>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get
>>>>>>>> a
>>>>>>>> feeling for the size of the objects that fail promotion.
>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only
>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the
>>>>>>>> case?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Taras
>>>>>>>>
>>>>>>>> a) Current JVM options:
>>>>>>>> --------------------------------
>>>>>>>> -server
>>>>>>>> -Xms5g
>>>>>>>> -Xmx5g
>>>>>>>> -Xmn400m
>>>>>>>> -XX:PermSize=256m
>>>>>>>> -XX:MaxPermSize=256m
>>>>>>>> -XX:+PrintGCTimeStamps
>>>>>>>> -verbose:gc
>>>>>>>> -XX:+PrintGCDateStamps
>>>>>>>> -XX:+PrintGCDetails
>>>>>>>> -XX:SurvivorRatio=8
>>>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>>> -XX:+UseParNewGC
>>>>>>>> -XX:+DisableExplicitGC
>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>>> -XX:+CMSClassUnloadingEnabled
>>>>>>>> -XX:+CMSScavengeBeforeRemark
>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68
>>>>>>>> -Xloggc:gc.log
>>>>>>>> --------------------------------
>>>>>>>>
>>>>>>>> b) Promotion failure during ParNew
>>>>>>>> --------------------------------
>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew:
>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs]
>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39
>>>>>>>> sys=0.01, real=0.07 secs]
>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew:
>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs]
>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28
>>>>>>>> sys=0.00, real=0.05 secs]
>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew:
>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs]
>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19
>>>>>>>> sys=0.00, real=0.03 secs]
>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew
>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200
>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs]
>>>>>>>> 3505808K->434291K
>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs]
>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs]
>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew:
>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs]
>>>>>>>> 761971K->514584K(5201920K),
>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs]
>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew:
>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs]
>>>>>>>> 842264K->625681K(5201920K),
>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs]
>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew:
>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs]
>>>>>>>> 953361K->684121K(5201920K),
>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs]
>>>>>>>> --------------------------------
>>>>>>>>
>>>>>>>> c) Promotion failure during CMS
>>>>>>>> --------------------------------
>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew:
>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs]
>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37
>>>>>>>> sys=0.00, real=0.05 secs]
>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew:
>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs]
>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24
>>>>>>>> sys=0.01, real=0.05 secs]
>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew:
>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs]
>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30
>>>>>>>> sys=0.00, real=0.04 secs]
>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark:
>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times:
>>>>>>>> user=0.02 sys=0.00, real=0.03 secs]
>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529:
>>>>>>>> [CMS-concurrent-mark-start]
>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew:
>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs]
>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50
>>>>>>>> sys=0.01, real=0.08 secs]
>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark:
>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs]
>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729:
>>>>>>>> [CMS-concurrent-preclean-start]
>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean:
>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs]
>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840:
>>>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239:
>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times:
>>>>>>>> user=6.68 sys=0.27, real=1.40 secs]
>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K
>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244:
>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020
>>>>>>>> secs]
>>>>>>>>     3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00
>>>>>>>> sys=2.58, real=9.88 secs]
>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak
>>>>>>>> refs
>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610
>>>>>>>> secs]703031.419: [scrub symbol&        string tables, 0.0094960
>>>>>>>> secs] [1 CMS
>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs]
>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs]
>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436:
>>>>>>>> [CMS-concurrent-sweep-start]
>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493:
>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep:
>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs]
>>>>>>>>     (concurrent mode failure): 3370829K->433437K(4833280K),
>>>>>>>> 10.9594300
>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm :
>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95
>>>>>>>> sys=0.00, real=10.96 secs]
>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew:
>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs]
>>>>>>>> 761117K->517836K(5201920K),
>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs]
>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew:
>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs]
>>>>>>>> 845516K->557872K(5201920K),
>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew:
>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs]
>>>>>>>> 885552K->603017K(5201920K),
>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs]
>>>>>>>> --------------------------------
>>>>>>>> _______________________________________________
>>>>>>>> hotspot-gc-use mailing list
>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> hotspot-gc-use mailing list
>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>
>>>>>> _______________________________________________
>>>>>> hotspot-gc-use mailing list
>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>
>>>> _______________________________________________
>>>> hotspot-gc-use mailing list
>>>> hotspot-gc-use at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use