Promotion failures: indication of CMS fragmentation?

Wed Jan 25 02:41:54 PST 2012

Hi Jon,

At the risk of asking a stupid question, what's the word size on x64
when using CompressedOops?

Thanks,
Taras

On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu
<jon.masamitsu at oracle.com> wrote:
>
>
> On 01/24/12 10:15, Taras Tielkes wrote:
>> Hi Jon,
>>
>> Xmx is 5g, PermGen is 256m, new is 400m.
>>
>> The overall tenured gen usage is at the point when I would expect the
>> CMS to kick in though.
>> Does this mean I'd have to lower the CMS initiating occupancy setting
>> (currently at 68%)?
>
> I don't have any quick answers as to what to try next.
>
>>
>> In addition, are the promotion failure sizes expressed in bytes? If
>> so, I'm surprised to see such odd-sized (7, for example) sizes.
>
> It's in words.
>
> Jon
>>
>> Thanks,
>> Taras
>>
>> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu<jon.masamitsu at oracle.com>  wrote:
>>> Taras,
>>>
>>> The pattern makes sense if the tenured (cms) gen is in fact full.
>>> Multiple  GC workers try to get a chunk of space for
>>> an allocation and there is no space.
>>>
>>> Jon
>>>
>>>
>>> On 01/24/12 04:22, Taras Tielkes wrote:
>>>> Hi Jon,
>>>>
>>>> While inspecting our production logs for promotion failures, I saw the
>>>> following one today:
>>>> --------
>>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew:
>>>> 349623K->20008K(368640K), 0.0294350 secs]
>>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21
>>>> sys=0.00, real=0.03 secs]
>>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew:
>>>> 347688K->40960K(368640K), 0.0536700 secs]
>>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36
>>>> sys=0.00, real=0.05 secs]
>>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew
>>>> (0: promotion failure size = 6)  (1: promotion failure size = 6)  (2:
>>>> promotion failure size = 7)  (3: promotion failure size = 7)  (4:
>>>> promotion failure size = 9)  (5: p
>>>> romotion failure size = 9)  (6: promotion failure size = 6)  (7:
>>>> promotion failure size = 9)  (promotion failed):
>>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS:
>>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K(
>>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs]
>>>> [Times: user=10.17 sys=1.10, real=9.11 secs]
>>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew:
>>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K),
>>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs]
>>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew:
>>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K),
>>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>>>> --------
>>>>
>>>> It's different from the others in two ways:
>>>> 1) a "parallel" promotion failure in all 8 ParNew threads?
>>>> 2) the very small size of the promoted object
>>>>
>>>> Do such an promotion failure pattern ring a bell? It does not make sense
>>>> to me.
>>>>
>>>> Thanks,
>>>> Taras
>>>>
>>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu<jon.masamitsu at oracle.com>
>>>>   wrote:
>>>>> Taras,
>>>>>
>>>>>> I assume that the large sizes for the promotion failures during ParNew
>>>>>> are confirming that eliminating large array allocations might help
>>>>>> here. Do you agree?
>>>>> I agree that eliminating the large array allocation will help but you
>>>>> are still having
>>>>> promotion failures when the allocation size is small (I think it was
>>>>> 1026).  That
>>>>> says that you are filling up the old (cms) generation faster than the GC
>>>>> can
>>>>> collect it.  The large arrays are aggrevating the problem but not
>>>>> necessarily
>>>>> the cause.
>>>>>
>>>>> If these are still your heap sizes,
>>>>>
>>>>>> -Xms5g
>>>>>> -Xmx5g
>>>>>> -Xmn400m
>>>>> Start by increasing the young gen size as may already have been
>>>>> suggested.  If you have a test setup where you can experiment,
>>>>> try doubling the young gen size to start.
>>>>>
>>>>> If you have not seen this, it might be helpful.
>>>>>
>>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a
>>>>>> I'm not sure what to make of the concurrent mode
>>>>> The concurrent mode failure is a consequence of the promotion failure.
>>>>> Once the promotion failure happens the concurrent mode failure is
>>>>> inevitable.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>>> .
>>>>>
>>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote:
>>>>>> Hi Jon,
>>>>>>
>>>>>> We've added the -XX:+PrintPromotionFailure flag to our production
>>>>>> application yesterday.
>>>>>> The application is running on 4 (homogenous) nodes.
>>>>>>
>>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure
>>>>>> event during ParNew:
>>>>>>
>>>>>> node-002
>>>>>> -------
>>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew:
>>>>>> 357592K->23382K(368640K), 0.0298150 secs]
>>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22
>>>>>> sys=0.01, real=0.03 secs]
>>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew:
>>>>>> 351062K->39795K(368640K), 0.0401170 secs]
>>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28
>>>>>> sys=0.00, real=0.04 secs]
>>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4:
>>>>>> promotion failure size = 4281460)  (promotion failed):
>>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS:
>>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K
>>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590
>>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs]
>>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew:
>>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K),
>>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs]
>>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew:
>>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K),
>>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs]
>>>>>>
>>>>>> node-003
>>>>>> -------
>>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew:
>>>>>> 346950K->21342K(368640K), 0.0333090 secs]
>>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23
>>>>>> sys=0.00, real=0.03 secs]
>>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew:
>>>>>> 345070K->32211K(368640K), 0.0369260 secs]
>>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25
>>>>>> sys=0.00, real=0.04 secs]
>>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0:
>>>>>> promotion failure size = 1266955)  (promotion failed):
>>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS:
>>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3
>>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640
>>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs]
>>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew:
>>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K),
>>>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs]
>>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew:
>>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K),
>>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs]
>>>>>>
>>>>>> node-004
>>>>>> -------
>>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew:
>>>>>> 358429K->40960K(368640K), 0.0629910 secs]
>>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40
>>>>>> sys=0.02, real=0.06 secs]
>>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew:
>>>>>> 368640K->40960K(368640K), 0.0819780 secs]
>>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40
>>>>>> sys=0.00, real=0.08 secs]
>>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6:
>>>>>> promotion failure size = 2788662)  (promotion failed):
>>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS:
>>>>>> 3310044K->330922K(4833280K), 4.5104170 secs]
>>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)],
>>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs]
>>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew:
>>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K),
>>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
>>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew:
>>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K),
>>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs]
>>>>>>
>>>>>> On a fourth node, I've found a different event: promotion failure
>>>>>> during CMS, with a much smaller size:
>>>>>>
>>>>>> node-001
>>>>>> -------
>>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew:
>>>>>> 354039K->40960K(368640K), 0.0667340 secs]
>>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37
>>>>>> sys=0.01, real=0.06 secs]
>>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew:
>>>>>> 368640K->40960K(368640K), 0.2586390 secs]
>>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73
>>>>>> sys=0.13, real=0.26 secs]
>>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark:
>>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times:
>>>>>> user=0.07 sys=0.00, real=0.07 secs]
>>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start]
>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark:
>>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs]
>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start]
>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean:
>>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs]
>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382:
>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew:
>>>>>> 368640K->40960K(368640K), 0.1214420 secs]
>>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66
>>>>>> sys=0.05, real=0.12 secs]
>>>>>> 2012-01-10T18:30:12.785+0100: 48434.078:
>>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times:
>>>>>> user=10.72 sys=0.48, real=2.70 secs]
>>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K
>>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081:
>>>>>> [ParNew (promotion failure size = 1026)  (promotion failed):
>>>>>> 206521K->206521K(368640K), 0.1667280 secs]
>>>>>>    3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48
>>>>>> sys=0.04, real=0.17 secs]
>>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs
>>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750
>>>>>> secs]48434.474: [scrub symbol&      string tables, 0.0088370 secs] [1
>>>>>> CMS-remark: 3489675K(4833280K)] 36961
>>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41
>>>>>> secs]
>>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start]
>>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720:
>>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep:
>>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs]
>>>>>>    (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050
>>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm :
>>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61
>>>>>> sys=0.00, real=8.61 secs]
>>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew:
>>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K),
>>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs]
>>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew:
>>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K),
>>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
>>>>>>
>>>>>> I assume that the large sizes for the promotion failures during ParNew
>>>>>> are confirming that eliminating large array allocations might help
>>>>>> here. Do you agree?
>>>>>> I'm not sure what to make of the concurrent mode failure.
>>>>>>
>>>>>> Thanks in advance for any suggestions,
>>>>>> Taras
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu<jon.masamitsu at oracle.com>
>>>>>>     wrote:
>>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote:
>>>>>>>> Hi Jon,
>>>>>>>>
>>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod
>>>>>>>> environment, but so far, no failures yet.
>>>>>>>> We know that the load we generate there is not representative. But
>>>>>>>> perhaps we'll catch something, given enough patience.
>>>>>>>>
>>>>>>>> The flag will also be enabled in our production environment next week
>>>>>>>> - so one way or the other, we'll get more diagnostic data soon.
>>>>>>>> I'll also do some allocation profiling of the application in isolation
>>>>>>>> - I know that there is abusive large byte[] and char[] allocation in
>>>>>>>> there.
>>>>>>>>
>>>>>>>> I've got two questions for now:
>>>>>>>>
>>>>>>>> 1) From googling around on the output to expect
>>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html),
>>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this:
>>>>>>>> -------
>>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698)  (promotion
>>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs]
>>>>>>>> -------
>>>>>>>> In that example line, what does the "0" stand for?
>>>>>>> It's the index of the GC worker thread  that experienced the promotion
>>>>>>> failure.
>>>>>>>
>>>>>>>> 2) Below is a snippet of (real) gc log from our production
>>>>>>>> application:
>>>>>>>> -------
>>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew:
>>>>>>>> 345951K->40960K(368640K), 0.0676780 secs]
>>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36
>>>>>>>> sys=0.01, real=0.06 secs]
>>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew:
>>>>>>>> 368640K->40959K(368640K), 0.0618880 secs]
>>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31
>>>>>>>> sys=0.00, real=0.06 secs]
>>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark:
>>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times:
>>>>>>>> user=0.04 sys=0.00, real=0.04 secs]
>>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start]
>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark:
>>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs]
>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978:
>>>>>>>> [CMS-concurrent-preclean-start]
>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean:
>>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001:
>>>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>>>     CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100:
>>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs]
>>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs]
>>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K
>>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242:
>>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs]
>>>>>>>> 3432839K->3423755K(5201920
>>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs]
>>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak
>>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading,
>>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol&        string tables,
>>>>>>>> 0.0093940
>>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K
>>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10,
>>>>>>>> real=7.61 secs]
>>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850:
>>>>>>>> [CMS-concurrent-sweep-start]
>>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908:
>>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep:
>>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs]
>>>>>>>>     (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040
>>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm :
>>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29
>>>>>>>> sys=0.00, real=10.29 secs]
>>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew:
>>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K),
>>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
>>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew:
>>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K),
>>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs]
>>>>>>>> -------
>>>>>>>>
>>>>>>>> In this case I don't know how to interpret the output.
>>>>>>>> a) There's a promotion failure that took 7.49 secs
>>>>>>> This is the time it took to attempt the minor collection (ParNew) and
>>>>>>> to
>>>>>>> do recovery
>>>>>>> from the failure.
>>>>>>>
>>>>>>>> b) There's a full GC that took 14.08 secs
>>>>>>>> c) There's a concurrent mode failure that took 10.29 secs
>>>>>>> Not sure about b) and c) because the output is mixed up with the
>>>>>>> concurrent-sweep
>>>>>>> output but  I think the "concurrent mode failure" message is part of
>>>>>>> the
>>>>>>> "Full GC"
>>>>>>> message.  My guess is that the 10.29 is the time for the Full GC and
>>>>>>> the
>>>>>>> 14.08
>>>>>>> maybe is part of the concurrent-sweep message.  Really hard to be sure.
>>>>>>>
>>>>>>> Jon
>>>>>>>> How are these events, and their (real) times related to each other?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Taras
>>>>>>>>
>>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon
>>>>>>>> Masamitsu<jon.masamitsu at oracle.com>        wrote:
>>>>>>>>> Taras,
>>>>>>>>>
>>>>>>>>> PrintPromotionFailure seems like it would go a long
>>>>>>>>> way to identify the root of your promotion failures (or
>>>>>>>>> at least eliminating some possible causes).    I think it
>>>>>>>>> would help focus the discussion if you could send
>>>>>>>>> the result of that experiment early.
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We're running an application with the CMS/ParNew collectors that is
>>>>>>>>>> experiencing occasional promotion failures.
>>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode.
>>>>>>>>>> I've listed the specific JVM options used below (a).
>>>>>>>>>>
>>>>>>>>>> The application is deployed across a handful of machines, and the
>>>>>>>>>> promotion failures are fairly uniform across those.
>>>>>>>>>>
>>>>>>>>>> The first kind of failure we observe is a promotion failure during
>>>>>>>>>> ParNew collection, I've included a snipped from the gc log below
>>>>>>>>>> (b).
>>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps
>>>>>>>>>> triggered by the same cause), see (c) below.
>>>>>>>>>> The frequency (after running for a some weeks) is approximately once
>>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> Apart from high-volume request handling (which allocates a lot of
>>>>>>>>>> small objects), the application also runs a few dozen background
>>>>>>>>>> threads that download and process XML documents, typically in the
>>>>>>>>>> 5-30
>>>>>>>>>> MB range.
>>>>>>>>>> A known deficiency in the existing code is that the XML content is
>>>>>>>>>> copied twice before processing (once to a byte[], and later again to
>>>>>>>>>> a
>>>>>>>>>> String/char[]).
>>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB
>>>>>>>>>> java.lang.String/char[], my suspicion is that these big array
>>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue.
>>>>>>>>>>
>>>>>>>>>> My questions are:
>>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to
>>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion
>>>>>>>>>> failure?
>>>>>>>>>> 2) If not, what's the next step of investigating the cause?
>>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get
>>>>>>>>>> a
>>>>>>>>>> feeling for the size of the objects that fail promotion.
>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only
>>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the
>>>>>>>>>> case?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>> Taras
>>>>>>>>>>
>>>>>>>>>> a) Current JVM options:
>>>>>>>>>> --------------------------------
>>>>>>>>>> -server
>>>>>>>>>> -Xms5g
>>>>>>>>>> -Xmx5g
>>>>>>>>>> -Xmn400m
>>>>>>>>>> -XX:PermSize=256m
>>>>>>>>>> -XX:MaxPermSize=256m
>>>>>>>>>> -XX:+PrintGCTimeStamps
>>>>>>>>>> -verbose:gc
>>>>>>>>>> -XX:+PrintGCDateStamps
>>>>>>>>>> -XX:+PrintGCDetails
>>>>>>>>>> -XX:SurvivorRatio=8
>>>>>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>>>>> -XX:+UseParNewGC
>>>>>>>>>> -XX:+DisableExplicitGC
>>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>>>>> -XX:+CMSClassUnloadingEnabled
>>>>>>>>>> -XX:+CMSScavengeBeforeRemark
>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68
>>>>>>>>>> -Xloggc:gc.log
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>> b) Promotion failure during ParNew
>>>>>>>>>> --------------------------------
>>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew:
>>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs]
>>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39
>>>>>>>>>> sys=0.01, real=0.07 secs]
>>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew:
>>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs]
>>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28
>>>>>>>>>> sys=0.00, real=0.05 secs]
>>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew:
>>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs]
>>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19
>>>>>>>>>> sys=0.00, real=0.03 secs]
>>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew
>>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200
>>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs]
>>>>>>>>>> 3505808K->434291K
>>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs]
>>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs]
>>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew:
>>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs]
>>>>>>>>>> 761971K->514584K(5201920K),
>>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs]
>>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew:
>>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs]
>>>>>>>>>> 842264K->625681K(5201920K),
>>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs]
>>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew:
>>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs]
>>>>>>>>>> 953361K->684121K(5201920K),
>>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs]
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>> c) Promotion failure during CMS
>>>>>>>>>> --------------------------------
>>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew:
>>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs]
>>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37
>>>>>>>>>> sys=0.00, real=0.05 secs]
>>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew:
>>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs]
>>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24
>>>>>>>>>> sys=0.01, real=0.05 secs]
>>>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew:
>>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs]
>>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30
>>>>>>>>>> sys=0.00, real=0.04 secs]
>>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark:
>>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times:
>>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs]
>>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529:
>>>>>>>>>> [CMS-concurrent-mark-start]
>>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew:
>>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs]
>>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50
>>>>>>>>>> sys=0.01, real=0.08 secs]
>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark:
>>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs]
>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729:
>>>>>>>>>> [CMS-concurrent-preclean-start]
>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean:
>>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs]
>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840:
>>>>>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239:
>>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times:
>>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs]
>>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K
>>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244:
>>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020
>>>>>>>>>> secs]
>>>>>>>>>>      3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00
>>>>>>>>>> sys=2.58, real=9.88 secs]
>>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak
>>>>>>>>>> refs
>>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610
>>>>>>>>>> secs]703031.419: [scrub symbol&          string tables, 0.0094960
>>>>>>>>>> secs] [1 CMS
>>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs]
>>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs]
>>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436:
>>>>>>>>>> [CMS-concurrent-sweep-start]
>>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493:
>>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep:
>>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs]
>>>>>>>>>>      (concurrent mode failure): 3370829K->433437K(4833280K),
>>>>>>>>>> 10.9594300
>>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm :
>>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95
>>>>>>>>>> sys=0.00, real=10.96 secs]
>>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew:
>>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs]
>>>>>>>>>> 761117K->517836K(5201920K),
>>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs]
>>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew:
>>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs]
>>>>>>>>>> 845516K->557872K(5201920K),
>>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew:
>>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs]
>>>>>>>>>> 885552K->603017K(5201920K),
>>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs]
>>>>>>>>>> --------------------------------
>>>>>>>>>> _______________________________________________
>>>>>>>>>> hotspot-gc-use mailing list
>>>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>>>> _______________________________________________
>>>>>>>>> hotspot-gc-use mailing list
>>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>>> _______________________________________________
>>>>>>>> hotspot-gc-use mailing list
>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>> _______________________________________________
>>>>>>> hotspot-gc-use mailing list
>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>> _______________________________________________
>>>>>> hotspot-gc-use mailing list
>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use