Promotion failures: indication of CMS fragmentation?

Wed Jan 25 22:49:49 PST 2012

On 1/25/2012 2:41 AM, Taras Tielkes wrote:
> Hi Jon,
>
> At the risk of asking a stupid question, what's the word size on x64
> when using CompressedOops?

Word size is the same with and without CompressedOops (8 bytes).  With  
CompressedOops
we can just point to words with a 32bit reference (i.e., map the 32bit 
reference to a full
64bit address).

Jon

> Thanks,
> Taras
>
> On Tue, Jan 24, 2012 at 11:21 PM, Jon Masamitsu
> <jon.masamitsu at oracle.com>  wrote:
>> On 01/24/12 10:15, Taras Tielkes wrote:
>>> Hi Jon,
>>>
>>> Xmx is 5g, PermGen is 256m, new is 400m.
>>>
>>> The overall tenured gen usage is at the point when I would expect the
>>> CMS to kick in though.
>>> Does this mean I'd have to lower the CMS initiating occupancy setting
>>> (currently at 68%)?
>> I don't have any quick answers as to what to try next.
>>
>>> In addition, are the promotion failure sizes expressed in bytes? If
>>> so, I'm surprised to see such odd-sized (7, for example) sizes.
>> It's in words.
>>
>> Jon
>>> Thanks,
>>> Taras
>>>
>>> On Tue, Jan 24, 2012 at 4:30 PM, Jon Masamitsu<jon.masamitsu at oracle.com>    wrote:
>>>> Taras,
>>>>
>>>> The pattern makes sense if the tenured (cms) gen is in fact full.
>>>> Multiple  GC workers try to get a chunk of space for
>>>> an allocation and there is no space.
>>>>
>>>> Jon
>>>>
>>>>
>>>> On 01/24/12 04:22, Taras Tielkes wrote:
>>>>> Hi Jon,
>>>>>
>>>>> While inspecting our production logs for promotion failures, I saw the
>>>>> following one today:
>>>>> --------
>>>>> 2012-01-24T08:37:26.118+0100: 1222467.411: [GC 1222467.411: [ParNew:
>>>>> 349623K->20008K(368640K), 0.0294350 secs]
>>>>> 3569266K->3239650K(5201920K), 0.0298770 secs] [Times: user=0.21
>>>>> sys=0.00, real=0.03 secs]
>>>>> 2012-01-24T08:37:27.497+0100: 1222468.790: [GC 1222468.791: [ParNew:
>>>>> 347688K->40960K(368640K), 0.0536700 secs]
>>>>> 3567330K->3284097K(5201920K), 0.0541200 secs] [Times: user=0.36
>>>>> sys=0.00, real=0.05 secs]
>>>>> 2012-01-24T08:37:28.716+0100: 1222470.009: [GC 1222470.010: [ParNew
>>>>> (0: promotion failure size = 6)  (1: promotion failure size = 6)  (2:
>>>>> promotion failure size = 7)  (3: promotion failure size = 7)  (4:
>>>>> promotion failure size = 9)  (5: p
>>>>> romotion failure size = 9)  (6: promotion failure size = 6)  (7:
>>>>> promotion failure size = 9)  (promotion failed):
>>>>> 368640K->368640K(368640K), 3.1475760 secs]1222473.157: [CMS:
>>>>> 3315844K->559299K(4833280K), 5.9647110 secs] 3611777K->559299K(
>>>>> 5201920K), [CMS Perm : 118085K->118072K(262144K)], 9.1128700 secs]
>>>>> [Times: user=10.17 sys=1.10, real=9.11 secs]
>>>>> 2012-01-24T08:37:38.601+0100: 1222479.894: [GC 1222479.895: [ParNew:
>>>>> 327680K->40960K(368640K), 0.0635680 secs] 886979K->624773K(5201920K),
>>>>> 0.0641090 secs] [Times: user=0.37 sys=0.00, real=0.07 secs]
>>>>> 2012-01-24T08:37:40.642+0100: 1222481.936: [GC 1222481.936: [ParNew:
>>>>> 368640K->38479K(368640K), 0.0771480 secs] 952453K->659708K(5201920K),
>>>>> 0.0776360 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>>>>> --------
>>>>>
>>>>> It's different from the others in two ways:
>>>>> 1) a "parallel" promotion failure in all 8 ParNew threads?
>>>>> 2) the very small size of the promoted object
>>>>>
>>>>> Do such an promotion failure pattern ring a bell? It does not make sense
>>>>> to me.
>>>>>
>>>>> Thanks,
>>>>> Taras
>>>>>
>>>>> On Wed, Jan 11, 2012 at 5:54 PM, Jon Masamitsu<jon.masamitsu at oracle.com>
>>>>>    wrote:
>>>>>> Taras,
>>>>>>
>>>>>>> I assume that the large sizes for the promotion failures during ParNew
>>>>>>> are confirming that eliminating large array allocations might help
>>>>>>> here. Do you agree?
>>>>>> I agree that eliminating the large array allocation will help but you
>>>>>> are still having
>>>>>> promotion failures when the allocation size is small (I think it was
>>>>>> 1026).  That
>>>>>> says that you are filling up the old (cms) generation faster than the GC
>>>>>> can
>>>>>> collect it.  The large arrays are aggrevating the problem but not
>>>>>> necessarily
>>>>>> the cause.
>>>>>>
>>>>>> If these are still your heap sizes,
>>>>>>
>>>>>>> -Xms5g
>>>>>>> -Xmx5g
>>>>>>> -Xmn400m
>>>>>> Start by increasing the young gen size as may already have been
>>>>>> suggested.  If you have a test setup where you can experiment,
>>>>>> try doubling the young gen size to start.
>>>>>>
>>>>>> If you have not seen this, it might be helpful.
>>>>>>
>>>>>> http://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a
>>>>>>> I'm not sure what to make of the concurrent mode
>>>>>> The concurrent mode failure is a consequence of the promotion failure.
>>>>>> Once the promotion failure happens the concurrent mode failure is
>>>>>> inevitable.
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>>
>>>>>>> .
>>>>>> On 1/11/2012 3:00 AM, Taras Tielkes wrote:
>>>>>>> Hi Jon,
>>>>>>>
>>>>>>> We've added the -XX:+PrintPromotionFailure flag to our production
>>>>>>> application yesterday.
>>>>>>> The application is running on 4 (homogenous) nodes.
>>>>>>>
>>>>>>> In the gc logs of 3 out of 4 nodes, I've found a promotion failure
>>>>>>> event during ParNew:
>>>>>>>
>>>>>>> node-002
>>>>>>> -------
>>>>>>> 2012-01-11T09:39:14.353+0100: 102975.594: [GC 102975.594: [ParNew:
>>>>>>> 357592K->23382K(368640K), 0.0298150 secs]
>>>>>>> 3528237K->3194027K(5201920K), 0.0300860 secs] [Times: user=0.22
>>>>>>> sys=0.01, real=0.03 secs]
>>>>>>> 2012-01-11T09:39:17.489+0100: 102978.730: [GC 102978.730: [ParNew:
>>>>>>> 351062K->39795K(368640K), 0.0401170 secs]
>>>>>>> 3521707K->3210439K(5201920K), 0.0403800 secs] [Times: user=0.28
>>>>>>> sys=0.00, real=0.04 secs]
>>>>>>> 2012-01-11T09:39:19.869+0100: 102981.110: [GC 102981.110: [ParNew (4:
>>>>>>> promotion failure size = 4281460)  (promotion failed):
>>>>>>> 350134K->340392K(368640K), 0.1378780 secs]102981.248: [CMS:
>>>>>>> 3181346K->367952K(4833280K), 4.7036230 secs] 3520778K
>>>>>>> ->367952K(5201920K), [CMS Perm : 116828K->116809K(262144K)], 4.8418590
>>>>>>> secs] [Times: user=5.10 sys=0.00, real=4.84 secs]
>>>>>>> 2012-01-11T09:39:25.264+0100: 102986.504: [GC 102986.505: [ParNew:
>>>>>>> 327680K->40960K(368640K), 0.0415470 secs] 695632K->419560K(5201920K),
>>>>>>> 0.0418770 secs] [Times: user=0.26 sys=0.01, real=0.04 secs]
>>>>>>> 2012-01-11T09:39:26.035+0100: 102987.276: [GC 102987.276: [ParNew:
>>>>>>> 368640K->40960K(368640K), 0.0925740 secs] 747240K->481611K(5201920K),
>>>>>>> 0.0928570 secs] [Times: user=0.54 sys=0.01, real=0.09 secs]
>>>>>>>
>>>>>>> node-003
>>>>>>> -------
>>>>>>> 2012-01-10T17:48:28.369+0100: 45929.686: [GC 45929.686: [ParNew:
>>>>>>> 346950K->21342K(368640K), 0.0333090 secs]
>>>>>>> 2712364K->2386756K(5201920K), 0.0335740 secs] [Times: user=0.23
>>>>>>> sys=0.00, real=0.03 secs]
>>>>>>> 2012-01-10T17:48:32.933+0100: 45934.250: [GC 45934.250: [ParNew:
>>>>>>> 345070K->32211K(368640K), 0.0369260 secs]
>>>>>>> 2710484K->2397625K(5201920K), 0.0372380 secs] [Times: user=0.25
>>>>>>> sys=0.00, real=0.04 secs]
>>>>>>> 2012-01-10T17:48:34.201+0100: 45935.518: [GC 45935.518: [ParNew (0:
>>>>>>> promotion failure size = 1266955)  (promotion failed):
>>>>>>> 359891K->368640K(368640K), 0.1395570 secs]45935.658: [CMS:
>>>>>>> 2387690K->348838K(4833280K), 4.5680670 secs] 2725305K->3
>>>>>>> 48838K(5201920K), [CMS Perm : 116740K->116715K(262144K)], 4.7079640
>>>>>>> secs] [Times: user=5.03 sys=0.00, real=4.71 secs]
>>>>>>> 2012-01-10T17:48:40.572+0100: 45941.889: [GC 45941.889: [ParNew:
>>>>>>> 327680K->40960K(368640K), 0.0486510 secs] 676518K->405004K(5201920K),
>>>>>>> 0.0489930 secs] [Times: user=0.26 sys=0.00, real=0.05 secs]
>>>>>>> 2012-01-10T17:48:41.959+0100: 45943.276: [GC 45943.277: [ParNew:
>>>>>>> 360621K->40960K(368640K), 0.0833240 secs] 724666K->479857K(5201920K),
>>>>>>> 0.0836120 secs] [Times: user=0.48 sys=0.01, real=0.08 secs]
>>>>>>>
>>>>>>> node-004
>>>>>>> -------
>>>>>>> 2012-01-10T18:59:02.338+0100: 50163.649: [GC 50163.649: [ParNew:
>>>>>>> 358429K->40960K(368640K), 0.0629910 secs]
>>>>>>> 3569331K->3283304K(5201920K), 0.0632710 secs] [Times: user=0.40
>>>>>>> sys=0.02, real=0.06 secs]
>>>>>>> 2012-01-10T18:59:08.137+0100: 50169.448: [GC 50169.448: [ParNew:
>>>>>>> 368640K->40960K(368640K), 0.0819780 secs]
>>>>>>> 3610984K->3323445K(5201920K), 0.0822430 secs] [Times: user=0.40
>>>>>>> sys=0.00, real=0.08 secs]
>>>>>>> 2012-01-10T18:59:13.945+0100: 50175.256: [GC 50175.256: [ParNew (6:
>>>>>>> promotion failure size = 2788662)  (promotion failed):
>>>>>>> 367619K->364864K(368640K), 0.2024350 secs]50175.458: [CMS:
>>>>>>> 3310044K->330922K(4833280K), 4.5104170 secs]
>>>>>>> 3650104K->330922K(5201920K), [CMS Perm : 116747K->116728K(262144K)],
>>>>>>> 4.7132220 secs] [Times: user=4.99 sys=0.01, real=4.72 secs]
>>>>>>> 2012-01-10T18:59:20.539+0100: 50181.850: [GC 50181.850: [ParNew:
>>>>>>> 327680K->37328K(368640K), 0.0270660 secs] 658602K->368251K(5201920K),
>>>>>>> 0.0273800 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
>>>>>>> 2012-01-10T18:59:25.183+0100: 50186.494: [GC 50186.494: [ParNew:
>>>>>>> 363504K->15099K(368640K), 0.0388710 secs] 694427K->362063K(5201920K),
>>>>>>> 0.0391790 secs] [Times: user=0.18 sys=0.00, real=0.04 secs]
>>>>>>>
>>>>>>> On a fourth node, I've found a different event: promotion failure
>>>>>>> during CMS, with a much smaller size:
>>>>>>>
>>>>>>> node-001
>>>>>>> -------
>>>>>>> 2012-01-10T18:30:07.471+0100: 48428.764: [GC 48428.764: [ParNew:
>>>>>>> 354039K->40960K(368640K), 0.0667340 secs]
>>>>>>> 3609061K->3318149K(5201920K), 0.0670150 secs] [Times: user=0.37
>>>>>>> sys=0.01, real=0.06 secs]
>>>>>>> 2012-01-10T18:30:08.706+0100: 48429.999: [GC 48430.000: [ParNew:
>>>>>>> 368640K->40960K(368640K), 0.2586390 secs]
>>>>>>> 3645829K->3417273K(5201920K), 0.2589050 secs] [Times: user=0.73
>>>>>>> sys=0.13, real=0.26 secs]
>>>>>>> 2012-01-10T18:30:08.974+0100: 48430.267: [GC [1 CMS-initial-mark:
>>>>>>> 3376313K(4833280K)] 3427492K(5201920K), 0.0743900 secs] [Times:
>>>>>>> user=0.07 sys=0.00, real=0.07 secs]
>>>>>>> 2012-01-10T18:30:09.049+0100: 48430.342: [CMS-concurrent-mark-start]
>>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-mark:
>>>>>>> 0.933/0.960 secs] [Times: user=4.59 sys=0.13, real=0.96 secs]
>>>>>>> 2012-01-10T18:30:10.009+0100: 48431.302: [CMS-concurrent-preclean-start]
>>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382: [CMS-concurrent-preclean:
>>>>>>> 0.060/0.080 secs] [Times: user=0.34 sys=0.02, real=0.08 secs]
>>>>>>> 2012-01-10T18:30:10.089+0100: 48431.382:
>>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>> 2012-01-10T18:30:10.586+0100: 48431.880: [GC 48431.880: [ParNew:
>>>>>>> 368640K->40960K(368640K), 0.1214420 secs]
>>>>>>> 3744953K->3490912K(5201920K), 0.1217480 secs] [Times: user=0.66
>>>>>>> sys=0.05, real=0.12 secs]
>>>>>>> 2012-01-10T18:30:12.785+0100: 48434.078:
>>>>>>> [CMS-concurrent-abortable-preclean: 2.526/2.696 secs] [Times:
>>>>>>> user=10.72 sys=0.48, real=2.70 secs]
>>>>>>> 2012-01-10T18:30:12.787+0100: 48434.081: [GC[YG occupancy: 206521 K
>>>>>>> (368640 K)]2012-01-10T18:30:12.788+0100: 48434.081: [GC 48434.081:
>>>>>>> [ParNew (promotion failure size = 1026)  (promotion failed):
>>>>>>> 206521K->206521K(368640K), 0.1667280 secs]
>>>>>>>     3656474K->3696197K(5201920K), 0.1670260 secs] [Times: user=0.48
>>>>>>> sys=0.04, real=0.17 secs]
>>>>>>> 48434.248: [Rescan (parallel) , 0.1972570 secs]48434.445: [weak refs
>>>>>>> processing, 0.0011570 secs]48434.446: [class unloading, 0.0277750
>>>>>>> secs]48434.474: [scrub symbol&        string tables, 0.0088370 secs] [1
>>>>>>> CMS-remark: 3489675K(4833280K)] 36961
>>>>>>> 97K(5201920K), 0.4088040 secs] [Times: user=1.62 sys=0.05, real=0.41
>>>>>>> secs]
>>>>>>> 2012-01-10T18:30:13.197+0100: 48434.490: [CMS-concurrent-sweep-start]
>>>>>>> 2012-01-10T18:30:17.427+0100: 48438.720: [Full GC 48438.720:
>>>>>>> [CMS2012-01-10T18:30:21.636+0100: 48442.929: [CMS-concurrent-sweep:
>>>>>>> 7.949/8.439 secs] [Times: user=15.89 sys=1.57, real=8.44 secs]
>>>>>>>     (concurrent mode failure): 2505348K->334385K(4833280K), 8.6109050
>>>>>>> secs] 2873988K->334385K(5201920K), [CMS Perm :
>>>>>>> 117788K->117762K(262144K)], 8.6112520 secs] [Times: user=8.61
>>>>>>> sys=0.00, real=8.61 secs]
>>>>>>> 2012-01-10T18:30:26.716+0100: 48448.009: [GC 48448.010: [ParNew:
>>>>>>> 327680K->40960K(368640K), 0.0407520 secs] 662065K->394656K(5201920K),
>>>>>>> 0.0411550 secs] [Times: user=0.25 sys=0.00, real=0.04 secs]
>>>>>>> 2012-01-10T18:30:28.825+0100: 48450.118: [GC 48450.118: [ParNew:
>>>>>>> 368639K->40960K(368640K), 0.0662780 secs] 722335K->433355K(5201920K),
>>>>>>> 0.0666190 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
>>>>>>>
>>>>>>> I assume that the large sizes for the promotion failures during ParNew
>>>>>>> are confirming that eliminating large array allocations might help
>>>>>>> here. Do you agree?
>>>>>>> I'm not sure what to make of the concurrent mode failure.
>>>>>>>
>>>>>>> Thanks in advance for any suggestions,
>>>>>>> Taras
>>>>>>>
>>>>>>> On Fri, Jan 6, 2012 at 8:27 AM, Jon Masamitsu<jon.masamitsu at oracle.com>
>>>>>>>      wrote:
>>>>>>>> On 1/5/2012 3:32 PM, Taras Tielkes wrote:
>>>>>>>>> Hi Jon,
>>>>>>>>>
>>>>>>>>> We've enabled the PrintPromotionFailure flag in our preprod
>>>>>>>>> environment, but so far, no failures yet.
>>>>>>>>> We know that the load we generate there is not representative. But
>>>>>>>>> perhaps we'll catch something, given enough patience.
>>>>>>>>>
>>>>>>>>> The flag will also be enabled in our production environment next week
>>>>>>>>> - so one way or the other, we'll get more diagnostic data soon.
>>>>>>>>> I'll also do some allocation profiling of the application in isolation
>>>>>>>>> - I know that there is abusive large byte[] and char[] allocation in
>>>>>>>>> there.
>>>>>>>>>
>>>>>>>>> I've got two questions for now:
>>>>>>>>>
>>>>>>>>> 1) From googling around on the output to expect
>>>>>>>>> (http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html),
>>>>>>>>> I see that -XX:+PrintPromotionFailure will generate output like this:
>>>>>>>>> -------
>>>>>>>>> 592.079: [ParNew (0: promotion failure size = 2698)  (promotion
>>>>>>>>> failed): 135865K->134943K(138240K), 0.1433555 secs]
>>>>>>>>> -------
>>>>>>>>> In that example line, what does the "0" stand for?
>>>>>>>> It's the index of the GC worker thread  that experienced the promotion
>>>>>>>> failure.
>>>>>>>>
>>>>>>>>> 2) Below is a snippet of (real) gc log from our production
>>>>>>>>> application:
>>>>>>>>> -------
>>>>>>>>> 2011-12-30T22:42:12.684+0100: 2136581.585: [GC 2136581.585: [ParNew:
>>>>>>>>> 345951K->40960K(368640K), 0.0676780 secs]
>>>>>>>>> 3608692K->3323692K(5201920K), 0.0680220 secs] [Times: user=0.36
>>>>>>>>> sys=0.01, real=0.06 secs]
>>>>>>>>> 2011-12-30T22:42:22.984+0100: 2136591.886: [GC 2136591.886: [ParNew:
>>>>>>>>> 368640K->40959K(368640K), 0.0618880 secs]
>>>>>>>>> 3651372K->3349928K(5201920K), 0.0622330 secs] [Times: user=0.31
>>>>>>>>> sys=0.00, real=0.06 secs]
>>>>>>>>> 2011-12-30T22:42:23.052+0100: 2136591.954: [GC [1 CMS-initial-mark:
>>>>>>>>> 3308968K(4833280K)] 3350041K(5201920K), 0.0377420 secs] [Times:
>>>>>>>>> user=0.04 sys=0.00, real=0.04 secs]
>>>>>>>>> 2011-12-30T22:42:23.090+0100: 2136591.992: [CMS-concurrent-mark-start]
>>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978: [CMS-concurrent-mark:
>>>>>>>>> 0.986/0.986 secs] [Times: user=2.05 sys=0.04, real=0.99 secs]
>>>>>>>>> 2011-12-30T22:42:24.076+0100: 2136592.978:
>>>>>>>>> [CMS-concurrent-preclean-start]
>>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.000: [CMS-concurrent-preclean:
>>>>>>>>> 0.021/0.023 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
>>>>>>>>> 2011-12-30T22:42:24.099+0100: 2136593.001:
>>>>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>>>>      CMS: abort preclean due to time 2011-12-30T22:42:29.335+0100:
>>>>>>>>> 2136598.236: [CMS-concurrent-abortable-preclean: 5.209/5.236 secs]
>>>>>>>>> [Times: user=5.70 sys=0.23, real=5.23 secs]
>>>>>>>>> 2011-12-30T22:42:29.340+0100: 2136598.242: [GC[YG occupancy: 123870 K
>>>>>>>>> (368640 K)]2011-12-30T22:42:29.341+0100: 2136598.242: [GC 2136598.242:
>>>>>>>>> [ParNew (promotion failed): 123870K->105466K(368640K), 7.4939280 secs]
>>>>>>>>> 3432839K->3423755K(5201920
>>>>>>>>> K), 7.4942670 secs] [Times: user=9.08 sys=2.10, real=7.49 secs]
>>>>>>>>> 2136605.737: [Rescan (parallel) , 0.0644050 secs]2136605.801: [weak
>>>>>>>>> refs processing, 0.0034280 secs]2136605.804: [class unloading,
>>>>>>>>> 0.0289480 secs]2136605.833: [scrub symbol&          string tables,
>>>>>>>>> 0.0093940
>>>>>>>>> secs] [1 CMS-remark: 3318289K(4833280K
>>>>>>>>> )] 3423755K(5201920K), 7.6077990 secs] [Times: user=9.54 sys=2.10,
>>>>>>>>> real=7.61 secs]
>>>>>>>>> 2011-12-30T22:42:36.949+0100: 2136605.850:
>>>>>>>>> [CMS-concurrent-sweep-start]
>>>>>>>>> 2011-12-30T22:42:45.006+0100: 2136613.907: [Full GC 2136613.908:
>>>>>>>>> [CMS2011-12-30T22:42:51.038+0100: 2136619.939: [CMS-concurrent-sweep:
>>>>>>>>> 12.231/14.089 secs] [Times: user=15.14 sys=5.36, real=14.08 secs]
>>>>>>>>>      (concurrent mode failure): 3141235K->291853K(4833280K), 10.2906040
>>>>>>>>> secs] 3491471K->291853K(5201920K), [CMS Perm :
>>>>>>>>> 121784K->121765K(262144K)], 10.2910040 secs] [Times: user=10.29
>>>>>>>>> sys=0.00, real=10.29 secs]
>>>>>>>>> 2011-12-30T22:42:56.281+0100: 2136625.183: [GC 2136625.183: [ParNew:
>>>>>>>>> 327680K->25286K(368640K), 0.0287220 secs] 619533K->317140K(5201920K),
>>>>>>>>> 0.0291610 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
>>>>>>>>> 2011-12-30T22:43:10.516+0100: 2136639.418: [GC 2136639.418: [ParNew:
>>>>>>>>> 352966K->26737K(368640K), 0.0586400 secs] 644820K->338758K(5201920K),
>>>>>>>>> 0.0589640 secs] [Times: user=0.31 sys=0.00, real=0.06 secs]
>>>>>>>>> -------
>>>>>>>>>
>>>>>>>>> In this case I don't know how to interpret the output.
>>>>>>>>> a) There's a promotion failure that took 7.49 secs
>>>>>>>> This is the time it took to attempt the minor collection (ParNew) and
>>>>>>>> to
>>>>>>>> do recovery
>>>>>>>> from the failure.
>>>>>>>>
>>>>>>>>> b) There's a full GC that took 14.08 secs
>>>>>>>>> c) There's a concurrent mode failure that took 10.29 secs
>>>>>>>> Not sure about b) and c) because the output is mixed up with the
>>>>>>>> concurrent-sweep
>>>>>>>> output but  I think the "concurrent mode failure" message is part of
>>>>>>>> the
>>>>>>>> "Full GC"
>>>>>>>> message.  My guess is that the 10.29 is the time for the Full GC and
>>>>>>>> the
>>>>>>>> 14.08
>>>>>>>> maybe is part of the concurrent-sweep message.  Really hard to be sure.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>> How are these events, and their (real) times related to each other?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Taras
>>>>>>>>>
>>>>>>>>> On Tue, Dec 27, 2011 at 6:13 PM, Jon
>>>>>>>>> Masamitsu<jon.masamitsu at oracle.com>          wrote:
>>>>>>>>>> Taras,
>>>>>>>>>>
>>>>>>>>>> PrintPromotionFailure seems like it would go a long
>>>>>>>>>> way to identify the root of your promotion failures (or
>>>>>>>>>> at least eliminating some possible causes).    I think it
>>>>>>>>>> would help focus the discussion if you could send
>>>>>>>>>> the result of that experiment early.
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On 12/27/2011 5:07 AM, Taras Tielkes wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We're running an application with the CMS/ParNew collectors that is
>>>>>>>>>>> experiencing occasional promotion failures.
>>>>>>>>>>> Environment is Linux 2.6.18 (x64), JVM is 1.6.0_29 in server mode.
>>>>>>>>>>> I've listed the specific JVM options used below (a).
>>>>>>>>>>>
>>>>>>>>>>> The application is deployed across a handful of machines, and the
>>>>>>>>>>> promotion failures are fairly uniform across those.
>>>>>>>>>>>
>>>>>>>>>>> The first kind of failure we observe is a promotion failure during
>>>>>>>>>>> ParNew collection, I've included a snipped from the gc log below
>>>>>>>>>>> (b).
>>>>>>>>>>> The second kind of failure is a concurrrent mode failure (perhaps
>>>>>>>>>>> triggered by the same cause), see (c) below.
>>>>>>>>>>> The frequency (after running for a some weeks) is approximately once
>>>>>>>>>>> per day. This is bearable, but obviously we'd like to improve on
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Apart from high-volume request handling (which allocates a lot of
>>>>>>>>>>> small objects), the application also runs a few dozen background
>>>>>>>>>>> threads that download and process XML documents, typically in the
>>>>>>>>>>> 5-30
>>>>>>>>>>> MB range.
>>>>>>>>>>> A known deficiency in the existing code is that the XML content is
>>>>>>>>>>> copied twice before processing (once to a byte[], and later again to
>>>>>>>>>>> a
>>>>>>>>>>> String/char[]).
>>>>>>>>>>> Given that a 30 MB XML stream will result in a 60 MB
>>>>>>>>>>> java.lang.String/char[], my suspicion is that these big array
>>>>>>>>>>> allocations are causing us to run into the CMS fragmentation issue.
>>>>>>>>>>>
>>>>>>>>>>> My questions are:
>>>>>>>>>>> 1) Does the data from the GC logs provide sufficient evidence to
>>>>>>>>>>> conclude that CMS fragmentation is the cause of the promotion
>>>>>>>>>>> failure?
>>>>>>>>>>> 2) If not, what's the next step of investigating the cause?
>>>>>>>>>>> 3) We're planning to at least add -XX:+PrintPromotionFailure to get
>>>>>>>>>>> a
>>>>>>>>>>> feeling for the size of the objects that fail promotion.
>>>>>>>>>>> Overall, it seem that -XX:PrintFLSStatistics=1 is actually the only
>>>>>>>>>>> reliable approach to diagnose CMS fragmentation. Is this indeed the
>>>>>>>>>>> case?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Taras
>>>>>>>>>>>
>>>>>>>>>>> a) Current JVM options:
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> -server
>>>>>>>>>>> -Xms5g
>>>>>>>>>>> -Xmx5g
>>>>>>>>>>> -Xmn400m
>>>>>>>>>>> -XX:PermSize=256m
>>>>>>>>>>> -XX:MaxPermSize=256m
>>>>>>>>>>> -XX:+PrintGCTimeStamps
>>>>>>>>>>> -verbose:gc
>>>>>>>>>>> -XX:+PrintGCDateStamps
>>>>>>>>>>> -XX:+PrintGCDetails
>>>>>>>>>>> -XX:SurvivorRatio=8
>>>>>>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>>>>>> -XX:+UseParNewGC
>>>>>>>>>>> -XX:+DisableExplicitGC
>>>>>>>>>>> -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>>>>>> -XX:+CMSClassUnloadingEnabled
>>>>>>>>>>> -XX:+CMSScavengeBeforeRemark
>>>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=68
>>>>>>>>>>> -Xloggc:gc.log
>>>>>>>>>>> --------------------------------
>>>>>>>>>>>
>>>>>>>>>>> b) Promotion failure during ParNew
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> 2011-12-08T18:14:40.966+0100: 219729.868: [GC 219729.868: [ParNew:
>>>>>>>>>>> 368640K->40959K(368640K), 0.0693460 secs]
>>>>>>>>>>> 3504917K->3195098K(5201920K), 0.0696500 secs] [Times: user=0.39
>>>>>>>>>>> sys=0.01, real=0.07 secs]
>>>>>>>>>>> 2011-12-08T18:14:43.778+0100: 219732.679: [GC 219732.679: [ParNew:
>>>>>>>>>>> 368639K->31321K(368640K), 0.0511400 secs]
>>>>>>>>>>> 3522778K->3198316K(5201920K), 0.0514420 secs] [Times: user=0.28
>>>>>>>>>>> sys=0.00, real=0.05 secs]
>>>>>>>>>>> 2011-12-08T18:14:46.945+0100: 219735.846: [GC 219735.846: [ParNew:
>>>>>>>>>>> 359001K->18694K(368640K), 0.0272970 secs]
>>>>>>>>>>> 3525996K->3185690K(5201920K), 0.0276080 secs] [Times: user=0.19
>>>>>>>>>>> sys=0.00, real=0.03 secs]
>>>>>>>>>>> 2011-12-08T18:14:49.036+0100: 219737.938: [GC 219737.938: [ParNew
>>>>>>>>>>> (promotion failed): 338813K->361078K(368640K), 0.1321200
>>>>>>>>>>> secs]219738.070: [CMS: 3167747K->434291K(4833280K), 4.8881570 secs]
>>>>>>>>>>> 3505808K->434291K
>>>>>>>>>>> (5201920K), [CMS Perm : 116893K->116883K(262144K)], 5.0206620 secs]
>>>>>>>>>>> [Times: user=5.24 sys=0.00, real=5.02 secs]
>>>>>>>>>>> 2011-12-08T18:14:54.721+0100: 219743.622: [GC 219743.623: [ParNew:
>>>>>>>>>>> 327680K->40960K(368640K), 0.0949460 secs]
>>>>>>>>>>> 761971K->514584K(5201920K),
>>>>>>>>>>> 0.0952820 secs] [Times: user=0.52 sys=0.04, real=0.10 secs]
>>>>>>>>>>> 2011-12-08T18:14:55.580+0100: 219744.481: [GC 219744.482: [ParNew:
>>>>>>>>>>> 368640K->40960K(368640K), 0.1299190 secs]
>>>>>>>>>>> 842264K->625681K(5201920K),
>>>>>>>>>>> 0.1302190 secs] [Times: user=0.72 sys=0.01, real=0.13 secs]
>>>>>>>>>>> 2011-12-08T18:14:58.050+0100: 219746.952: [GC 219746.952: [ParNew:
>>>>>>>>>>> 368640K->40960K(368640K), 0.0870940 secs]
>>>>>>>>>>> 953361K->684121K(5201920K),
>>>>>>>>>>> 0.0874110 secs] [Times: user=0.48 sys=0.01, real=0.09 secs]
>>>>>>>>>>> --------------------------------
>>>>>>>>>>>
>>>>>>>>>>> c) Promotion failure during CMS
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> 2011-12-14T08:29:26.628+0100: 703015.530: [GC 703015.530: [ParNew:
>>>>>>>>>>> 357228K->40960K(368640K), 0.0525110 secs]
>>>>>>>>>>> 3603068K->3312743K(5201920K), 0.0528120 secs] [Times: user=0.37
>>>>>>>>>>> sys=0.00, real=0.05 secs]
>>>>>>>>>>> 2011-12-14T08:29:28.864+0100: 703017.766: [GC 703017.766: [ParNew:
>>>>>>>>>>> 366075K->37119K(368640K), 0.0479780 secs]
>>>>>>>>>>> 3637859K->3317662K(5201920K), 0.0483090 secs] [Times: user=0.24
>>>>>>>>>>> sys=0.01, real=0.05 secs]
>>>>>>>>>>> 2011-12-14T08:29:29.553+0100: 703018.454: [GC 703018.455: [ParNew:
>>>>>>>>>>> 364792K->40960K(368640K), 0.0421740 secs]
>>>>>>>>>>> 3645334K->3334944K(5201920K), 0.0424810 secs] [Times: user=0.30
>>>>>>>>>>> sys=0.00, real=0.04 secs]
>>>>>>>>>>> 2011-12-14T08:29:29.600+0100: 703018.502: [GC [1 CMS-initial-mark:
>>>>>>>>>>> 3293984K(4833280K)] 3335025K(5201920K), 0.0272490 secs] [Times:
>>>>>>>>>>> user=0.02 sys=0.00, real=0.03 secs]
>>>>>>>>>>> 2011-12-14T08:29:29.628+0100: 703018.529:
>>>>>>>>>>> [CMS-concurrent-mark-start]
>>>>>>>>>>> 2011-12-14T08:29:30.718+0100: 703019.620: [GC 703019.620: [ParNew:
>>>>>>>>>>> 368640K->40960K(368640K), 0.0836690 secs]
>>>>>>>>>>> 3662624K->3386039K(5201920K), 0.0839690 secs] [Times: user=0.50
>>>>>>>>>>> sys=0.01, real=0.08 secs]
>>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729: [CMS-concurrent-mark:
>>>>>>>>>>> 1.108/1.200 secs] [Times: user=6.83 sys=0.23, real=1.20 secs]
>>>>>>>>>>> 2011-12-14T08:29:30.827+0100: 703019.729:
>>>>>>>>>>> [CMS-concurrent-preclean-start]
>>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840: [CMS-concurrent-preclean:
>>>>>>>>>>> 0.093/0.111 secs] [Times: user=0.48 sys=0.02, real=0.11 secs]
>>>>>>>>>>> 2011-12-14T08:29:30.938+0100: 703019.840:
>>>>>>>>>>> [CMS-concurrent-abortable-preclean-start]
>>>>>>>>>>> 2011-12-14T08:29:32.337+0100: 703021.239:
>>>>>>>>>>> [CMS-concurrent-abortable-preclean: 1.383/1.399 secs] [Times:
>>>>>>>>>>> user=6.68 sys=0.27, real=1.40 secs]
>>>>>>>>>>> 2011-12-14T08:29:32.343+0100: 703021.244: [GC[YG occupancy: 347750 K
>>>>>>>>>>> (368640 K)]2011-12-14T08:29:32.343+0100: 703021.244: [GC 703021.244:
>>>>>>>>>>> [ParNew (promotion failed): 347750K->347750K(368640K), 9.8729020
>>>>>>>>>>> secs]
>>>>>>>>>>>       3692829K->3718580K(5201920K), 9.8732380 secs] [Times: user=12.00
>>>>>>>>>>> sys=2.58, real=9.88 secs]
>>>>>>>>>>> 703031.118: [Rescan (parallel) , 0.2826110 secs]703031.400: [weak
>>>>>>>>>>> refs
>>>>>>>>>>> processing, 0.0014780 secs]703031.402: [class unloading, 0.0176610
>>>>>>>>>>> secs]703031.419: [scrub symbol&            string tables, 0.0094960
>>>>>>>>>>> secs] [1 CMS
>>>>>>>>>>> -remark: 3370830K(4833280K)] 3718580K(5201920K), 10.1916910 secs]
>>>>>>>>>>> [Times: user=13.73 sys=2.59, real=10.19 secs]
>>>>>>>>>>> 2011-12-14T08:29:42.535+0100: 703031.436:
>>>>>>>>>>> [CMS-concurrent-sweep-start]
>>>>>>>>>>> 2011-12-14T08:29:42.591+0100: 703031.493: [Full GC 703031.493:
>>>>>>>>>>> [CMS2011-12-14T08:29:48.616+0100: 703037.518: [CMS-concurrent-sweep:
>>>>>>>>>>> 6.046/6.082 secs] [Times: user=6.18 sys=0.01, real=6.09 secs]
>>>>>>>>>>>       (concurrent mode failure): 3370829K->433437K(4833280K),
>>>>>>>>>>> 10.9594300
>>>>>>>>>>> secs] 3739469K->433437K(5201920K), [CMS Perm :
>>>>>>>>>>> 121702K->121690K(262144K)], 10.9597540 secs] [Times: user=10.95
>>>>>>>>>>> sys=0.00, real=10.96 secs]
>>>>>>>>>>> 2011-12-14T08:29:53.997+0100: 703042.899: [GC 703042.899: [ParNew:
>>>>>>>>>>> 327680K->40960K(368640K), 0.0799960 secs]
>>>>>>>>>>> 761117K->517836K(5201920K),
>>>>>>>>>>> 0.0804100 secs] [Times: user=0.46 sys=0.00, real=0.08 secs]
>>>>>>>>>>> 2011-12-14T08:29:54.649+0100: 703043.551: [GC 703043.551: [ParNew:
>>>>>>>>>>> 368640K->40960K(368640K), 0.0784460 secs]
>>>>>>>>>>> 845516K->557872K(5201920K),
>>>>>>>>>>> 0.0787920 secs] [Times: user=0.40 sys=0.01, real=0.08 secs]
>>>>>>>>>>> 2011-12-14T08:29:56.418+0100: 703045.320: [GC 703045.320: [ParNew:
>>>>>>>>>>> 368640K->40960K(368640K), 0.0784040 secs]
>>>>>>>>>>> 885552K->603017K(5201920K),
>>>>>>>>>>> 0.0787630 secs] [Times: user=0.41 sys=0.01, real=0.07 secs]
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> hotspot-gc-use mailing list
>>>>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>>>>> _______________________________________________
>>>>>>>>>> hotspot-gc-use mailing list
>>>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>>>> _______________________________________________
>>>>>>>>> hotspot-gc-use mailing list
>>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>>> _______________________________________________
>>>>>>>> hotspot-gc-use mailing list
>>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>>> _______________________________________________
>>>>>>> hotspot-gc-use mailing list
>>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>> _______________________________________________
>>>>>> hotspot-gc-use mailing list
>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use