Promotion failures: indication of CMS fragmentation?

Srinivas Ramakrishna ysr1729 at gmail.com
Tue Mar 20 14:12:23 PDT 2012


As Chi Ho noted, about 3-4 MB of data does get promoted per scavenge,
after having
sloshed around in your survivor spaces some 15 times. I'd venture that
whatever winnowing
of young objects was to occur has in fact occurred already within the
first 3-4 scavenges that
an object has survived, after which the drop-off in population is less
sharp. So I'd suggest
lowering the MTT to about 3, while leaving the survivor ratio intact.
That should reduce your
copying costs and bring down your scavenge pauses further, while not
adversely affecting
your promotion rates (and concomitantly the fragmentation).
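
For concreteness, the change I have in mind would look something like the
following on the larger-new-gen node (a sketch only -- I'm assuming the
-Xmx5g -Xmn800m -XX:SurvivorRatio=4 baseline from Taras's mail below and
just adding the tenuring cap; the other flags stay as they are):

  -Xmx5g -Xmn800m -XX:SurvivorRatio=4 \
  -XX:MaxTenuringThreshold=3 -XX:+PrintTenuringDistribution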

One thing that was a bit puzzling about the stats below: you'd expect
the volume of age X in scavenge N to be no less than the volume of
age X+1 in scavenge N+1 (a cohort can only shrink as it ages), but
occasionally that natural invariant does not appear to hold --
indicating perhaps that either ages or populations are not being
correctly tracked.
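
To make that check concrete with numbers from the first log fragment
below: at the 18:32:17 scavenge, age 1 holds 2927728 bytes, and at the
next scavenge (18:32:21) age 2 holds 2174384 bytes, which is indeed no
larger -- the puzzling cases are the occasional pairs where the older
cohort at the later scavenge comes out larger than its predecessor.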

I don't know if anyone else has noticed that in their tenuring
distributions as well....

-- ramki

On Tue, Mar 20, 2012 at 9:36 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
> Hi,
>
> I've collected -XX:+PrintTenuringDistribution data from a node in our
> production environment, running -Xmx5g -Xmn400m -XX:SurvivorRatio=8.
> On one other production node, we've configured a larger new gen, and
> larger survivor spaces (-Xmx5g -Xmn800m -XX:SurvivorRatio=4).
> This node has -XX:+PrintTenuringDistribution logging as well.
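>
> (For reference, if my arithmetic is right: with SurvivorRatio=8 the
> 400m new gen works out to roughly 320m Eden plus 40m per survivor
> space, and with SurvivorRatio=4 the 800m new gen works out to roughly
> 533m Eden plus 133m per survivor space. The "Desired survivor size"
> values of ~20m and ~66m in the logs below would then be 50% of a
> survivor space, i.e. the default TargetSurvivorRatio.)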
>
> The node running the larger new gen and survivor spaces has not run
> into a promotion failure yet, while the ones still running the old
> config have hit a few.
> The promotion failures are typically experienced during periods of high
> load, which makes sense, as allocation and promotion spike in those
> periods as well.
>
> The inherent nature of the application implies relatively long
> sessions (up to a few hours), retaining a fair amount of state for up
> to an hour.
> I believe this is the main reason for the relatively high promotion
> rate we're experiencing.
>
>
> Here's a fragment of gc log from one of the nodes running the older
> (smaller) new gen, including a promotion failure:
> -------------------------
> 2012-03-15T18:32:17.785+0100: 796604.225: [GC 796604.225: [ParNew
> Desired survivor size 20971520 bytes, new threshold 8 (max 15)
> - age   1:    2927728 bytes,    2927728 total
> - age   2:    2428512 bytes,    5356240 total
> - age   3:    2696376 bytes,    8052616 total
> - age   4:    2623576 bytes,   10676192 total
> - age   5:    3365576 bytes,   14041768 total
> - age   6:    2792272 bytes,   16834040 total
> - age   7:    2233008 bytes,   19067048 total
> - age   8:    2263824 bytes,   21330872 total
> : 358709K->29362K(368640K), 0.0461460 secs]
> 3479492K->3151874K(5201920K), 0.0467320 secs] [Times: user=0.34
> sys=0.01, real=0.05 secs]
> 2012-03-15T18:32:21.546+0100: 796607.986: [GC 796607.986: [ParNew (0:
> promotion failure size = 25)  (1: promotion failure size = 25)  (2:
> promotion failure size = 25)  (3: promotion failure size = 25)  (4:
> promotion failure size = 25)  (5
> : promotion failure size = 25)  (6: promotion failure size = 341)  (7:
> promotion failure size = 25)  (promotion failed)
> Desired survivor size 20971520 bytes, new threshold 8 (max 15)
> - age   1:    3708208 bytes,    3708208 total
> - age   2:    2174384 bytes,    5882592 total
> - age   3:    2383256 bytes,    8265848 total
> - age   4:    2689912 bytes,   10955760 total
> - age   5:    2621832 bytes,   13577592 total
> - age   6:    3360440 bytes,   16938032 total
> - age   7:    2784136 bytes,   19722168 total
> - age   8:    2220232 bytes,   21942400 total
> : 357042K->356456K(368640K), 0.2734100 secs]796608.259: [CMS:
> 3124189K->516640K(4833280K), 6.8127070 secs]
> 3479554K->516640K(5201920K), [CMS Perm : 142423K->142284K(262144K)],
> 7.0867850 secs] [Times: user=7.32 sys=0.07, real=7.09 secs]
> 2012-03-15T18:32:30.279+0100: 796616.719: [GC 796616.720: [ParNew
> Desired survivor size 20971520 bytes, new threshold 1 (max 15)
> - age   1:   29721456 bytes,   29721456 total
> : 327680K->40960K(368640K), 0.0403130 secs]
> 844320K->557862K(5201920K), 0.0409070 secs] [Times: user=0.27
> sys=0.01, real=0.04 secs]
> 2012-03-15T18:32:32.701+0100: 796619.141: [GC 796619.141: [ParNew
> Desired survivor size 20971520 bytes, new threshold 15 (max 15)
> - age   1:   10310176 bytes,   10310176 total
> -------------------------
>
> For contrast, here's a gc log fragment from the single node running
> the larger new gen and larger survivor spaces:
> (the fragment is from the same point in time, with the nodes
> experiencing equal load)
> -------------------------
> 2012-03-15T18:32:12.067+0100: 797119.336: [GC 797119.336: [ParNew
> Desired survivor size 69894144 bytes, new threshold 15 (max 15)
> - age   1:    5611536 bytes,    5611536 total
> - age   2:    3731888 bytes,    9343424 total
> - age   3:    3450672 bytes,   12794096 total
> - age   4:    3314744 bytes,   16108840 total
> - age   5:    3459888 bytes,   19568728 total
> - age   6:    3334712 bytes,   22903440 total
> - age   7:    3671960 bytes,   26575400 total
> - age   8:    3841608 bytes,   30417008 total
> - age   9:    2035392 bytes,   32452400 total
> - age  10:    1975056 bytes,   34427456 total
> - age  11:    2021344 bytes,   36448800 total
> - age  12:    1520752 bytes,   37969552 total
> - age  13:    1494176 bytes,   39463728 total
> - age  14:    2355136 bytes,   41818864 total
> - age  15:    1279000 bytes,   43097864 total
> : 603473K->61640K(682688K), 0.0756570 secs]
> 3373284K->2832383K(5106368K), 0.0762090 secs] [Times: user=0.56
> sys=0.00, real=0.08 secs]
> 2012-03-15T18:32:18.200+0100: 797125.468: [GC 797125.469: [ParNew
> Desired survivor size 69894144 bytes, new threshold 15 (max 15)
> - age   1:    6101320 bytes,    6101320 total
> - age   2:    4446776 bytes,   10548096 total
> - age   3:    3701384 bytes,   14249480 total
> - age   4:    3438488 bytes,   17687968 total
> - age   5:    3295360 bytes,   20983328 total
> - age   6:    3403320 bytes,   24386648 total
> - age   7:    3323368 bytes,   27710016 total
> - age   8:    3665760 bytes,   31375776 total
> - age   9:    2427904 bytes,   33803680 total
> - age  10:    1418656 bytes,   35222336 total
> - age  11:    1955192 bytes,   37177528 total
> - age  12:    2006064 bytes,   39183592 total
> - age  13:    1520768 bytes,   40704360 total
> - age  14:    1493728 bytes,   42198088 total
> - age  15:    2354376 bytes,   44552464 total
> : 607816K->62650K(682688K), 0.0779270 secs]
> 3378559K->2834643K(5106368K), 0.0784690 secs] [Times: user=0.58
> sys=0.00, real=0.08 secs]
> -------------------------
>
> Questions:
>
> 1) From the tenuring distributions, it seems that the application
> benefits from larger new gen and survivor spaces.
> The next thing we'll try is to run with -Xmn1g -XX:SurvivorRatio=2,
> and see if the ParNew times are still acceptable.
> Does this seem a sensible approach in this context?
> Are there other variables beyond ParNew times that limit scaling the
> new gen to a large size?
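>
> (If my arithmetic is right, -Xmn1g with SurvivorRatio=2 would work out
> to roughly 512m Eden and 256m per survivor space, i.e. over 6x the
> survivor capacity of the current smaller-new-gen config.)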
>
> 2) Given the object age demographics inherent to our application, we
> cannot expect to see the majority of data get collected in the new
> gen.
>
> Our approach to fighting the promotion failures consists of three aspects:
> a) Lower the overall allocation rate of our application (by improving
> wasteful hotspots), to decrease overall ParNew collection frequency.
> b) Configure the new gen and survivor spaces as large as possible,
> keeping an eye on ParNew times and overall new/tenured ratio.
> c) Try to refactor the data structures that form the bulk of promoted
> data, to retain only the strictly required subgraphs.
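>
> To illustrate (c), a purely hypothetical sketch -- the class and field
> names below are invented for illustration, they are not from our actual
> codebase. The idea is to copy the few fields we actually need into a
> small immutable holder instead of keeping a reference into the full
> session graph, so the rest of the graph can still die young:
>
>     // Hypothetical illustration only: names are invented.
>     final class Session {                   // stand-in for the real session type
>         String getUserId() { return "user-42"; }
>         long getLastAccessMillis() { return System.currentTimeMillis(); }
>     }
>
>     // Trimmed, immutable snapshot that is retained instead of the Session.
>     final class SessionSummary {
>         private final String userId;
>         private final long lastAccessMillis;
>
>         SessionSummary(String userId, long lastAccessMillis) {
>             this.userId = userId;
>             this.lastAccessMillis = lastAccessMillis;
>         }
>
>         static SessionSummary from(Session s) {
>             // Copy only what is needed; do not hold on to the Session itself.
>             return new SessionSummary(s.getUserId(), s.getLastAccessMillis());
>         }
>     }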
>
> Is there anything else I can try or measure, in order to better
> understand the problem?
>
> Thanks in advance,
> Taras
>
>
> On Wed, Feb 22, 2012 at 10:51 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
>> (this time properly responding to the list alias)
>> Hi Srinivas,
>>
>> We're running 1.6.0 u29 on Linux x64. My understanding is that
>> CompressedOops is enabled by default since u23.
>>
>> At least this page seems to support that:
>> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
>>
>> Regarding the other remarks (also from Todd and Chi), I'll comment
>> later. The first thing on my list is to collect
>> PrintTenuringDistribution data now.
>>
>> Kind regards,
>> Taras
>>
>>> On Wed, Feb 22, 2012 at 12:40 AM, Srinivas Ramakrishna
>>> <ysr1729 at gmail.com> wrote:
>>>> I agree that premature promotions are almost always the first and most
>>>> important thing to fix when running into fragmentation or overload issues
>>>> with CMS. However, I can also imagine long-lived objects with a highly
>>>> non-stationary size distribution causing problems for CMS despite best
>>>> efforts to tune against premature promotion.
>>>>
>>>> I didn't think Taras was running with MTT=0, although MTT > 0 is no recipe
>>>> for avoiding premature promotion with bursty loads that can overflow the
>>>> survivor spaces. As you say, large survivor spaces with a low
>>>> TargetSurvivorRatio -- so as to leave plenty of space to absorb/accommodate
>>>> spiking/bursty loads -- are definitely a "best practice" for CMS (and
>>>> possibly for other concurrent collectors as well).
>>>>
>>>> One thing Taras can do to see if premature promotion might be an issue is to
>>>> look at the tenuring
>>>> threshold in his case. A rough proxy (if PrintTenuringDistribution is not
>>>> enabled) is to look at the
>>>> promotion volume per scavenge. It may be possible, if premature promotion is
>>>> a cause, to see
>>>> some kind of medium-term correlation between high promotion volume and
>>>> eventual promotion
>>>> failure despite frequent CMS collections.
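>>>>
>>>> As a concrete recipe (if I'm reading the log format right): the promotion
>>>> volume of a scavenge is roughly (drop in young gen occupancy) minus (drop in
>>>> total heap occupancy) on the ParNew line. For example, for the 09:13:51
>>>> scavenge quoted further down in this thread, that gives
>>>> (359895K - 29357K) - (3471021K - 3143476K) = 2993K, i.e. roughly 3 MB
>>>> promoted by that scavenge.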
>>>>
>>>> One other point which may or may not be relevant. I see that Taras is not
>>>> using CompressedOops...
>>>> Using that alone would greatly decrease memory pressure and provide more
>>>> breathing room to CMS,
>>>> which is also almost always a good idea.
>>>>
>>>> -- ramki
>>>>
>>>> On Tue, Feb 21, 2012 at 10:16 AM, Chi Ho Kwok <chkwok at digibites.nl> wrote:
>>>>>
>>>>> Hi Taras,
>>>>>
>>>>> I think you may want to look into sizing the new and especially the
>>>>> survivor spaces differently. We run something similar to what you described,
>>>>> high volume request processing with large dataset loading, and what we've
>>>>> seen at the start is that the survivor spaces are completely overloaded,
>>>>> causing premature promotions.
>>>>>
>>>>> We've configured our VM with the following goals/guidelines:
>>>>>
>>>>> - old space is for semi-permanent data, living for at least 30s, average
>>>>> ~10 minutes
>>>>> - new space contains only temporary and just-loaded data
>>>>> - surviving objects from new should never reach old in 1 GC, so the
>>>>> survivor space may never be 100% full
>>>>>
>>>>> With jstat -gcutil `pidof java` 2000, we see things like:
>>>>>
>>>>>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>>>>>  70.20   0.00  19.65  57.60  59.90 124808 29474.299  2498  191.110 29665.409
>>>>>  70.20   0.00  92.89  57.60  59.90 124808 29474.299  2498  191.110 29665.409
>>>>>  70.20   0.00  93.47  57.60  59.90 124808 29474.299  2498  191.110 29665.409
>>>>>   0.00  65.69  78.07  58.09  59.90 124809 29474.526  2498  191.110 29665.636
>>>>>  84.97   0.00  48.19  58.57  59.90 124810 29474.774  2498  191.110 29665.884
>>>>>  84.97   0.00  81.30  58.57  59.90 124810 29474.774  2498  191.110 29665.884
>>>>>   0.00  62.64  27.22  59.12  59.90 124811 29474.992  2498  191.110 29666.102
>>>>>   0.00  62.64  54.47  59.12  59.90 124811 29474.992  2498  191.110 29666.102
>>>>>  75.68   0.00   6.80  59.53  59.90 124812 29475.228  2498  191.110 29666.338
>>>>>  75.68   0.00  23.38  59.53  59.90 124812 29475.228  2498  191.110 29666.338
>>>>>  75.68   0.00  27.72  59.53  59.90 124812 29475.228  2498  191.110 29666.338
>>>>>
>>>>> If you follow the lines, you can see Eden fill up to 100% on line 4,
>>>>> surviving objects get copied into S1, S0 is collected, and 0.49% is added
>>>>> to Old. On line 5, another GC happened, with Eden->S0, S1->Old, etc. No
>>>>> objects are ever transferred from Eden to Old, unless there's a huge peak
>>>>> of requests.
>>>>>
>>>>> This is with a 32GB heap, Xmn1200M, SurvivorRatio 2 (600MB Eden, 300MB
>>>>> S0, 300MB S1), MaxTenuringThreshold 1 (whatever is still alive in S0/1 at
>>>>> the second GC is copied to old; don't wait, web requests are quite bursty).
>>>>> With about 1 collection every 2-5 seconds, objects promoted to Old must live
>>>>> for at least 4-10 seconds; as that's longer than an average request (50ms-1s),
>>>>> none of the temporary data ever makes it into Old, which is much more
>>>>> expensive to collect. It works even with a higher than default
>>>>> CMSInitiatingOccupancyFraction=76 to optimize for space available for the
>>>>> large data cache we have.
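>>>>>
>>>>> In flag form that's roughly the following (a sketch from memory, not a
>>>>> copy-paste of our startup script, and leaving out the unrelated flags):
>>>>>
>>>>>     -Xmx32g -Xmn1200m -XX:SurvivorRatio=2 -XX:MaxTenuringThreshold=1
>>>>>     -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=76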
>>>>>
>>>>>
>>>>> With your config of 400MB total new, i.e. roughly 320MB Eden, 40MB S0,
>>>>> 40MB S1 (SurvivorRatio 8), and no explicit tenuring threshold, I think loads
>>>>> of new objects get copied from Eden to Old directly, causing trouble for the
>>>>> CMS. You can use jstat to get live stats and tweak until it doesn't happen.
>>>>> If you can't make changes on live that easily, try doubling the new size
>>>>> indeed, with a 400 Eden, 200 S0, 200 S1 and MaxTenuringThreshold 1 setting.
>>>>> It's probably overkill, but it should solve the problem if it is caused by
>>>>> premature promotion.
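>>>>>
>>>>> (In flag terms that suggestion would be something like -Xmn800m
>>>>> -XX:SurvivorRatio=2 -XX:MaxTenuringThreshold=1, if I have the arithmetic
>>>>> right: 800m split 2:1:1 gives 400m Eden and 200m per survivor space.)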
>>>>>
>>>>>
>>>>> Chi Ho Kwok
>>>>>
>>>>>
>>>>> On Tue, Feb 21, 2012 at 5:55 PM, Taras Tielkes <taras.tielkes at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We've removed the "-XX:+CMSScavengeBeforeRemark" setting from 50% of
>>>>>> our production nodes.
>>>>>> After running for a few weeks, it seems that there's no impact from
>>>>>> removing this option.
>>>>>> Which is good, since it seems we can remove it from the other nodes as
>>>>>> well, simplifying our overall JVM configuration ;-)
>>>>>>
>>>>>> However, we're still seeing promotion failures on all nodes, once
>>>>>> every day or so.
>>>>>>
>>>>>> There's still the "Magic 1026": this accounts for ~60% of the
>>>>>> promotion failures that we're seeing (single ParNew thread,
>>>>>> failure size 1026):
>>>>>> --------------------
>>>>>> 2012-02-06T09:13:51.806+0100: 328095.085: [GC 328095.086: [ParNew:
>>>>>> 359895K->29357K(368640K), 0.0429070 secs]
>>>>>> 3471021K->3143476K(5201920K), 0.0434950 secs] [Times: user=0.32
>>>>>> sys=0.00, real=0.04 secs]
>>>>>> 2012-02-06T09:13:55.922+0100: 328099.201: [GC 328099.201: [ParNew:
>>>>>> 357037K->31817K(368640K), 0.0429130 secs]
>>>>>> 3471156K->3148946K(5201920K), 0.0434930 secs] [Times: user=0.31
>>>>>> sys=0.00, real=0.04 secs]
>>>>>> 2012-02-06T09:13:59.044+0100: 328102.324: [GC 328102.324: [ParNew
>>>>>> (promotion failure size = 1026)  (promotion failed):
>>>>>> 359497K->368640K(368640K), 0.2226790 secs]328102.547: [CMS:
>>>>>> 3125609K->451515K(4833280K), 5.6225880 secs] 3476626K->451515K(5201920K),
>>>>>> [CMS Perm : 124373K->124353K(262144K)], 5.8459380 secs]
>>>>>> [Times: user=6.20 sys=0.01, real=5.85 secs]
>>>>>> 2012-02-06T09:14:05.243+0100: 328108.522: [GC 328108.523: [ParNew:
>>>>>> 327680K->40960K(368640K), 0.0319160 secs] 779195K->497658K(5201920K),
>>>>>> 0.0325360 secs] [Times: user=0.21 sys=0.01, real=0.03 secs]
>>>>>> 2012-02-06T09:14:07.836+0100: 328111.116: [GC 328111.116: [ParNew:
>>>>>> 368640K->32785K(368640K), 0.0744670 secs] 825338K->520234K(5201920K),
>>>>>> 0.0750390 secs] [Times: user=0.40 sys=0.02, real=0.08 secs]
>>>>>> --------------------
>>>>>> Given the 1026 word size, I'm wondering if I should be hunting for an
>>>>>> overuse of BufferedInputStream/BufferedOutputStream, since both have
>>>>>> 8192 as a default buffer size.
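>>>>>>
>>>>>> (If my arithmetic is right, and assuming CompressedOops is on, 1026 heap
>>>>>> words on a 64-bit JVM is 8208 bytes, which is exactly what a byte[8192]
>>>>>> occupies: 8192 bytes of payload plus a 16-byte array header. So the default
>>>>>> 8k buffers would fit the observed failure size.)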
>>>>>>
>>>>>> The second group of promotion failures look like this (multiple ParNew
>>>>>> threads, small failure sizes):
>>>>>> --------------------
>>>>>> 2012-02-06T09:50:15.773+0100: 328756.964: [GC 328756.964: [ParNew:
>>>>>> 356116K->29934K(368640K), 0.0461100 secs]
>>>>>> 3203863K->2880162K(5201920K), 0.0468870 secs] [Times: user=0.34
>>>>>> sys=0.01, real=0.05 secs]
>>>>>> 2012-02-06T09:50:19.153+0100: 328760.344: [GC 328760.344: [ParNew:
>>>>>> 357614K->30359K(368640K), 0.0454680 secs]
>>>>>> 3207842K->2882892K(5201920K), 0.0462280 secs] [Times: user=0.33
>>>>>> sys=0.01, real=0.05 secs]
>>>>>> 2012-02-06T09:50:22.658+0100: 328763.849: [GC 328763.849: [ParNew (1:
>>>>>> promotion failure size = 25)  (4: promotion failure size = 25)  (6:
>>>>>> promotion failure size = 25)  (7: promotion failure size = 144)
>>>>>> (promotion failed): 358039K->358358
>>>>>> K(368640K), 0.2148680 secs]328764.064: [CMS:
>>>>>> 2854709K->446750K(4833280K), 5.8368270 secs]
>>>>>> 3210572K->446750K(5201920K), [CMS Perm : 124670K->124644K(262144K)],
>>>>>> 6.0525230 secs] [Times: user=6.32 sys=0.00, real=6.05 secs]
>>>>>> 2012-02-06T09:50:29.896+0100: 328771.086: [GC 328771.087: [ParNew:
>>>>>> 327680K->22569K(368640K), 0.0227080 secs] 774430K->469319K(5201920K),
>>>>>> 0.0235020 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
>>>>>> 2012-02-06T09:50:31.076+0100: 328772.266: [GC 328772.267: [ParNew:
>>>>>> 350249K->22264K(368640K), 0.0235480 secs] 796999K->469014K(5201920K),
>>>>>> 0.0243000 secs] [Times: user=0.18 sys=0.01, real=0.02 secs]
>>>>>> --------------------
>>>>>>
>>>>>> We're going to try to double the new size on a single node, to see the
>>>>>> effects of that.
>>>>>>
>>>>>> Beyond this experiment, is there any additional data I can collect to
>>>>>> better understand the nature of the promotion failures?
>>>>>> Should I be collecting free list statistics at this point?
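>>>>>> (I'm assuming -XX:PrintFLSStatistics=1 is the relevant flag for that --
>>>>>> please correct me if I'm wrong.)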
>>>>>>
>>>>>> Thanks,
>>>>>> Taras
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>
>>>>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

