Unreasonably long ParNew, CMS pause times

Eagle Kessler eagle.kessler at gmail.com
Thu Jul 22 11:36:50 PDT 2010


We've switched to the throughput collector with a max pause time goal, and
that seems to be behaving well. I'm still curious about the behavior that
I'm seeing with CMS, though.
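
For concreteness, the throughput-collector setup we moved to is along these
lines (the specific flags and pause goal below are illustrative rather than
exact):

JVM_GC_TYPE="-XX:+UseParallelGC -XX:+UseParallelOldGC"
JVM_GC_OPTS="-XX:MaxGCPauseMillis=200"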

I ran some tests with -XX:+PrintTenuringDistribution, both with MTT=0 and
without setting MaxTenuringThreshold at all, and I'm still seeing the
behavior. Both tests were executed under constant load; abbreviated logs from
each run are below.
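
Roughly, the flags for these runs were along these lines (reusing the
variable names from the config quoted below; survivor sizing omitted, and
-XX:MaxTenuringThreshold=0 was present only in the first run):

JVM_SIZE="-server -Xms512m -Xmx512m"
JVM_SIZE_NEW="-XX:NewSize=128m -XX:MaxNewSize=128m"
JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
JVM_GC_OPTS="-XX:+PrintTenuringDistribution"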

512MB heap (128MB young gen), MTT=0:
62.537: [GC 62.537: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0064317 secs] 143985K->16058K(523328K), 0.0065156
secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
95.633: [GC 95.633: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0030186 secs] 145210K->16614K(523328K), 0.0030929
secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
127.882: [GC 127.882: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0042069 secs] 145766K->17285K(523328K), 0.0042783
secs] [Times: user=0.01 sys=0.02, real=0.00 secs]
158.086: [GC 158.086: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0045986 secs] 146437K->17955K(523328K), 0.0046896
secs] [Times: user=0.02 sys=0.02, real=0.01 secs]

Rising to
5021.415: [GC 5021.415: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0604269 secs] 225694K->97083K(523328K), 0.0605133
secs] [Times: user=0.10 sys=0.61, real=0.06 secs]
5053.611: [GC 5053.611: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0569357 secs] 226235K->97636K(523328K), 0.0570316
secs] [Times: user=0.10 sys=0.60, real=0.05 secs]
5084.401: [GC 5084.401: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 129152K->0K(130112K), 0.0591064 secs] 226788K->98088K(523328K), 0.0591840
secs] [Times: user=0.09 sys=0.62, real=0.06 secs]


Similarly, 512MB heap, 128MB young gen, and no MTT set:
83.708: [GC 83.708: [ParNew
Desired survivor size 3702784 bytes, new threshold 15 (max 15)
- age   1:     626008 bytes,     626008 total
- age   2:     736336 bytes,    1362344 total
: 120669K->1759K(123840K), 0.0034469 secs] 135502K->16593K(517056K),
0.0035234 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
109.627: [GC 109.627: [ParNew
Desired survivor size 3702784 bytes, new threshold 15 (max 15)
- age   1:     634688 bytes,     634688 total
- age   2:     268624 bytes,     903312 total
- age   3:     736016 bytes,    1639328 total
: 118367K->2275K(123840K), 0.0036776 secs] 133201K->17109K(517056K),
0.0037455 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
137.495: [GC 137.495: [ParNew
Desired survivor size 3702784 bytes, new threshold 15 (max 15)
- age   1:     479728 bytes,     479728 total
- age   2:     269064 bytes,     748792 total
- age   3:     267904 bytes,    1016696 total
- age   4:     735952 bytes,    1752648 total
: 118883K->2396K(123840K), 0.0040557 secs] 133717K->17230K(517056K),
0.0041302 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
165.090: [GC 165.090: [ParNew
Desired survivor size 3702784 bytes, new threshold 15 (max 15)
- age   1:     403856 bytes,     403856 total
- age   2:     267736 bytes,     671592 total
- age   3:     268616 bytes,     940208 total
- age   4:     267904 bytes,    1208112 total
- age   5:     729072 bytes,    1937184 total
: 119004K->2668K(123840K), 0.0046744 secs] 133838K->17501K(517056K),
0.0047473 secs] [Times: user=0.04 sys=0.02, real=0.00 secs]

Rising to
4981.917: [GC 4981.918: [ParNew
Desired survivor size 3702784 bytes, new threshold 14 (max 15)
- age   1:     533872 bytes,     533872 total
- age   2:     269336 bytes,     803208 total
- age   3:     268048 bytes,    1071256 total
- age   4:     268272 bytes,    1339528 total
- age   5:     265880 bytes,    1605408 total
- age   6:     241704 bytes,    1847112 total
- age   7:     241112 bytes,    2088224 total
- age   8:     239680 bytes,    2327904 total
- age   9:     233632 bytes,    2561536 total
- age  10:     231040 bytes,    2792576 total
- age  11:     231040 bytes,    3023616 total
- age  12:     232256 bytes,    3255872 total
- age  13:     232256 bytes,    3488128 total
- age  14:     231040 bytes,    3719168 total
: 122652K->4440K(123840K), 0.0654827 secs] 173869K->56113K(517056K),
0.0655671 secs] [Times: user=0.17 sys=0.62, real=0.07 secs]
5009.679: [GC 5009.679: [ParNew
Desired survivor size 3702784 bytes, new threshold 14 (max 15)
- age   1:     673136 bytes,     673136 total
- age   2:     271704 bytes,     944840 total
- age   3:     269232 bytes,    1214072 total
- age   4:     268088 bytes,    1482160 total
- age   5:     264528 bytes,    1746688 total
- age   6:     244280 bytes,    1990968 total
- age   7:     238320 bytes,    2229288 total
- age   8:     241112 bytes,    2470400 total
- age   9:     233416 bytes,    2703816 total
- age  10:     231040 bytes,    2934856 total
- age  11:     231040 bytes,    3165896 total
- age  12:     231040 bytes,    3396936 total
- age  13:     232256 bytes,    3629192 total
- age  14:     232256 bytes,    3861448 total
: 121048K->5058K(123840K), 0.0675964 secs] 172721K->56957K(517056K),
0.0677232 secs] [Times: user=0.17 sys=0.66, real=0.06 secs]
5037.742: [GC 5037.742: [ParNew
Desired survivor size 3702784 bytes, new threshold 14 (max 15)
- age   1:     582296 bytes,     582296 total
- age   2:     268128 bytes,     850424 total
- age   3:     271528 bytes,    1121952 total
- age   4:     269192 bytes,    1391144 total
- age   5:     264952 bytes,    1656096 total
- age   6:     242496 bytes,    1898592 total
- age   7:     240752 bytes,    2139344 total
- age   8:     238320 bytes,    2377664 total
- age   9:     234776 bytes,    2612440 total
- age  10:     231040 bytes,    2843480 total
- age  11:     231040 bytes,    3074520 total
- age  12:     231040 bytes,    3305560 total
- age  13:     231040 bytes,    3536600 total
- age  14:     232256 bytes,    3768856 total
: 121666K->5991K(123840K), 0.0649960 secs] 173565K->58116K(517056K),
0.0650795 secs] [Times: user=0.17 sys=0.64, real=0.06 secs]


I'll look into why ~232KB of data survives seven minutes of collections, but
in both cases a full collection (triggered through JMX) brought the heap down
to ~12MB and removed the "accumulated" ParNew time. Any idea why I'm seeing
the slow increase in ParNew collection times, and/or more tests I should be
running to diagnose it? I can provide the full logs if you'd like; I've
abbreviated them here to avoid sending long messages to the entire list.
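
For reference, "triggered through JMX" just means invoking the gc() operation
on the java.lang:type=Memory MBean, either from jconsole or with a small
client along these lines (the host/port are placeholders and assume the usual
com.sun.management.jmxremote.port setup on the target JVM; the gc() operation
is effectively System.gc()):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceFullGC {
    public static void main(String[] args) throws Exception {
        // Placeholder address: point this at the service's JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            // Same operation jconsole exposes under java.lang:type=Memory -> gc(),
            // which behaves like System.gc().
            server.invoke(new ObjectName("java.lang:type=Memory"), "gc", null, null);
        } finally {
            connector.close();
        }
    }
}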


On Wed, Jul 21, 2010 at 2:16 PM, Y. S. Ramakrishna <
y.s.ramakrishna at oracle.com> wrote:

>
>
> On 07/21/10 13:41, Eagle Kessler wrote:
>
>> Checking the configs, it looks like we are explicitly setting MTT to 0:
>>
>> # Min, max, total JVM size (-Xms -Xmx)
>> JVM_SIZE="-server -Xms2g -Xmx2g"
>>
>> # New Generation Sizes (-XX:NewSize -XX:MaxNewSize)
>>
>> JVM_SIZE_NEW="-XX:NewSize=512m -XX:MaxNewSize=512m"
>>
>> # Perm Generation Sizes (-XX:PermSize -XX:MaxPermSize)
>> JVM_SIZE_PERM="-XX:PermSize=128m -XX:MaxPermSize=128m"
>>
>> # Type of Garbage Collector to use
>>
>> JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
>>
>> JVM_GC_OPTS="-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128"
>>
>> I agree that in this case we definitely want to be using survivor spaces
>> (I'm unfortunately not in charge of the default GC settings yet). However, I
>> didn't know that running without survivor spaces could cause this kind of
>> behavior. Why does running without survivor spaces cause such a large
>> performance issue?
>>
>
> See below.
>
>
>
>> Regardless, I'll ask that -XX:+PrintTenuringDistribution be added to the
>> configs, along with a non-zero MTT, and see if the issue persists. The
>>
>
> And of course set SurvivorRatio to a more reasonable value like 6 or 8,
> depending on the expected survival rate, etc.
>
>
>>  rising ParNew times seem like they would be unrelated to the tenuring
>> threshold, though, wouldn't they?
>>
>
> No, it's related in the sense that MTT=0 was causing very short-lived
> data to be promoted into the old gen. This can cause object demographics
> (size and age distribution) to be very non-stationary and confuse the
> heuristics for sizing the free list inventory. It of course also places
> large pressure on the old gen allocator, increases fragmentation,
> increases mutation rates, and so on.
>
> The only thing it might not affect much is the initial mark pauses,
> which will probably stay as they were before.
>
> -- ramki
>
>
>> On Wed, Jul 21, 2010 at 12:40 PM, Y. S. Ramakrishna <
>> y.s.ramakrishna at oracle.com> wrote:
>>
>>    Oh, and the "(max 0)" seems to imply you have somehow ended up
>>    asking for MaxTenuringThreshold=0, which can be disastrous. It is
>>    possible that the default settings for CMS on certain very old JVMs
>>    did indeed result in MTT=0 (though my memory of that era is fading),
>>    so if you are using JVMs of such older vintage, explicitly setting
>>    MTT and SurvivorRatio higher is a good idea.
>>
>>    -- ramki
>>
>>
>>    On 07/21/10 12:34, Y. S. Ramakrishna wrote:
>>
>>        Yes, like what I think Matt is getting at, I'd configure
>>        sufficiently large survivor spaces. Even if you expect most of
>>        your objects to die young, you'd want survivor spaces large
>>        enough to keep at least age 1 objects in the survivor space. If,
>>        as you state, no medium- to long-lived state is retained, your
>>        data is mostly short-lived and you'll be able to do without any
>>        promotion at all. Your problem here is that somehow your
>>        survivor spaces may have disappeared. (+PrintHeapAtGC will tell
>>        you, and of course examining your JVM options should shed more
>>        light on that apparent disappearance.)
>>
>>        -- ramki
>>
>>        On 07/21/10 10:09, Eagle Kessler wrote:
>>
>>            Hello all,
>>            I've got a web service that's showing some very strange
>>            garbage collection behavior, and I was wondering if someone
>>            here could explain why it might be happening. The service
>>            itself is fairly simple: take in data from a few sources and
>>            merge it with the existing data in the database. It stores
>>            nearly no state while doing this, and indeed heap dumps taken
>>            1, 24, and 72 hours after boot indicate that we have a
>>            consistent ~12MB of live data (in a 2GB heap, but I hope
>>            that's not what is causing this).
>>
>>            The GC logs, though, don't look very happy at all. After our
>>            start-up period, they settle into a fairly consistent pattern:
>>
>>            1041.159: [GC 1041.159: [ParNew
>>            Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>            : 516224K->0K(520256K), 0.0170322 secs] 537266K->22659K(2093120K), 0.0171330 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
>>            1606.500: [GC 1606.500: [ParNew
>>            Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>            : 516224K->0K(520256K), 0.0173235 secs] 538883K->24214K(2093120K), 0.0174138 secs] [Times: user=0.04 sys=0.03, real=0.02 secs]
>>            2040.773: [GC 2040.773: [ParNew
>>            Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>            : 516224K->0K(520256K), 0.0196319 secs] 540438K->25737K(2093120K), 0.0197275 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
>>            It would be very nice if it kept going like that. However,
>>            by the time the CMS collector first runs, things aren't
>>            working nearly as well:
>>
>>            214182.439: [GC 214182.439: [ParNew
>>            Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>            : 516224K->0K(520256K), 1.0146996 secs] 1297278K->782575K(2093120K), 1.0148799 secs] [Times: user=1.21 sys=0.58, real=1.01 secs]
>>            214247.437: [GC 214247.438: [ParNew
>>            Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>            : 516224K->0K(520256K), 1.2310274 secs] 1298799K->784188K(2093120K), 1.2311534 secs] [Times: user=1.46 sys=0.69, real=1.23 secs]
>>            214313.642: [GC 214313.642: [ParNew
>>            Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>            : 516224K->0K(520256K), 1.2183258 secs] 1300412K->785710K(2093120K), 1.2184848 secs] [Times: user=1.45 sys=0.65, real=1.22 secs]
>>
>>            The increasing sys time is a bit worrying, but it seems like
>>            the actual GC time is rising as well, even though we aren't
>>            collecting any more young-gen garbage. At this point, CMS
>>            went off:
>>
>>            214380.695: [GC [1 CMS-initial-mark: 787188K(1572864K)] 787195K(2093120K), 1.8929842 secs] [Times: user=1.50 sys=0.02, real=1.89 secs]
>>            214382.589: [CMS-concurrent-mark-start]
>>            214383.056: [CMS-concurrent-mark: 0.467/0.467 secs] [Times: user=1.81 sys=0.01, real=0.47 secs]
>>            214383.056: [CMS-concurrent-preclean-start]
>>            214383.064: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>            214383.064: [CMS-concurrent-abortable-preclean-start]
>>             CMS: abort preclean due to time 214388.133: [CMS-concurrent-abortable-preclean: 0.242/5.069 secs] [Times: user=5.02 sys=0.02, real=5.07 secs]
>>            214388.159: [GC[YG occupancy: 51943 K (520256 K)]214388.159: [Rescan (parallel) , 1.5403455 secs]214389.699: [weak refs processing, 0.0050170 secs] [1 CMS-remark: 787188K(1572864K)] 839132K(2093120K), 1.5458536 secs] [Times: user=1.80 sys=0.71, real=1.55 secs]
>>            214389.705: [CMS-concurrent-sweep-start]
>>            214390.439: [CMS-concurrent-sweep: 0.671/0.734 secs] [Times: user=1.35 sys=0.00, real=0.73 secs]
>>            214390.439: [CMS-concurrent-reset-start]
>>            214390.621: [CMS-concurrent-reset: 0.183/0.183 secs] [Times: user=0.20 sys=0.02, real=0.18 secs]
>>
>>            It seems like (a) the initial mark shouldn't take 1.8
>>            seconds, (b) if we really do only have ~12MB of live data,
>>            CMS should have collected a lot more than it did (the next
>>            ParNew collection reported ~545MB of old gen in use), and
>>            (c) 50% heap usage with very little promotion seems very
>>            early for the collector to go off.
>>
>>            The next CMS cycle is at 434,973 seconds, by which point the
>>            young gen collections are taking 3 seconds (user 3.59, sys
>>            1.60, real 3.09). The initial mark takes 4.82 seconds (user
>>            3.82, sys 0.02, real 4.82), and the cycle sweeps the old gen
>>            down to 1.1GB used. I haven't yet confirmed it, but given the
>>            previous heap dumps I'd guess that's ~12MB of live objects
>>            and ~1.1GB of dead objects. The current young gen collections
>>            (at 497,601 seconds) are taking ~3.7 seconds (4.33 user, 2.03
>>            sys). Any idea what could be going on here? We're running
>>            JDK 1.6_16.
>>
>>            --             Eagle Kessler
>>
>>
>>
>>
>> --
>> Eagle Kessler
>>
>


-- 
Eagle Kessler