Unreasonably long ParNew, CMS pause times

Eagle Kessler eagle.kessler at gmail.com
Wed Jul 21 13:41:18 PDT 2010


Checking the configs, it looks like we are explicitly setting MTT to 0:

# Min, max, total JVM size (-Xms -Xmx)
JVM_SIZE="-server -Xms2g -Xmx2g"

# New Generation Sizes (-XX:NewSize -XX:MaxNewSize)
JVM_SIZE_NEW="-XX:NewSize=512m -XX:MaxNewSize=512m"

# Perm Generation Sizes (-XX:PermSize -XX:MaxPermSize)
JVM_SIZE_PERM="-XX:PermSize=128m -XX:MaxPermSize=128m"

# Type of Garbage Collector to use
JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

JVM_GC_OPTS="-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128"
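
For what it's worth, here is my rough arithmetic on what those settings give
us, assuming the usual HotSpot layout where each survivor space is roughly
NewSize / (SurvivorRatio + 2) (I haven't verified this against the source):

# Rough, unverified sizing for the settings above:
#   NewSize        = 512m = 524288K
#   Survivor space ~ 524288K / (128 + 2) ~ 4032K each
#   Eden           ~ 128 * 4032K ~ 516224K
# That matches the 516224K/520256K figures in the ParNew lines quoted below,
# and "Desired survivor size 2064384 bytes" is ~50% of a 4032K survivor space
# (the default TargetSurvivorRatio). With MTT=0 on top of that, anything that
# survives a minor collection is promoted straight into the old generation.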

I agree that in this case we definitely want to be using survivor spaces
(I'm unfortunately not in charge of the default GC settings yet). However, I
didn't know that running without survivor spaces could cause this kind of
behavior. Why does running without survivor spaces cause such a large
performance issue?

Regardless, I'll ask that -XX:+PrintTenuringDistribution be added to the
configs, along with a non-zero MTT, and see whether the issue persists. The
rising ParNew times seem unrelated to the tenuring threshold, though, don't
they?
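
Something along these lines is what I plan to propose; the specific values
are just a starting guess on my part, not something we've tested
(SurvivorRatio=8 is, as far as I know, the usual HotSpot default, and
MaxTenuringThreshold=4 is an arbitrary non-zero starting point):

# Proposed (untested) GC options: restore usable survivor spaces, let objects
# age for a few minor collections before promotion, and log tenuring ages and
# the heap layout at each GC so we can see what is actually happening.
JVM_GC_OPTS="-XX:MaxTenuringThreshold=4 -XX:SurvivorRatio=8 -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC"

I've included +PrintHeapAtGC per your earlier suggestion so we can confirm
the survivor spaces actually exist.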

On Wed, Jul 21, 2010 at 12:40 PM, Y. S. Ramakrishna <y.s.ramakrishna at oracle.com> wrote:

> Oh, and the "(max 0)" seems to imply you have somehow ended up asking for
> MaxTenuringThreshold=0, which can be disastrous. It is possible that the
> default settings for CMS on certain very old JVMs did indeed result in
> MTT=0 (though my memory of that era is fading), so if you are using JVMs of
> that older vintage, explicitly setting a non-zero MTT and a sensible
> SurvivorRatio is a good idea.
>
> -- ramki
>
>
> On 07/21/10 12:34, Y. S. Ramakrishna wrote:
>
>> Yes, along the lines of what I think Matt is getting at, I'd configure
>> sufficiently large survivor spaces. Even if you expect most of your objects
>> to die young, you'd want survivor spaces large enough to keep at least the
>> age-1 objects in the survivor space. If, as you state, no medium- to
>> long-lived state is retained, your data is mostly short-lived and you'll be
>> able to do without any promotion at all. Your problem here is that somehow
>> your survivor spaces may have disappeared. (+PrintHeapAtGC will tell you,
>> and of course examining your JVM options should shed more light on that
>> apparent disappearance.)
>>
>> -- ramki
>>
>> On 07/21/10 10:09, Eagle Kessler wrote:
>>
>>> Hello all,
>>> I've got a web service that is showing some very strange garbage
>>> collection behavior, and I was wondering if someone here could explain why
>>> it might be happening. The service itself is fairly simple: take in data
>>> from a few sources and merge it with the existing data in the database. It
>>> stores almost no state while doing this, and indeed heap dumps taken 1, 24,
>>> and 72 hours after boot show a consistent ~12 MB of live data (in a 2 GB
>>> heap, but I hope that's not what is causing this).
>>>
>>> The GC logs, though, don't look very happy at all. After our start-up
>>> period, they settle into a fairly consistent pattern:
>>>
>>> 1041.159: [GC 1041.159: [ParNew
>>> Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>> : 516224K->0K(520256K), 0.0170322 secs] 537266K->22659K(2093120K),
>>> 0.0171330 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
>>>
>>> 1606.500: [GC 1606.500: [ParNew
>>> Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>> : 516224K->0K(520256K), 0.0173235 secs] 538883K->24214K(2093120K),
>>> 0.0174138 secs] [Times: user=0.04 sys=0.03, real=0.02 secs]
>>>
>>> 2040.773: [GC 2040.773: [ParNew
>>> Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>> : 516224K->0K(520256K), 0.0196319 secs] 540438K->25737K(2093120K),
>>> 0.0197275 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
>>>
>>> It would be very nice if it kept going like that. However, by the
>>> first time the CMS collector runs, things aren't working nearly as well:
>>>
>>> 214182.439: [GC 214182.439: [ParNew
>>> Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>> : 516224K->0K(520256K), 1.0146996 secs] 1297278K->782575K(2093120K),
>>> 1.0148799 secs] [Times: user=1.21 sys=0.58, real=1.01 secs]
>>> 214247.437: [GC 214247.438: [ParNew
>>> Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>> : 516224K->0K(520256K), 1.2310274 secs] 1298799K->784188K(2093120K),
>>> 1.2311534 secs] [Times: user=1.46 sys=0.69, real=1.23 secs]
>>> 214313.642: [GC 214313.642: [ParNew
>>> Desired survivor size 2064384 bytes, new threshold 0 (max 0)
>>> : 516224K->0K(520256K), 1.2183258 secs] 1300412K->785710K(2093120K),
>>> 1.2184848 secs] [Times: user=1.45 sys=0.65, real=1.22 secs]
>>>
>>> The increasing sys time is a bit worrying, but the actual GC time seems to
>>> be rising as well, even though we aren't collecting any more young-gen
>>> garbage than before. At this point, CMS kicked in:
>>>
>>> 214380.695: [GC [1 CMS-initial-mark: 787188K(1572864K)]
>>> 787195K(2093120K), 1.8929842 secs] [Times: user=1.50 sys=0.02, real=1.89 secs]
>>> 214382.589: [CMS-concurrent-mark-start]
>>> 214383.056: [CMS-concurrent-mark: 0.467/0.467 secs] [Times: user=1.81 sys=0.01, real=0.47 secs]
>>> 214383.056: [CMS-concurrent-preclean-start]
>>> 214383.064: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>> 214383.064: [CMS-concurrent-abortable-preclean-start]
>>>  CMS: abort preclean due to time 214388.133: [CMS-concurrent-abortable-preclean: 0.242/5.069 secs] [Times: user=5.02 sys=0.02, real=5.07 secs]
>>> 214388.159: [GC[YG occupancy: 51943 K (520256 K)]214388.159: [Rescan (parallel) , 1.5403455 secs]214389.699: [weak refs processing, 0.0050170 secs] [1 CMS-remark: 787188K(1572864K)] 839132K(2093120K), 1.5458536 secs] [Times: user=1.80 sys=0.71, real=1.55 secs]
>>> 214389.705: [CMS-concurrent-sweep-start]
>>> 214390.439: [CMS-concurrent-sweep: 0.671/0.734 secs] [Times: user=1.35 sys=0.00, real=0.73 secs]
>>> 214390.439: [CMS-concurrent-reset-start]
>>> 214390.621: [CMS-concurrent-reset: 0.183/0.183 secs] [Times: user=0.20 sys=0.02, real=0.18 secs]
>>>
>>> It seems like a) initial-mark shouldn't take 1.8 seconds, b) if we really
>>> do have only ~12 MB of live data, CMS should have collected a lot more than
>>> it did (the next ParNew collection reported ~545 MB of old gen in use), and
>>> c) 50% heap usage with very little promotion seems very early for the
>>> collector to kick in.
>>>
>>> The next CMS cycle is at 434,973 seconds, by which point the young gen
>>> collections are taking 3 seconds (user 3.59, sys 1.60, real 3.09). The
>>> initial mark takes 4.82 seconds (user 3.82, sys 0.02, real 4.82), and the
>>> cycle only sweeps the old gen down to 1.1 GB in use. I haven't yet
>>> confirmed it, but given the previous heap dumps I'd guess a new dump would
>>> show ~12 MB of live objects and ~1.1 GB of dead ones. The current young gen
>>> collections (at 497,601 seconds) are taking ~3.7 seconds (user 4.33, sys
>>> 2.03). Any idea what could be going on here? We're running JDK 1.6.0_16.
>>>
>>> --
>>> Eagle Kessler
>>>
>>>
>


-- 
Eagle Kessler

