Unreasonably long ParNew, CMS pause times

Eagle Kessler eagle.kessler at gmail.com
Wed Jul 21 10:09:25 PDT 2010


Hello all,
  I've got a web service that's showing some very strange garbage
collection behavior, and I was wondering if someone here could explain why
it might be happening. The service itself is fairly simple: take in data
from a few sources and merge it with the existing data in the database. It
stores almost no state while doing this, and indeed heap dumps taken 1, 24,
and 72 hours after boot show a consistent ~12MB of live data (in a 2GB
heap, but I hope that's not what is causing this).
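
(For reference, the ~12MB figure is the live set. With the Sun JDK a
live-only dump can be taken with jmap, e.g.

    jmap -dump:live,format=b,file=heap.hprof <pid>

where -dump:live means only reachable objects end up in the dump. That
invocation is illustrative rather than our exact one.)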

The GC logs, though, don't look very happy at all. After our start-up
period, they settle into a fairly consistent pattern:

1041.159: [GC 1041.159: [ParNew
Desired survivor size 2064384 bytes, new threshold 0 (max 0)
: 516224K->0K(520256K), 0.0170322 secs] 537266K->22659K(2093120K),
0.0171330 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
1606.500: [GC 1606.500: [ParNew
Desired survivor size 2064384 bytes, new threshold 0 (max 0)
: 516224K->0K(520256K), 0.0173235 secs] 538883K->24214K(2093120K),
0.0174138 secs] [Times: user=0.04 sys=0.03, real=0.02 secs]
2040.773: [GC 2040.773: [ParNew
Desired survivor size 2064384 bytes, new threshold 0 (max 0)
: 516224K->0K(520256K), 0.0196319 secs] 540438K->25737K(2093120K),
0.0197275 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
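
(A note on reading these: the "Desired survivor size ... new threshold 0
(max 0)" line is -XX:+PrintTenuringDistribution output, and "max 0" means
MaxTenuringThreshold is 0, i.e. anything that survives a scavenge is
promoted straight to the old gen. Flags along these lines produce this log
format; our real command line differs in the details:

    java -Xmx2g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:MaxTenuringThreshold=0 \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -XX:+PrintTenuringDistribution \
         ...
)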

It would be very nice if it kept going like that. By the time the CMS
collector first runs, however, things aren't working nearly as well:

214182.439: [GC 214182.439: [ParNew
Desired survivor size 2064384 bytes, new threshold 0 (max 0)
: 516224K->0K(520256K), 1.0146996 secs] 1297278K->782575K(2093120K),
1.0148799 secs] [Times: user=1.21 sys=0.58, real=1.01 secs]
214247.437: [GC 214247.438: [ParNew
Desired survivor size 2064384 bytes, new threshold 0 (max 0)
: 516224K->0K(520256K), 1.2310274 secs] 1298799K->784188K(2093120K),
1.2311534 secs] [Times: user=1.46 sys=0.69, real=1.23 secs]
214313.642: [GC 214313.642: [ParNew
Desired survivor size 2064384 bytes, new threshold 0 (max 0)
: 516224K->0K(520256K), 1.2183258 secs] 1300412K->785710K(2093120K),
1.2184848 secs] [Times: user=1.45 sys=0.65, real=1.22 secs]

The increasing sys time is a bit worrying (more on that at the end of this
mail), and the actual GC time is rising as well, even though we aren't
collecting any more young-gen garbage than before. At this point, CMS went
off:

214380.695: [GC [1 CMS-initial-mark: 787188K(1572864K)]
787195K(2093120K), 1.8929842 secs] [Times: user=1.50 sys=0.02,
real=1.89 secs]
214382.589: [CMS-concurrent-mark-start]
214383.056: [CMS-concurrent-mark: 0.467/0.467 secs] [Times: user=1.81
sys=0.01, real=0.47 secs]
214383.056: [CMS-concurrent-preclean-start]
214383.064: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times:
user=0.01 sys=0.00, real=0.01 secs]
214383.064: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 214388.133:
[CMS-concurrent-abortable-preclean: 0.242/5.069 secs] [Times:
user=5.02 sys=0.02, real=5.07 secs]
214388.159: [GC[YG occupancy: 51943 K (520256 K)]214388.159: [Rescan
(parallel) , 1.5403455 secs]214389.699: [weak refs processing,
0.0050170 secs] [1 CMS-remark: 787188K(1572864K)] 839132K(2093120K),
1.5458536 secs] [Times: user=1.80 sys=0.71, real=1.55 secs]
214389.705: [CMS-concurrent-sweep-start]
214390.439: [CMS-concurrent-sweep: 0.671/0.734 secs] [Times: user=1.35
sys=0.00, real=0.73 secs]
214390.439: [CMS-concurrent-reset-start]
214390.621: [CMS-concurrent-reset: 0.183/0.183 secs] [Times: user=0.20
sys=0.02, real=0.18 secs]

It seems like:
 a) initial-mark shouldn't take 1.8 seconds,
 b) if we really do have only ~12MB of live data, CMS should have collected
    a lot more than it did (the next ParNew collection reported ~545MB of
    old gen in use), and
 c) starting a cycle at ~50% heap usage, with very little promotion going
    on, seems very early.
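
On (c): unless an initiating occupancy has been set, CMS decides when to
start a cycle from its own pacing statistics, which can be conservative.
If it comes to it, the trigger can be pinned explicitly; a sketch only,
where the 75 is a placeholder rather than a tuned value:

    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly

UseCMSInitiatingOccupancyOnly keeps the collector from overriding the
fraction with its runtime estimates.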

The next CMS cycle is at 434,973 seconds, by which point the young gen
collections are taking 3 seconds (user 3.59, sys 1.60, real 3.09). The
initial mark takes 4.82 seconds (user 3.82, sys 0.02, real 4.82), and the
sweep only gets the old gen down to 1.1GB used. I haven't yet confirmed it,
but given the previous heap dumps I'd guess a dump would show ~12MB of live
objects and ~1.1GB of dead ones. The current young gen collections (at
497,601 seconds) are taking ~3.7 seconds (4.33 user, 2.03 sys). Any idea
what could be going on here? We're running JDK 1.6.0_16.
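
Given how much of each pause is sys time, one thing we still need to rule
out is the machine paging during collections; a mostly-idle 2GB heap is
exactly the kind of memory the OS is happy to swap out, and a scavenge then
touches a lot of it at once. Watching swap activity while the pauses happen
should settle it, e.g.

    vmstat 5

with nonzero si/so columns during a ParNew pointing at paging rather than
at GC work itself.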

-- 
Eagle Kessler