Excessive concurrent-abortable-preclean wall time?

Eagle Kessler eagle.kessler at gmail.com
Thu Oct 15 13:43:47 PDT 2009


A bit of background:
We had a service (a 1.4GB in-memory database cache serving short-lived
requests) running happily in a 2GB heap (512MB young generation). It
was ticking along perfectly acceptably, promoting around 1MB of
young-generation objects every minute, which were cleaned out by
periodic CMS cycles.
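
(For a sense of the allocation pattern, the service looks roughly like
the sketch below; the class and method names are made up for
illustration, not our actual code. The point is a large long-lived
cache plus small per-request objects, so very little of what each
young collection scans is still live.)

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheService {
    // Roughly 1.4GB of long-lived entries; these live in the old generation.
    private final Map<String, byte[]> cache =
        new ConcurrentHashMap<String, byte[]>();

    public String handleRequest(String key) {
        byte[] row = cache.get(key);  // long-lived data is read, not copied

        // Everything allocated below is per-request and normally dies in
        // the young generation before it can be tenured.
        StringBuilder response = new StringBuilder();
        response.append(key).append(':');
        response.append(row == null ? 0 : row.length);
        return response.toString();
    }
}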

Seventeen days in, we hit a CMS concurrent mode failure, paused for
ten seconds to do a compacting collection, and ops decided (based on
their response-time monitors) that the service had gone down and
restarted it. When they saw the concurrent mode failure in the logs,
they bumped the service up to a 3GB heap (1GB young generation), on
reasoning I'm not entirely sure of, and asked us to look into why we
hit the compacting collection.

Looking at the logs, it seemed that each young-generation collection
caught around 1MB of live transient data and promoted it, and those
promotions eventually caused enough fragmentation to force a
compacting collection. The old generation was definitely not growing
between collections or anything like that, but fragmentation as the
cause is a guess on my part.
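
(As a possible way to test the fragmentation guess: I believe
-XX:PrintFLSStatistics=1 makes CMS print free-list statistics,
including a fragmentation metric, around collections, though I'm not
sure whether that flag is available on 1.5.0_14, so treat it as
something to check rather than a known-good option.)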

We asked ops to add -XX:MaxTenuringThreshold=2 and
-XX:SurvivorRatio=128, in the hope that a nonzero tenuring threshold
would keep the live transient data from being promoted. When that
change was applied, though, we began seeing constant, and extremely
poorly-performing, CMS collections:

778.842: [GC [1 CMS-initial-mark: 1403910K(2097152K)]
1955787K(3137664K), 0.3734370 secs]
779.216: [CMS-concurrent-mark-start]
782.234: [CMS-concurrent-mark: 3.017/3.018 secs]
782.234: [CMS-concurrent-preclean-start]
782.242: [CMS-concurrent-preclean: 0.008/0.008 secs]
782.242: [CMS-concurrent-abortable-preclean-start]
856.748: [GC 856.749: [ParNew
Desired survivor size 4128768 bytes, new threshold 2 (max 2)
- age   1:    1159440 bytes,    1159440 total
- age   2:     147320 bytes,    1306760 total
: 1033744K->1301K(1040512K), 0.0085134 secs]
2437655K->1405297K(3137664K), 0.0087125 secs]
933.642: [CMS-concurrent-abortable-preclean: 12.932/151.400 secs]
933.642: [GC[YG occupancy: 517886 K (1040512 K)]933.643: [Rescan
(parallel) , 0.2396928 secs]933.882: [weak refs processing, 0.0026955
secs] [1 CMS-remark: 1403995K(2097152K)] 1921882K(3137664K), 0.2425640
secs]
933.885: [CMS-concurrent-sweep-start]
934.685: [CMS-concurrent-sweep: 0.799/0.799 secs]
934.685: [CMS-concurrent-reset-start]
934.700: [CMS-concurrent-reset: 0.015/0.015 secs]
936.702: [GC [1 CMS-initial-mark: 1403961K(2097152K)]
1955265K(3137664K), 0.3780262 secs]
937.081: [CMS-concurrent-mark-start]
940.148: [CMS-concurrent-mark: 3.067/3.067 secs]
940.148: [CMS-concurrent-preclean-start]
940.155: [CMS-concurrent-preclean: 0.007/0.008 secs]
940.155: [CMS-concurrent-abortable-preclean-start]
1012.714: [GC 1012.714: [ParNew
Desired survivor size 4128768 bytes, new threshold 2 (max 2)
- age   1:    1300328 bytes,    1300328 total
- age   2:     153424 bytes,    1453752 total
: 1033749K->1436K(1040512K), 0.0187176 secs]
2437711K->1405484K(3137664K), 0.0188932 secs]
1097.952: [CMS-concurrent-abortable-preclean: 13.428/157.797 secs]
1097.954: [GC[YG occupancy: 518322 K (1040512 K)]1097.954: [Rescan
(parallel) , 0.2218981 secs]1098.176: [weak refs processing, 0.0026654
secs] [1 CMS-remark: 1404048K(2097152K)] 1922370K(3137664K), 0.2247333
secs]
1098.180: [CMS-concurrent-sweep-start]
1098.939: [CMS-concurrent-sweep: 0.759/0.759 secs]
1098.939: [CMS-concurrent-reset-start]
1098.954: [CMS-concurrent-reset: 0.015/0.015 secs]
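
(As a sanity check on the survivor sizing shown in those "Desired
survivor size" lines: with -XX:NewSize=1024m and -XX:SurvivorRatio=128,
each survivor space is 1024MB / (128 + 2), or roughly 7.9MB, and with
the default TargetSurvivorRatio of 50 the desired survivor size is
about half of that, which matches the 4128768 bytes (about 3.9MB)
reported. With MaxTenuringThreshold=2, an object is copied between
survivor spaces at most twice before being promoted.)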

Those two cycles are representative of the entire log (attached); the
main complaints are the constant CMS cycles and the lengthy
CMS-concurrent-abortable-preclean wall time.
We've since added -XX:CMSInitiatingOccupancyFraction=85
-XX:+UseCMSInitiatingOccupancyOnly to prevent the constant cycles,
which may have fixed the lengthy preclean time as well; we haven't
hit a full GC yet, but I can share that data when it is available. The
system is a dual quad-core AMD server running Solaris and 64-bit
server Java 1.5.0_14.
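
(For the numbers behind the new occupancy threshold: the old generation
is 2097152K, so -XX:CMSInitiatingOccupancyFraction=85 with
-XX:+UseCMSInitiatingOccupancyOnly tells CMS to start a cycle only once
old-generation occupancy passes 85% of that, about 1782579K, well above
the roughly 1404000K occupancy the log shows at each initial mark,
instead of letting the collector's own heuristics start back-to-back
cycles at that level.)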

This all leads to three questions:
1) Why did we start seeing constant CMS cycles with the increased heap
and addition of MaxTenuringThreshold?
2) Why did CMS-concurrent-abortable-preclean begin taking so much wall
time when CMS was running constantly?
3) Was adding CMSInitiatingOccupancyFraction and
UseCMSInitiatingOccupancyOnly the right way to solve this problem? Is
there a better one?

The original options were:
-server -Xms2g -Xmx2g
-XX:NewSize=512m -XX:MaxNewSize=512m
-XX:PermSize=128m -XX:MaxPermSize=128m
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128

The poor-performance options were:
-server -Xms3g -Xmx3g
-XX:NewSize=1024m -XX:MaxNewSize=1024m
-XX:PermSize=128m -XX:MaxPermSize=128m
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=128

The current (working) options are:
-server -Xms3g -Xmx3g
-XX:NewSize=1024m -XX:MaxNewSize=1024m
-XX:PermSize=128m -XX:MaxPermSize=128m
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=128
-XX:CMSInitiatingOccupancyFraction=85 -XX:+UseCMSInitiatingOccupancyOnly




-- 
Jacob Kessler
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poorGC.log
Type: application/octet-stream
Size: 17929 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20091015/868c16c0/attachment-0001.obj 

