How can we cut down those two CMS STW times?
Bartek Markocki
bartosz.markocki at gmail.com
Thu Jan 12 13:56:38 UTC 2012
Hi all,
We have a backend type of application which primary purpose is to
cache user specific graphs of objects. The graphs are relatively
small in size however the rate at which they can change (based on
users' requests) is key here.
Our main challenge was to figure out JVM settings that will handle the
peak memory allocation at the level of 4.5GB/s. To make things a bit
more challenging ;) we have a limited number of RAM on the box (as
there are multiple applications co-located on the box and the box has
just 64GB of RAM).
After a decent amount of testing we figured out the following settings
work for us:
-Xms6G -Xmx6G -Xmn3G -Xss256k -XX:MaxPermSize=512m -XX:PermSize=512m
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+CMSScavengeBeforeRemark
-XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
We run Java6 update 27 (64bit, server) on Solaris10.
The above settings work for us with exception of one CMS-initial marks
and one CMS-remark. By working I mean the STW is less than 1 second
for any STW pause.
There is one case when CMS-initial mark took 3.44 seconds. Here is the
extract from the log showing this situation:
90516.053: [GC 90516.053: [ParNew: 2633949K->154547K(2831168K),
0.1729255 secs] 4874963K->2395755K(5976896K), 0.1734674 secs] [Times:
user=3.07 sys=0.01, real=0.17 secs]
90516.846: [GC 90516.846: [ParNew: 2671155K->106975K(2831168K),
0.2183780 secs] 4906534K->2365720K(5976896K), 0.2188906 secs] [Times:
user=3.62 sys=0.05, real=0.22 secs]
90517.684: [GC 90517.684: [ParNew: 2623583K->106936K(2831168K),
0.0690728 secs] 4833212K->2316857K(5976896K), 0.0695870 secs] [Times:
user=1.20 sys=0.01, real=0.07 secs]
90517.976: [CMS-concurrent-sweep: 4.574/5.767 secs] [Times:
user=121.01 sys=1.90, real=5.77 secs]
90517.976: [CMS-concurrent-reset-start]
90518.112: [CMS-concurrent-reset: 0.136/0.136 secs] [Times: user=2.76
sys=0.05, real=0.14 secs]
90520.117: [GC [1 CMS-initial-mark: 2209921K(3145728K)]
4768007K(5976896K), 3.4458003 secs] [Times: user=3.45 sys=0.00,
real=3.45 secs]
90523.564: [CMS-concurrent-mark-start]
90523.623: [GC 90523.623: [ParNew: 2623544K->119747K(2831168K),
0.1848339 secs] 4833465K->2329818K(5976896K), 0.1853529 secs] [Times:
user=3.29 sys=0.01, real=0.19 secs]
90526.087: [CMS-concurrent-mark: 2.314/2.523 secs] [Times: user=18.11
sys=0.18, real=2.52 secs]
90526.087: [CMS-concurrent-preclean-start]
90526.155: [CMS-concurrent-preclean: 0.058/0.068 secs] [Times:
user=0.16 sys=0.00, real=0.07 secs]
90526.155: [CMS-concurrent-abortable-preclean-start]
90531.301: [GC 90531.301: [ParNew: 2636355K->45254K(2831168K),
0.0206247 secs] 4846426K->2255579K(5976896K), 0.0211745 secs] [Times:
user=0.33 sys=0.00, real=0.02 secs]
CMS: abort preclean due to time 90531.470:
[CMS-concurrent-abortable-preclean: 5.271/5.315 secs] [Times:
user=18.85 sys=0.26, real=5.32 secs]
90531.476: [GC[YG occupancy: 662977 K (2831168 K)]90531.476: [GC
90531.476: [ParNew: 662977K->21990K(2831168K), 0.0342782 secs]
2873302K->2232487K(5976896K), 0.0347927 secs] [Times: user=0.40
sys=0.01, real=0.04 secs]
90531.511: [Rescan (parallel) , 0.0074306 secs]90531.519: [weak refs
processing, 0.0000864 secs]90531.519: [class unloading, 0.0350356
secs]90531.554: [scrub symbol & string tables, 0.0266258 secs] [1
CMS-remark: 2210497K(3145728K)] 2232487K(5976896K), 0.1197919 secs]
[Times: user=0.59 sys=0.01, real=0.12 secs]
90531.597: [CMS-concurrent-sweep-start]
90532.212: [GC 90532.212: [ParNew: 2538598K->14216K(2831168K),
0.0162798 secs] 4744071K->2219741K(5976896K), 0.0167729 secs] [Times:
user=0.26 sys=0.00, real=0.02 secs]
90532.865: [GC 90532.865: [ParNew: 2530824K->18587K(2831168K),
0.0192318 secs] 4732677K->2220478K(5976896K), 0.0197659 secs] [Times:
user=0.31 sys=0.00, real=0.02 secs]
90533.500: [GC 90533.500: [ParNew: 2535195K->20886K(2831168K),
0.0206055 secs] 4731793K->2217494K(5976896K), 0.0211439 secs] [Times:
user=0.33 sys=0.00, real=0.02 secs]
Of course almost immediately one can notice that young generation was
almost full during that time, so what happened should not be a
surprise. After some googling I found that a similar topic was
discussed on this group in 2010 – with indication that it is caused by
the 6412968 bug (CMS: Long initial mark). We tried suggested
workarounds and found out that they cannot be applied in our case
(limited number of available RAM) and sooner or later we hit the
promotion or/and concurrent mode failure with even worse STW time.
Unfortunately, as in 2010, bugs.sun.com does not show the bug so I
cannot check if there was any update for the bug, so here comes my
first question: was there any update for the bug (what’s the status
of the bug)?
The next problem that we faced was related to another CMS related bug
(6990419). After we applied the suggested workaround (to enable
scavenge before remark) the problem was almost completely removed with
one exception. There is the following CMS remark:
199848.296: [GC 199848.296: [ParNew: 2689127K->119235K(2831168K),
0.0906522 secs] 4868292K->2309941K(5976896K), 0.0912736 secs] [Times:
user=1.22 sys=0.02, real=0.09 secs]
199853.617: [GC 199853.617: [ParNew: 2635843K->91628K(2831168K),
0.1040602 secs] 4826549K->2311078K(5976896K), 0.1046178 secs] [Times:
user=1.15 sys=0.01, real=0.10 secs]
199853.726: [GC [1 CMS-initial-mark: 2219449K(3145728K)]
2311170K(5976896K), 0.1208219 secs] [Times: user=0.12 sys=0.00,
real=0.12 secs]
199853.847: [CMS-concurrent-mark-start]
199856.405: [CMS-concurrent-mark: 2.547/2.557 secs] [Times: user=18.49
sys=0.35, real=2.56 secs]
199856.405: [CMS-concurrent-preclean-start]
199856.438: [CMS-concurrent-preclean: 0.031/0.033 secs] [Times:
user=0.13 sys=0.00, real=0.03 secs]
199856.439: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 199861.899:
[CMS-concurrent-abortable-preclean: 5.443/5.460 secs] [Times:
user=27.67 sys=1.14, real=5.46 secs]
199861.903: [GC[YG occupancy: 1353639 K (2831168 K)]199861.903:
[Rescan (parallel) , 1.4282026 secs]199863.332: [weak refs processing,
0.0019473 secs]199863.334: [class unloading, 0.0365617
secs]199863.370: [scrub symbol & string tables, 0.0267902 secs] [1
CMS-remark: 2219449K(3145728K)] 3573089K(5976896K), 1.5099836 secs]
[Times: user=12.20 sys=0.17, real=1.51 secs]
199863.414: [CMS-concurrent-sweep-start]
199863.420: [GC 199863.421: [ParNew: 1355519K->53699K(2831168K),
0.1129519 secs] 3574969K->2308972K(5976896K), 0.1138995 secs] [Times:
user=1.10 sys=0.01, real=0.11 secs]
199865.857: [CMS-concurrent-sweep: 2.324/2.443 secs] [Times:
user=10.15 sys=0.61, real=2.44 secs]
199865.857: [CMS-concurrent-reset-start]
199865.888: [CMS-concurrent-reset: 0.031/0.031 secs] [Times: user=0.05
sys=0.00, real=0.03 secs]
199893.779: [GC 199893.780: [ParNew: 2570307K->58285K(2831168K),
0.0397922 secs] 4620179K->2108197K(5976896K), 0.0403072 secs] [Times:
user=0.68 sys=0.00, real=0.04 secs]
199906.510: [GC 199906.510: [ParNew: 2574893K->55484K(2831168K),
0.0390212 secs] 4624805K->2105432K(5976896K), 0.0395148 secs] [Times:
user=0.67 sys=0.01, real=0.04 secs]
There are two things to notice here:
1. The time of this rescan was 20 times longer than any other rescan
time (1.4 seconds comparing to 58 ms)
2. There was no minor GC before CMS-remark even though it was
explicitly requested.
The question here is: is that something already covered by the
6990419 for which workaround simply does not work or something else?
Thank you,
Bartek
_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
More information about the hotspot-gc-dev
mailing list