Long remark due to young generation occupancy
Gustav Åkesson
gustav.r.akesson at gmail.com
Mon Jan 13 02:50:00 PST 2014
Hi,
This is a topic which has been discussed before, but I think I have some
new findings. We're experiencing problems with CMS pauses.
These are the settings we are using:
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=68
-XX:MaxTenuringThreshold=0
-XX:+UseParNewGC
-XX:+ScavengeBeforeFullGC
-XX:CMSWaitDuration=30000
-Xmx2048M
-Xms2048M
-Xmn1024M
Note that MaxTenuringThreshold is 0. This is only done during testing to
provoke CMS to run more frequently (otherwise it runs once a day...). Due to
this, promotion to the old generation is around 400K to 1M per second.
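As a side note, since we are juggling quite a few flags, I sometimes
double-check which values are actually in effect at runtime. A minimal sketch
of the kind of helper I use (my own, not part of the application), based on
the HotSpotDiagnosticMXBean available in JDK 7:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;
    import com.sun.management.VMOption;

    public class PrintGcFlags {
        public static void main(String[] args) {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // The -XX flags from the settings above; -Xmx/-Xms/-Xmn are not
            // queried here since they are not plain -XX options.
            String[] flags = {
                "UseConcMarkSweepGC",
                "CMSInitiatingOccupancyFraction",
                "MaxTenuringThreshold",
                "UseParNewGC",
                "ScavengeBeforeFullGC",
                "CMSWaitDuration"
            };
            for (String flag : flags) {
                VMOption opt = bean.getVMOption(flag);
                System.out.println(opt.getName() + " = " + opt.getValue()
                        + " (origin: " + opt.getOrigin() + ")");
            }
        }
    }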
We have an allocation rate of roughly 1G per second, meaning that YGC runs
once every second.
We're running JDK7u17.
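I can't share the actual application, but roughly speaking the test load
behaves like the illustrative loop below (sizes and pacing are made up for
the example, not our real workload): mostly short-lived allocations at about
1G per second, which fills the 1024M young generation about once per second.

    import java.util.concurrent.ThreadLocalRandom;

    public class AllocationDriver {
        // Illustrative only: allocates short-lived byte[] chunks at roughly
        // 1 GB per second so that eden (-Xmn1024M) fills about once a second.
        public static void main(String[] args) throws InterruptedException {
            final int chunkSize = 64 * 1024;        // 64 KB per allocation
            final int chunksPerSecond = 16 * 1024;  // 16384 * 64 KB = 1 GB
            while (true) {
                long start = System.nanoTime();
                for (int i = 0; i < chunksPerSecond; i++) {
                    byte[] chunk = new byte[chunkSize]; // dies almost immediately
                    // Touch the array to make it less likely the JIT
                    // optimizes the allocation away.
                    chunk[ThreadLocalRandom.current().nextInt(chunkSize)] = 1;
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs < 1000) {
                    Thread.sleep(1000 - elapsedMs);     // pace to ~1 GB/s
                }
            }
        }
    }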
This is a log entry when running with the above settings. This entry is
typical of all the CMS collections in this test.
2014-01-13T09:31:52.504+0100: 661.675: [GC [1 CMS-initial-mark:
524986K(1048576K)] 526507K(2096192K), 0.0023550 secs] [Times: user=0.00
sys=0.00, real=0.01 secs]
2014-01-13T09:31:52.506+0100: 661.677:
[CMS-concurrent-mark-start]
2014-01-13T09:31:52.644+0100: 661.815: [CMS-concurrent-mark: 0.138/0.138
secs] [Times: user=1.96 sys=0.11, real=0.13 secs]
2014-01-13T09:31:52.644+0100: 661.815: [CMS-concurrent-preclean-start]
2014-01-13T09:31:52.655+0100: 661.826: [CMS-concurrent-preclean:
0.010/0.011 secs] [Times: user=0.14 sys=0.02, real=0.02 secs]
2014-01-13T09:31:52.655+0100: 661.826:
[CMS-concurrent-abortable-preclean-start]
2014-01-13T09:31:53.584+0100: 662.755: [GC 662.755: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 1046656K->0K(1047616K), 0.0039870 secs] 1571642K->525579K(2096192K),
0.0043310 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2014-01-13T09:31:54.146+0100: 663.317: [CMS-concurrent-abortable-preclean:
0.831/1.491 secs] [Times: user=16.76 sys=1.54, real=1.49 secs]
2014-01-13T09:31:54.148+0100: 663.319: [GC[YG occupancy: 552670 K (1047616
K)]663.319: [Rescan (parallel) , 0.2000060 secs]663.519: [weak refs
processing, 0.0008740 secs]663.520: [scrub string table, 0.0006940 secs] [1
CMS-remark: 525579K(1048576K)] 1078249K(2096192K), 0.2017690 secs] [Times:
user=3.53 sys=0.01, real=0.20 secs]
2014-01-13T09:31:54.350+0100: 663.521:
[CMS-concurrent-sweep-start]
2014-01-13T09:31:54.846+0100: 664.017: [GC 664.017: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 1046656K->0K(1047616K), 0.0033500 secs] 1330075K->284041K(2096192K),
0.0034660 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
2014-01-13T09:31:55.020+0100: 664.191: [CMS-concurrent-sweep: 0.665/0.670
secs] [Times: user=7.77 sys=0.71, real=0.67 secs]
2014-01-13T09:31:55.020+0100: 664.191: [CMS-concurrent-reset-start]
2014-01-13T09:31:55.023+0100: 664.194: [CMS-concurrent-reset: 0.003/0.003
secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
The initial pause is fine. Then I investigated how to reduce the remark
phase, and activated -XX:+CMSScavengeBeforeRemark. That flag partly solves
the issue (it is not active in the log above), but I've seen cases where it
does not scavenge (I suspect a JNI critical section), which is bad and yet
again generates a long remark pause. And once again the pause is correlated
with the occupancy of the young generation.
So instead, I tried setting...
-XX:CMSScheduleRemarkEdenPenetration=0
-XX:CMSScheduleRemarkEdenSizeThreshold=0
This is a log entry with the settings at the top plus the two above...
2014-01-13T10:18:25.757+0100: 590.198: [GC [1 CMS-initial-mark:
524654K(1048576K)] 526646K(2096192K), 0.0029130 secs] [Times: user=0.00
sys=0.00, real=0.01 secs]
2014-01-13T10:18:25.760+0100: 590.201:
[CMS-concurrent-mark-start]
2014-01-13T10:18:25.904+0100: 590.345: [CMS-concurrent-mark: 0.144/0.144
secs] [Times: user=1.98 sys=0.15, real=0.14 secs]
2014-01-13T10:18:25.904+0100: 590.346: [CMS-concurrent-preclean-start]
2014-01-13T10:18:25.912+0100: 590.354: [CMS-concurrent-preclean:
0.008/0.008 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
2014-01-13T10:18:25.912+0100: 590.354:
[CMS-concurrent-abortable-preclean-start]
2014-01-13T10:18:26.836+0100: 591.278: [GC 591.278: [ParNew
Desired survivor size 491520 bytes, new threshold 0 (max 0)
: 1046656K->0K(1047616K), 0.0048160 secs] 1571310K->525477K(2096192K),
0.0049240 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2014-01-13T10:18:26.842+0100: 591.283: [CMS-concurrent-abortable-preclean:
0.608/0.929 secs] [Times: user=10.77 sys=0.97, real=0.93 secs]
2014-01-13T10:18:26.843+0100: 591.285: [GC[YG occupancy: 20938 K (1047616
K)]591.285: [Rescan (parallel) , 0.0024770 secs]591.287: [weak refs
processing, 0.0007760 secs]591.288: [scrub string table, 0.0006440 secs] [1
CMS-remark: 525477K(1048576K)] 546415K(2096192K), 0.0040480 secs] [Times:
user=0.03 sys=0.00, real=0.00 secs]
2014-01-13T10:18:26.848+0100: 591.289:
[CMS-concurrent-sweep-start]
2014-01-13T10:18:27.573+0100: 592.015: [CMS-concurrent-sweep: 0.726/0.726
secs] [Times: user=8.50 sys=0.76, real=0.73 secs]
2014-01-13T10:18:27.573+0100: 592.015: [CMS-concurrent-reset-start]
2014-01-13T10:18:27.576+0100: 592.017: [CMS-concurrent-reset: 0.003/0.003
secs] [Times: user=0.03 sys=0.01, real=0.00 secs]
This means that when I set these two, the CMS STW pauses go from ~200ms to below 10ms.
I'm leaning towards activating...
-XX:CMSScheduleRemarkEdenPenetration=0
-XX:CMSScheduleRemarkEdenSizeThreshold=0
-XX:CMSMaxAbortablePrecleanTime=30000
What I have seen with these flags is that as soon as the young generation is
completely collected during abortable preclean, the remark is scheduled, and
since it can start when eden is nearly empty, it is ridiculously fast. If it
takes a long time for the preclean to catch a young collection, that is also
fine, because no promotion is being made in the meantime. We can live with
the pause of a young collection plus a consecutive remark (for us, a young
collection is ~10ms).
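As a side note on how I got the ~200ms vs. below-10ms numbers: I simply pull
the total remark pause out of the GC log with a small sketch of my own (the
regex is tuned to the PrintGCDetails format shown above, nothing official):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RemarkPauses {
        // Matches the total pause (", <secs> secs]") that follows the
        // "CMS-remark" marker on a remark line in the GC log.
        private static final Pattern PAUSE =
            Pattern.compile("CMS-remark.*?, ([0-9.]+) secs\\]");

        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = PAUSE.matcher(line);
                    if (m.find()) {
                        System.out.printf("remark pause: %.1f ms%n",
                                Double.parseDouble(m.group(1)) * 1000.0);
                    }
                }
            } finally {
                in.close();
            }
        }
    }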
So, to the question - are there any obvious drawbacks with the three
settings above? Why does eden have to be 50% full (the default) in order for
a remark to be scheduled (besides spreading out the pauses)? It only seems
to do harm. Any reason?
-XX:+CMSScavengeBeforeRemark I'm thinking of avoiding, since it can't be
completely trusted. Usually it helps, but that is not good enough, since the
pauses get irregular when it fails. And with the settings above, it would
only add to the CMS pause.
Best Regards,
Gustav Åkesson