GC_Locker turning CMS into a stop-the-world collector
Chi Ho Kwok
chkwok at digibites.nl
Sun Jan 17 07:29:57 PST 2010
Hi Ramki,
On Sun, Jan 17, 2010 at 9:45 AM, Y. Srinivas Ramakrishna
<Y.S.Ramakrishna at sun.com> wrote:
> Hi Chi Ho --
>
> Anyway, a test case and the info I requested above should help us
> diagnose the actual problem you are running into, speedily, and
> thence find a workaround or fix.
I've managed to create a test case. Sorry for the size though; it's a
tricky problem to reproduce, so the test case is kinda large at 280
lines. The basic idea: two threads run concurrently, putting data
into and getting it back out of a direct byte buffer, which we know
will invoke lock_critical. There's also a memory consumer thread to
keep CMS busy, but it's throttled, sleeping 100ms between simulated
file loads. The last thread monitors the rest and warns when
something unusual happens.
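In case the attached file doesn't make it through, the buffer thread
is shaped roughly like this (a minimal sketch; the class name,
counter, and buffer sizes here are my own, not necessarily what's in
the attachment). The bulk put/get on a direct buffer is the part that
goes through a JNI critical section and takes the GC_locker:

import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

class BufferWorker extends Thread {
    // Read and reset by the stats thread once per sampling period.
    static final AtomicLong OPS = new AtomicLong();

    private final ByteBuffer direct = ByteBuffer.allocateDirect(2 * 1024 * 1024);
    private final byte[] heapArray = new byte[2 * 1024 * 1024];

    @Override
    public void run() {
        while (!isInterrupted()) {
            direct.clear();
            direct.put(heapArray);  // bulk copy, heap array -> direct buffer
            direct.flip();
            direct.get(heapArray);  // and back out again
            OPS.incrementAndGet();  // one "direct buffer operation"
        }
    }
}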
On my system, this normally produces lines like "Mem: 0, executed
516 direct buffer operations, 2 load operations." - good for copying
about 2GB/second.
But shortly after starting this stress test, the stats thread starts
writing to System.err: when the number of direct buffer ops in a
period is zero, it marks the current time to remember when the
'stop-the-world-for-nio' phase started, and on later periods prints
how long it has been blocked (see the sketch after the log below).
This generates logs like:
Situation normal:
Mem: 109, executed 643 direct buffer operations, 2 load operations.
11.780: [GC 11.780: [ParNew: 196608K->65536K(196608K), 0.1822832 secs]
399349K->401204K(2031616K), 0.1823253 secs] [Times: user=0.28
sys=0.05, real=0.18 secs]
Mem: 183, executed 335 direct buffer operations, 1 load operations.
11.963: [GC [1 CMS-initial-mark: 335668K(1835008K)] 403551K(2031616K),
0.0126970 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
11.976: [CMS-concurrent-mark-start]
12.127: [CMS-concurrent-mark: 0.151/0.151 secs] [Times: user=0.16
sys=0.00, real=0.15 secs]
12.127: [CMS-concurrent-preclean-start]
12.132: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00
sys=0.00, real=0.00 secs]
12.132: [CMS-concurrent-abortable-preclean-start]
Blockage started:
~~ DirectByteBuffer.get or put calls are blocked, started block timer ~~*
Mem: 183, executed 0 direct buffer operations, 3 load operations.
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 250ms ~~
Mem: 183, executed 0 direct buffer operations, 2 load operations.
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 500ms ~~
Mem: 183, executed 0 direct buffer operations, 3 load operations.
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 750ms ~~
Mem: 183, executed 0 direct buffer operations, 2 load operations.
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 1000ms ~~
Mem: 183, executed 0 direct buffer operations, 3 load operations.
Mem: 183, executed 0 direct buffer operations, 2 load operations.
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 1250ms ~~
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 1500ms ~~
Mem: 183, executed 0 direct buffer operations, 3 load operations.
13.931: [CMS-concurrent-abortable-preclean: 0.304/1.799 secs] [Times:
user=0.31 sys=0.00, real=1.80 secs]
13.932: [GC[YG occupancy: 137843 K (196608 K)]13.932: [Rescan
(parallel) , 0.0134340 secs]13.945: [weak refs processing, 0.0000059
secs] [1 CMS-remark: 335668K(1835008K)] 473512K(2031616K), 0.0135139
secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
13.946: [CMS-concurrent-sweep-start]
Mem: 183, executed 0 direct buffer operations, 2 load operations.
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693
for 1750ms ~~
13.988: [CMS-concurrent-sweep: 0.043/0.043 secs] [Times: user=0.03
sys=0.00, real=0.04 secs]
13.988: [CMS-concurrent-reset-start]
13.997: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.03
sys=0.00, real=0.01 secs]
*: Eclipse doesn't interleave stdout and stderr perfectly; the lines
with blocked messages (the ~~'s) are written to stderr.
And back to normal again.
Mem: 183, executed 524 direct buffer operations, 2 load operations.
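For reference, the watchdog that produces the ~~ lines is shaped
roughly like this (again a sketch built against the BufferWorker
above, not the exact code from the attachment; the real stats line
also folds in the consumer's cache size and load count):

class StatsThread extends Thread {
    @Override
    public void run() {
        long blockedSince = 0;  // 0 means "not currently blocked"
        while (true) {
            try { Thread.sleep(250); } catch (InterruptedException e) { return; }
            long ops = BufferWorker.OPS.getAndSet(0);  // ops in the last period
            if (ops == 0) {
                long now = System.currentTimeMillis();
                if (blockedSince == 0) {
                    blockedSince = now;
                    System.err.println("~~ DirectByteBuffer.get or put calls are"
                            + " blocked, started block timer ~~");
                } else {
                    System.err.println("~~ DirectByteBuffer.get or put calls are"
                            + " blocked since " + blockedSince
                            + " for " + (now - blockedSince) + "ms ~~");
                }
            } else {
                blockedSince = 0;
            }
            // The real line also prints the consumer's cache size ("Mem:")
            // and the number of load operations in the period.
            System.out.println("Executed " + ops + " direct buffer operations.");
        }
    }
}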
Note that the threads without NIO calls continue as normal, but the
buffer operations are stuck for the whole CMS collection. That is
bad, because it means SocketChannelImpl.read/write can be stuck in a
web server for a long, long time. CPU utilization is spread across
several threads but doesn't approach 50% on this dual-core system,
so there's no CPU time starvation.
The test case is attached. VM parameters: -ea -server -Xms2G -Xmx2G
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:MaxNewSize=256m
-XX:NewSize=256m -XX:+PrintGCDetails -XX:+PrintGC
-XX:CMSInitiatingOccupancyFraction=80
-XX:SurvivorRatio=2 -XX:+ExplicitGCInvokesConcurrent
-XX:+PrintGCTimeStamps - basically what we run in prod, with a
smaller heap.
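Spelled out as a full launch line (main class name assumed from the
attached file's name):

java -ea -server -Xms2G -Xmx2G -XX:+UseCompressedOops \
    -XX:+UseConcMarkSweepGC -XX:MaxNewSize=256m -XX:NewSize=256m \
    -XX:+PrintGCDetails -XX:+PrintGC \
    -XX:CMSInitiatingOccupancyFraction=80 -XX:SurvivorRatio=2 \
    -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCTimeStamps GCLockerTest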
If you use different heap sizes, you may have to tune GCStatus's
parameters; it's a quick hack, so the cache size isn't really stable.
In production it's meant to LRU out 20% of the cache, then wait for
old gen usage to drop before it can throw out another 20% (roughly as
sketched below). What I don't get is why CMS is already active at low
heap usage - it started collecting pretty much continuously right at
the start of the test. A side effect of GC-locker-initiated GC?
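For completeness, the consumer is shaped something like this (a
sketch of the intent, not the attached code; the pool-name match,
threshold, and chunk size are assumptions):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.ArrayDeque;
import java.util.Deque;

class MemoryConsumer extends Thread {
    private final Deque<byte[]> cache = new ArrayDeque<byte[]>();  // oldest first
    private final MemoryPoolMXBean oldGen = findOldGen();

    private static MemoryPoolMXBean findOldGen() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Old Gen")) return pool;  // "CMS Old Gen"
        }
        throw new IllegalStateException("no old gen pool found");
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            cache.add(new byte[8 * 1024 * 1024]);  // one simulated file load
            long used = oldGen.getUsage().getUsed();
            long max = oldGen.getUsage().getMax();
            if (used > max * 0.75) {
                // LRU out ~20% of the cache; the real version also waits for
                // old gen usage to drop before evicting the next 20%.
                for (int i = cache.size() / 5; i > 0; i--) cache.removeFirst();
            }
            try { Thread.sleep(100); } catch (InterruptedException e) { return; }
        }
    }
}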
(also CC-ed directly because I'm not sure the attached file will get through)
Chi Ho Kwok
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GCLockerTest.java
Type: application/octet-stream
Size: 8244 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100117/498c2019/attachment-0001.obj