GC_Locker turning CMS into a stop-the-world collector

Chi Ho Kwok chkwok at digibites.nl
Sun Jan 17 05:22:18 PST 2010


Hi Ramki,

On Sun, Jan 17, 2010 at 9:45 AM, Y. Srinivas Ramakrishna
<Y.S.Ramakrishna at sun.com> wrote:
> Hi Chi Ho --
>
> What's the version of the JDK you are using? Could you share a
> GC log using the JVM options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,
> a list of the full set of JVM options you used, and the periods when you see
> the
> long stalls. The intent of the design is that full gc's be avoided
> as much as possible and certainly not stop-world full gc's when using
> CMS except under very special circumstances. Perhaps looking at the
> shape of yr heap and its occupancy when you see the stop-world
> full gc happen that causes the long stall will allow us to see if
> you are occasionally running into that situation, and suggest
> a suitable workaround, or a fix within the JVM.

JDK 6u17 at the moment. Full options are: -ea -server -Xms16G -Xmx16G
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:MaxNewSize=768m
-XX:NewSize=768m -Xloggc:/var/log/wol-server/gc.log
-XX:+PrintGCDetails -XX:+PrintGC -XX:CMSInitiatingOccupancyFraction=78
-XX:SurvivorRatio=2 -XX:+ExplicitGCInvokesConcurrent

We've been running with -XX:+PrintGCDetails -XX:+PrintGC
-XX:+ExplicitGCInvokesConcurrent and logging to a file, but sadly I
didn't think of saving it - the log is overwritten every time the
service is restarted. I have checked, though: there are no concurrent
mode failures (we adjusted the CMS threshold down every time that
happened in the past), just "weird" stalls without an apparent cause.
I've only just found and added -XX:+PrintGCTimeStamps; with timestamps
only relative to JVM start it's hard to locate a specific event,
especially once the timestamp runs to six digits...

There's even a "watchdog" thread that guards against this: it wakes up
every 200ms and reports whenever it has been stalled, so any real
stop-the-world collection will show up in the error log.

Since the switch from nio to normal sockets, the Nagios log has been
clean. It used to be full of one-line high-response-time service
alerts, each followed by an "OK in 0.0001s" message.

> That having been said, we know of a different problem that can, with
> the current design cause what appear to be stalls (but are really
> slow-path allocations which can be quite slow) when Eden has been
> exhausted and JNI critical sections are held. There may be ways of
> avoiding that situation by a suitable small tweak of the current
> design for managing the interaction of JNI critical sections
> and GC.
>
> Anyway, a test case and the info I requested above should help us
> diagnose the actual problem you are running into, speedily, and
> thence find a workaround or fix.

I'll try my best to make a small test case. It seems that copying into /
out of direct buffers does a lock_critical, so I'll have multiple
threads doing that constantly while holding around 2G of data in the
old gen, plus one thread allocating random objects and holding onto
them to simulate garbage production, and report whenever the
buffer-copying threads are stalled for longer than $number ms per
buffer.get(). I'll have some time later today to try this.
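Something along these lines - a rough sketch of that plan, assuming (as
described above) that bulk copies out of a direct buffer take the JNI
critical-section (GC_locker) path; the thread count, buffer sizes and
stall threshold are placeholders rather than values from the real app:

import java.nio.ByteBuffer;
import java.util.Random;

public class GcLockerStallTest {
    static final int COPY_THREADS = 4;
    static final long STALL_THRESHOLD_MS = 100;        // stands in for "$number ms" above
    static final byte[][] RETAINED = new byte[2048][]; // ~2G kept live in the old gen (2048 x 1MB)
    static volatile byte sink;                         // keeps the garbage producer from being optimized away

    public static void main(String[] args) {
        for (int i = 0; i < RETAINED.length; i++) {
            RETAINED[i] = new byte[1024 * 1024];
        }

        // Copier threads: constantly bulk-copy out of a direct buffer and time each get().
        for (int i = 0; i < COPY_THREADS; i++) {
            new Thread(new Runnable() {
                public void run() {
                    ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);
                    byte[] dst = new byte[64 * 1024];
                    while (true) {
                        direct.clear();
                        long start = System.nanoTime();
                        direct.get(dst);
                        long ms = (System.nanoTime() - start) / 1000000L;
                        if (ms > STALL_THRESHOLD_MS) {
                            System.err.println(Thread.currentThread().getName()
                                    + " stalled " + ms + " ms in buffer.get()");
                        }
                    }
                }
            }, "copier-" + i).start();
        }

        // One thread producing transient garbage so the young gen fills up and
        // scavenges get requested while critical sections may be held.
        new Thread(new Runnable() {
            public void run() {
                Random rnd = new Random();
                while (true) {
                    byte[] junk = new byte[rnd.nextInt(64 * 1024) + 1];
                    sink = junk[rnd.nextInt(junk.length)];
                }
            }
        }, "garbage-producer").start();
    }
}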

The app is kind of weird in how it uses memory: it loads many large
datasets (up to 50MB of RAM each) into the heap, keeps them there, and
LRUs 20% of them out when old gen occupancy gets near the CMS
threshold; we use the MXBeans to measure that. While loading a data
set, it generates another 50MB of transient garbage, which shouldn't
leave eden. It's the number-crunching service powering
worldoflogs.com.
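The occupancy check is roughly the following, assuming we pick the pool
whose name contains "Old Gen" ("CMS Old Gen" under
-XX:+UseConcMarkSweepGC) and reusing the 78% figure from
CMSInitiatingOccupancyFraction above; the class and method names here
are made up for illustration:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public final class OldGenMonitor {
    private static final double EVICT_FRACTION = 0.78; // mirrors CMSInitiatingOccupancyFraction=78

    // Returns true when old gen occupancy is near the CMS initiating
    // threshold, i.e. when the app should start LRU-evicting cached datasets.
    public static boolean nearCmsThreshold() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Old Gen")) {
                MemoryUsage usage = pool.getUsage();
                // getMax() is well defined here because -Xmx is set explicitly.
                return usage.getUsed() >= (long) (usage.getMax() * EVICT_FRACTION);
            }
        }
        return false; // pool not found; treat as "not near the threshold"
    }
}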


Chi Ho Kwok

