GC_Locker turning CMS into a stop-the-world collector

Y. Srinivas Ramakrishna Y.S.Ramakrishna at Sun.COM
Sun Jan 17 08:45:22 UTC 2010


Hi Chi Ho --

What version of the JDK are you using? Could you share a GC log
taken with the JVM options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,
the full set of JVM options you used, and the times at which you see
the long stalls? The intent of the design is that full GCs be avoided
as much as possible, and certainly that stop-the-world full GCs not
happen with CMS except under very special circumstances. Looking at the
shape of your heap and its occupancy when the stop-the-world full GC
that causes the long stall occurs should let us see whether you are
occasionally running into that situation, and suggest a suitable
workaround or a fix within the JVM.
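
For example, something like the following (the main class and the rest
of the options are placeholders - adjust them for your own setup;
-Xloggc just sends the GC log to a file):

   java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
        <your existing options> YourServerMain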

That having been said, we do know of a different problem that, with
the current design, can cause what appear to be stalls (but are really
slow-path allocations, which can be quite slow) when Eden has been
exhausted and JNI critical sections are held. There may be ways of
avoiding that situation with a suitable small tweak to the current
design for managing the interaction of JNI critical sections and GC.

Anyway, a test case and the information I requested above should help
us diagnose the actual problem you are running into quickly, and from
there find a workaround or fix.

-- ramki

Chi Ho Kwok wrote:
> Hmm, where to begin...
> 
> Today I started tracing a weird, periodic performance problem with a
> Jetty-based server. The app uses quite a bit of memory, so it's
> running with the CMS collector and a 16G heap. We've noticed some
> weird, random 20-30 second latency spikes occurring about once an
> hour, so I've added some diagnostics and, most importantly, a way to
> call Thread.getAllStackTraces() remotely - and captured a live trace
> of the 'bug'.
> 
> So I've got almost every thread (~60) on the system blocked on
>       at java.nio.Bits.copyToByteArray(Bits.java:?)
>       at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224)
>       at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191)
>       at sun.nio.ch.IOUtil.read(IOUtil.java:209)
>       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>       at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131)
>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290)
>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405)
>       at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>       at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> 
> And a few other threads are blocked via different paths on copyTo/FromByteArray.
> 
> What's happening?!
> 
> Looking through the code and tracing from copyToByteArray to
> jni_GetPrimitiveArrayCritical, the call to GC_locker::lock_critical(thread)
> in jni.cpp:2546 looks like the cause of all evil. But the bug doesn't
> trigger on every GC, so this alone won't cause everything to get
> stuck. We have a minor collection every few seconds and a major one
> every minute, but the problem is much rarer: it occurs about once per
> hour, and can be absent for up to half a day when you're lucky. It
> also has the annoying tendency to strike when the server is unusually
> busy.
> 
> To understand what the heck is happening, I've opened
> gcLocker.cpp/hpp/.inline.hpp and tried to work out the exact
> conditions under which this can happen. lock_critical() has two paths:
> a fast path when there's enough memory and needs_gc() returns false,
> and a slow path which *blocks the thread* on JNICritical_lock->wait()
> if needs_gc() or _doing_gc is true. unlock_critical() is similar: if
> needs_gc() is set, it calls the _slow() version.
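> 
> Roughly, as far as I can tell, the two paths look something like this
> (a simplified sketch pieced together from my reading - the real code
> in gcLocker.inline.hpp / gcLocker.cpp carries more state and locking
> detail than shown here):
> 
>   // simplified sketch, not the actual HotSpot source
>   void GC_locker::lock_critical(JavaThread* thread) {
>     if (needs_gc() || _doing_gc) {
>       // slow path: block until the GC_locker-induced collection is over
>       MutexLocker ml(JNICritical_lock);   // assumption: wait() needs the lock held
>       while (needs_gc() || _doing_gc) {
>         JNICritical_lock->wait();
>       }
>     }
>     _jni_lock_count++;                    // fast path: just count the critical section
>   }
> 
>   void GC_locker::unlock_critical(JavaThread* thread) {
>     if (needs_gc()) {
>       jni_unlock_slow(thread);            // see the sketch after the scenario below
>       return;
>     }
>     _jni_lock_count--;
>   }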
> 
> All the puzzle pieces are here now; all I need to do is bring them
> together into a scenario (a rough sketch of the slow path follows the
> notes below):
> 
> 1. A thread enters a critical section, calls lock_critical(), uses the fast path
> 2. needs_gc() goes from false to true because old generation
> utilization is above the collection threshold. The CMS background
> collection won't start yet, because GC_Locker's _jni_lock_count is >
> 0.*
> 3. Thread exits critical section. needs_gc() is true, so it uses
> jni_unlock_slow()
> 4. In jni_unlock_slow(), this is the last thread out. That will set
> _doing_gc to true, and block all future attempts on lock_critical()
> until that flag is cleared **
> 5. Do a full collection by calling
> Universe::heap()->collect(GCCause::_gc_locker). This takes "forever" ***
> 6. Set _doing_gc to false again via clear_needs_gc()
> 7. Rejoice! JNICritical_lock->notify_all() is called! And watch the
> load go to 30+ because the server has been doing nothing for ages.
> 
> *: speculation. Didn't actually find the line calling set_needs_gc in
> vm/gc_implementation/*, but I assume it's set from there, somewhere.
> **: I've seen a few threads that were blocked on a copyToByteArray in
> a MappedByteBuffer that has been .load()-ed, so it must be a lock. A
> simple memcpy cannot take this long.
> ***: speculation. Didn't actually measure the time, but because every
> single thread in the system is trapped in JNICritical_lock->wait(), I
> assume this part takes forever.
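> 
> A rough sketch of that last-thread-out slow path, as I read it (again
> simplified and partly guessed - the ordering and locking details in
> the real jni_unlock_slow() may well differ):
> 
>   // simplified sketch of jni_unlock_slow(), not the actual HotSpot source
>   void GC_locker::jni_unlock_slow(JavaThread* thread) {
>     MutexLocker ml(JNICritical_lock);          // assumption: taken here
>     _jni_lock_count--;
>     if (_jni_lock_count == 0 && needs_gc()) {  // last thread out (step 4)
>       _doing_gc = true;                        // block all new lock_critical() entries
>       // step 5: the expensive part - a full collection on behalf of GC_locker
>       Universe::heap()->collect(GCCause::_gc_locker);
>       _doing_gc = false;                       // step 6
>       clear_needs_gc();
>       JNICritical_lock->notify_all();          // step 7: wake up everyone who piled up
>     }
>   }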
> 
> 
> So the conclusion is: using a channel to handle HTTP requests combined
> with a large heap is suicidal at the moment. Every channel.read() call
> will block when GC_Locker initiates a CMS collection, stalling every
> single request until the foreground collection is done. That's because
> SocketChannelImpl.read() calls copyToByteArray internally, and that
> call blocks during a GC_Locker-initiated CMS collection.
> 
> 
> I'm just a lowly Java programmer, so I've no idea how to fix this. My
> workaround for now is simply to stop using a SelectChannelConnector and
> switch back to a ServerSocketConnector in Jetty, praying that the
> other NIO calls I have in the code (MappedByteBuffer.read() is
> semi-frequently used) won't turn CMS into a very slow, almost serial
> stop-the-world collector.
> 
> 
> I kinda doubt that this is intended, so should I file this as a bug?
> I've already (mis)filed it as #1695140, but the analysis there is
> wrong, blaming it on "some kind of spinlock?", and since I thought it
> was unrelated to GC, no JVM arguments were attached. Well, it is a
> lock, but it's quite a bit more complex than I assumed.
> 
> 
> Any help would be appreciated. I've got no experience at all in
> hacking JVMs, but I can try to create a test case that triggers this
> situation manually and try different builds of the JVM.
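> 
> Something like this native helper might do it (a hypothetical,
> untested sketch - class and method names are made up; the Java side
> would call it from many threads while other threads allocate enough
> to push the old generation over the CMS initiating threshold):
> 
>   // hypothetical JNI reproducer, native side only
>   #include <jni.h>
> 
>   extern "C" JNIEXPORT void JNICALL
>   Java_GCLockerTest_holdCritical(JNIEnv* env, jobject, jbyteArray buf, jint iters) {
>     // enter a JNI critical section, like copyToByteArray does internally
>     void* p = env->GetPrimitiveArrayCritical(buf, NULL);
>     if (p != NULL) {
>       // stay inside the critical section for a while so a concurrent
>       // allocation failure has a chance to set needs_gc(); spin rather
>       // than sleep, since blocking inside a critical section is not
>       // allowed by the JNI spec
>       volatile jlong sink = 0;
>       for (jint i = 0; i < iters; i++) sink += i;
>       env->ReleasePrimitiveArrayCritical(buf, p, 0);
>     }
>   }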
> 
> 
> Attached is the full stack trace of all threads during the 'lag
> spike'. Note that the stack dump is triggered over HTTP using java.io
> streams; that path kept working while all requests on a channel were
> blocked. "InstrumentedConnector" is just a simple subclass of the
> default Jetty SelectChannelConnector, with extra stats to help figure
> out what's happening.
> 
> 
> Chi Ho Kwok
> 
> 

_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use


