the amazing tales of the search for the invisible man! or, where's my gc root
Jed Wesley-Smith
jed at atlassian.com
Tue Apr 21 01:49:22 UTC 2009
kirk,
To be clear, my tests have shown the following:
MacOSX (64 bit)
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)
256MB default GC: starts at 8 secs per test, then goes to 30 secs, then
60+, then throws OOME after approx 50 runs
256MB CMS GC: does not throw OOME but slows down 10x and spends 80%+ of
its time in GC after only a few runs
256MB Parallel GC: throws OOME after approx 40 runs
Linux: (quite old java version - 32 bit)
java version "1.6.0"
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
128MB default GC: does not throw OOME and does not slow down, but is
slow from the beginning (30 secs per test)
IBM JDK (32 bit):
java version "1.6.0"
Java(TM) SE Runtime Environment (build pxi3260sr4-20090219_01(SR4))
IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux x86-32
jvmxi3260-20090215_29883 (JIT enabled, AOT enabled)
J9VM - 20090215_029883_lHdSMr
JIT - r9_20090213_2028
GC - 20090213_AA)
JCL - 20090218_01
128MB default GC: does not throw OOME and does not slow down.
Observations:
* There are definite tipping points at which the GC suddenly starts
working a lot harder in the Sun JVM tests and the performance becomes
significantly worse.
* All heap dumps from the Sun JVMs contain large sections of heap with
no discoverable heap root. All are produced by
-XX:+HeapDumpOnOutOfMemoryError or after pressing the big red Full GC
button plenty of times.
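
(For reference, the dumps above come from runs launched roughly like
this - the heap size varies per test, and the dump path and pid below
are placeholders rather than our real settings:

    java -Xmx256m -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/dumps ...

with the manual dumps taken via

    jmap -dump:format=b,file=/tmp/dumps/manual.hprof <pid>

after hammering the profiler's Full GC button.)
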
cheers,
jed.
kirk wrote:
> Ok, new information. You are using CMS. By default, CMS doesn't
> collect perm-space. The class-loader related bits that I saw would be
> in perm-space. IBM heap is structured very differently and it doesn't
> contain a perm-space. Your confirmation of no leak in the IBM JVM
> points directly to a perm-space leak. Read Frank Kieviet's blogs about perm
> space.
> http://blogs.sun.com/fkieviet/entry/classloader_leaks_the_dreaded_java
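>
> For anyone skimming, here is a minimal sketch of the pattern Frank
> describes. The class and field names are invented for the
> illustration: a long-lived class, loaded by the core classloader,
> holds a static reference to an instance of a plugin-loaded class,
> which pins the plugin's classloader and therefore all of its classes
> in perm space.
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     // Loaded by the core (parent) classloader; lives for the life of the JVM.
>     public class CoreRegistry {
>         // If a cached value's class was loaded by a plugin classloader, then
>         // value -> value.getClass() -> plugin ClassLoader -> every class that
>         // loader defined, so none of it can ever be unloaded.
>         public static final Map<String, Object> CACHE =
>                 new HashMap<String, Object>();
>     }
>
> Clearing or weak-keying such caches when a plugin is unloaded is the
> usual fix.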
>
> I am on the fence about CMS not collecting perm-space by default and
> perm-space leaks in general. But I can charge nice sums of money
> fixing them and look like Scotty in the process, so I shan't bite the
> hand that feeds... ;-) In your case I'll just be looking for more beer!
>
> I was chatting to Tony about the number of int[] arrays in the
> profile. He mentioned that holes in the heap may be represented as
> int[] with no GC root. This can be a bit confusing at first glance.
>
> JiraPluginManager and DefaultPicoContainer... hmmm. I need to take
> another look.
> OK, for the DefaultPicoContainer I checked the first 10 of 89
> instances and was able to locate a GC root for each of them. The first
> instance has one GC root as a Java frame. Tracing back you get right
> into apache commons classloaders once again. There are a number of
> weak links but, more importantly, there are a number of array slots
> that also keep a grip on these objects. Strangely enough, the first
> one that I randomly selected had no GC root. IMHO, this object should
> be collectable, and since it hasn't been, there are two possibilities:
> it is either in perm space, or it is a dead object that
> wasn't collected prior to the heap dump being taken. A heap dump
> contains *all* objects, both dead and alive. Best to run GC twice
> prior to dumping heap.
>
> 8 instances of JiraPluginManager. 6 have no GC roots and should be
> collectable. The first instance is listed as a Java frame and the last
> is tied up with Apache commons classloading.
>
> See if you can set the CMS perm-space sweeping flag, then call
> System.gc() twice, re-take a heap dump and send it. I'll have
> another look.
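>
> Something along these lines - these are the 1.6 HotSpot flag spellings
> as I remember them, so please double-check against your build; the
> dump path and pid are placeholders:
>
>     java -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
>          -XX:+CMSPermGenSweepingEnabled ...
>
> then trigger the two GCs (jconsole's "Perform GC" button or a direct
> System.gc() call) and grab the dump with
>
>     jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
>
> (the live option itself forces another full GC before the dump is
> written).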
>
> Regards,
> Kirk
>
>
> Jed Wesley-Smith wrote:
>> Hi Kirk,
>>
>> Having had a chance to look more into your findings I think we are
>> getting somewhere, but I still have some questions.
>>
>> The java.lang.reflect.Method count definitely looks like a problem,
>> but it is a by-product of one of the things I was using to eliminate
>> (noisy) references - specifically the weak references that were coming
>> from java.lang.reflect.Proxy.loaderToCache and java.lang.reflect.Proxy.
>> Unfortunately removing the caching meant that these proxies get
>> recreated and hence the Method objects get recreated. I removed this
>> code and re-ran the tests and can confirm that the number of Method
>> objects is now much smaller (10k).
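>>
>> For anyone following along, here is a minimal sketch of the caching
>> behaviour we had disabled (the Service interface and handler below
>> are invented for the example): with the JDK's internal per-classloader
>> proxy cache intact, repeated calls reuse the generated proxy class
>> rather than creating a fresh, Method-laden class each time.
>>
>>     import java.lang.reflect.InvocationHandler;
>>     import java.lang.reflect.Method;
>>     import java.lang.reflect.Proxy;
>>
>>     public class ProxyCacheDemo {
>>         interface Service { void ping(); }
>>
>>         public static void main(String[] args) {
>>             InvocationHandler handler = new InvocationHandler() {
>>                 public Object invoke(Object proxy, Method m, Object[] a) {
>>                     return null; // no-op, just for the demonstration
>>                 }
>>             };
>>             ClassLoader loader = Service.class.getClassLoader();
>>             Service p1 = (Service) Proxy.newProxyInstance(
>>                     loader, new Class<?>[] { Service.class }, handler);
>>             Service p2 = (Service) Proxy.newProxyInstance(
>>                     loader, new Class<?>[] { Service.class }, handler);
>>             // Same cached proxy class both times => prints true
>>             System.out.println(p1.getClass() == p2.getClass());
>>         }
>>     }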
>>
>> Of more specific interest to us is the number of objects that have no
>> GC root. I was hoping that the NetBeans analyzer might be able to
>> identify a root that all the other analyzers had failed to find. For
>> instance if I look at the instances of the "JiraPluginManager" I see
>> over a dozen instances (there should normally only be one and never
>> be more than two - while the plugin system is reloading). If I ask
>> the NetBeans profiler (like every other profiler) for the nearest GC
>> root it says that none is found. There should only be about 50MB of
>> reachable heap but there is ~80MB dark heap that will not be GC'd by
>> the Sun JVM. Another interesting case in point is the hundreds of
>> DefaultPicoContainer instances with no GC root. These guys have
>> fairly tight circular reference chains with the various
>> ComponentAdapter instances, but will not die. Also check out the
>> ASTReference instances. These seem to have some of the tightest
>> circular reference chains of all.
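>>
>> Just to rule out one misconception: circular references on their own
>> should not prevent collection with a tracing collector. A minimal
>> sketch (the Node class is made up for the example) showing a cycle
>> being collected once nothing outside it refers to it:
>>
>>     import java.lang.ref.WeakReference;
>>
>>     public class CycleCollectionDemo {
>>         static class Node { Node other; }
>>
>>         public static void main(String[] args) throws Exception {
>>             Node a = new Node();
>>             Node b = new Node();
>>             a.other = b;   // a and b reference each other
>>             b.other = a;
>>             WeakReference<Node> probe = new WeakReference<Node>(a);
>>             a = null;      // drop the only external references
>>             b = null;
>>             System.gc();   // only a hint, but usually enough here
>>             Thread.sleep(100);
>>             // The cycle alone does not keep the pair alive:
>>             System.out.println("collected? " + (probe.get() == null));
>>         }
>>     }
>>
>> So the circularity itself should not be what is keeping these alive.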
>>
>> We have definitely established that there is no memory leak under the
>> IBM JDK. The aforementioned Proxy cache clearing was causing problems
>> with heap growth but once we removed that it went away.
>>
>> Our stab-in-the-dark theory about this is that there are somehow just
>> too many densely circularly-referenced heap chains living in the old
>> gen, and that these are causing the GC algorithm to choke. The fact
>> that the CMS collector helps to some extent does seem to support this
>> theory. Against the theory is the fact that reducing the application's
>> workload to zero and repeatedly asking it to full GC does not release
>> any further heap.
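>>
>> For the record, "repeatedly asking it to full GC" means nothing more
>> exotic than the profiler's Full GC button; programmatically it boils
>> down to something like the sketch below, which is a simplification
>> and not our actual test harness:
>>
>>     import java.lang.management.ManagementFactory;
>>     import java.lang.management.MemoryMXBean;
>>     import java.lang.management.MemoryUsage;
>>
>>     public class ForcedGcProbe {
>>         public static void main(String[] args) throws Exception {
>>             MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
>>             for (int i = 0; i < 5; i++) {
>>                 memory.gc();            // effectively System.gc()
>>                 Thread.sleep(500);
>>                 MemoryUsage heap = memory.getHeapMemoryUsage();
>>                 System.out.println("after GC " + i + ": "
>>                         + (heap.getUsed() / (1024 * 1024)) + " MB used");
>>             }
>>         }
>>     }
>>
>> The used-heap figure stops dropping after the first couple of
>> iterations and never gets near the ~50MB we expect to be reachable.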
>>
>> We have looked at all finalizers and the most that any of them do is
>> to call JarFile.close(). Following is the list of objects that may be
>> finalized:
>>
>> 1 com.sun.crypto.provider.PBEKey
>> 71 java.io.FileInputStream
>> 10 java.io.FileOutputStream
>> 7 java.lang.ClassLoader$NativeLibrary
>> 14 java.net.SocksSocketImpl
>> 22 java.util.Timer$1
>> 1 java.util.concurrent.ScheduledThreadPoolExecutor
>> 2 java.util.concurrent.ThreadPoolExecutor
>> 227 java.util.jar.JarFile
>> 288 java.util.zip.Inflater
>> 143 org.apache.felix.framework.cache.JarContent
>> 2 org.apache.log4j.ConsoleAppender
>> 6 org.apache.log4j.RollingFileAppender
>> 1 org.hsqldb.Database
>> 8 org.hsqldb.jdbc.jdbcConnection
>> 1 org.hsqldb.persist.NIOLockFile
>> 1 org.hsqldb.scriptio.ScriptWriterText
>> 86 sun.net.www.protocol.jar.URLJarFile
>>
>> None of them contain references back to any part of the application.
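>>
>> One detail worth keeping in mind when reading the dumps: objects with
>> finalizers (JarFile, Inflater, URLJarFile above) need at least two
>> collection cycles before their memory is actually reclaimed - the
>> first cycle only queues them for finalization. A minimal sketch, with
>> a made-up Holder class:
>>
>>     public class FinalizerDemo {
>>         static class Holder {
>>             private final byte[] payload = new byte[1024 * 1024]; // ~1MB
>>             protected void finalize() {
>>                 // The first GC only discovers the object and queues it for
>>                 // this method; the payload is reclaimed on a later cycle.
>>             }
>>         }
>>
>>         public static void main(String[] args) throws Exception {
>>             Holder h = new Holder();
>>             h = null;
>>             System.gc();              // cycle 1: queued for finalization
>>             System.runFinalization(); // run the pending finalizer
>>             System.gc();              // cycle 2: memory actually reclaimed
>>             Thread.sleep(100);
>>         }
>>     }
>>
>> That two-cycle behaviour is part of why we force GC more than once
>> before taking a dump.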
>>
>> If anyone can help us understand why the JiraPluginManager and other
>> instances listed above remain under the Sun JVM it would be most
>> appreciated.
>>
>> cheers,
>> jed.
>>
>> kirk wrote:
>>> Hi Jed,
>>>
>>> I've had a quick look at the heap dump. I'm having a little trouble
>>> understanding what is in there. What I can see is a large number of
>>> java.lang.reflect.Method objects being held. There seem to be two
>>> competing patterns of references holding onto these objects. I've
>>> attached some screenshots rather than use words.
>>>
>>> The scary thing is that the references include ClassLoader.scl,
>>> JDK12Hooks.systemClassLoader as well as Apache commons logging
>>> LogFactory. With this type of complex entanglement it would seem
>>> unlikely that these objects would ever be collected. The other
>>> pattern also includes a spider's web of references. It also
>>> includes UberspectImpl and a whole bunch of static collections. IME,
>>> static collections are involved in the vast majority of leaks I've
>>> diagnosed.
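>>>
>>> If the static collections do turn out to be the culprits, the usual
>>> mitigation is to clear the per-classloader entries explicitly when a
>>> plugin's classloader is thrown away. A sketch of the kind of
>>> teardown hook I mean - the dispose method is hypothetical, though
>>> the two library calls are real:
>>>
>>>     import java.beans.Introspector;
>>>     import org.apache.commons.logging.LogFactory;
>>>
>>>     public class PluginTeardown {
>>>         // Called just before the last strong reference to the plugin's
>>>         // classloader is dropped.
>>>         public static void dispose(ClassLoader pluginLoader) {
>>>             // commons-logging keeps a static cache keyed by classloader
>>>             LogFactory.release(pluginLoader);
>>>             // java.beans caches Method objects per class
>>>             Introspector.flushCaches();
>>>         }
>>>     }
>>>
>>> Whether JIRA's stack actually goes through either of those caches is
>>> something you'd have to confirm from the dump.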
>>>
>>> Interestingly enough, a portion of the 2nd largest consumer of
>>> memory is also tangled up in the JDK12Hooks. Random sampling leads
>>> me to AST parse trees and "no reference". Looks like much of this is
>>> tied up with Velocity. In fact the largest consumer of memory at 24%
>>> is char[]. I'm failing to find anything that is not tied up with
>>> Velocity (AST parsing).
>>>
>>> Needs more investigation. It would be interesting to run a test with
>>> generations turned on. The NetBeans generations count is a true
>>> count, unlike that provided by YourKit.
>>>
>>> Regards,
>>> Kirk
>>>
>>>
>>> Jed Wesley-Smith wrote:
>>>> Classes as well. We end up getting an OOME although the profilers
>>>> report only a third of the heap is reachable.
>>>>
>>>> Although I indicated we saw this on the IBM JDK, analysis of that
>>>> dump showed a completely different issue that apparently may not be
>>>> a problem (due to reflection optimisation on that JDK) - the dead
>>>> objects appear to have been correctly cleared. We are reproducing
>>>> this to verify.
>>>>
>>>> Additionally, we tried running with -client on the Sun JVMs, as we
>>>> saw a bug that might have caused it reported against -server only,
>>>> but without success.
>>>>
>>>> cheers,
>>>> jed.
>>>>
>>>> On 16/04/2009, at 12:51 AM, Tony Printezis
>>>> <Antonios.Printezis at sun.com> wrote:
>>>>
>>>>> OK, I'll bite.
>>>>>
>>>>> When you say: "a large section of memory (a plugin framework)" do
>>>>> you mean only objects in the young / old gen, or also classes in
>>>>> the perm gen?
>>>>>
>>>>> How do you know that said memory is not being reclaimed? Do you
>>>>> eventually get an OOM?
>>>>>
>>>>> Given that it happens with two different JVMs (I assume you use
>>>>> HotSpot on Linux and Mac, as well as the IBM JDK), it's unlikely
>>>>> to be a GC bug, as both JVMs would need to have the same bug. Not
>>>>> impossible, but unlikely, IMHO.
>>>>>
>>>>> Tony
>>>>>
>>>>> Jed Wesley-Smith wrote:
>>>>>> all,
>>>>>>
>>>>>> I am writing to this list in some desperation hoping for some
>>>>>> expert advice. We (the JIRA development team at Atlassian) have
>>>>>> been hunting memory leaks for some weeks and in the process have
>>>>>> tracked down and removed every possible reference to a large
>>>>>> section of memory (a plugin framework) that we could find.
>>>>>> We started with all strong references and proceeded to remove soft
>>>>>> and weak references - even things like clearing the
>>>>>> java.lang.reflect.Proxy cache - and even Finalizer references,
>>>>>> until YourKit, Eclipse MAT, JProfiler and jhat all report
>>>>>> that the memory in question is dead and should be collectable,
>>>>>> but inexplicably _the JVM still holds on to it_. There are no JNI
>>>>>> Global references either, yet this memory remains uncollectable!
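>>>>>>
>>>>>> For concreteness, the kind of check this boils down to is a
>>>>>> weak-reference probe along the lines of the sketch below - a
>>>>>> simplified illustration, not code from JIRA:
>>>>>>
>>>>>>     import java.lang.ref.WeakReference;
>>>>>>
>>>>>>     public class LeakProbe {
>>>>>>         // Hold only a weak reference to the disposed plugin
>>>>>>         // classloader, request GC, and see whether it gets cleared.
>>>>>>         public static boolean isCollectable(ClassLoader disposedLoader)
>>>>>>                 throws InterruptedException {
>>>>>>             WeakReference<ClassLoader> ref =
>>>>>>                     new WeakReference<ClassLoader>(disposedLoader);
>>>>>>             disposedLoader = null; // drop our strong reference
>>>>>>             for (int i = 0; i < 5 && ref.get() != null; i++) {
>>>>>>                 System.gc();
>>>>>>                 Thread.sleep(100);
>>>>>>             }
>>>>>>             return ref.get() == null; // true => it was collected
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> In our case, no number of GC requests ever results in this memory
>>>>>> being reclaimed.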
>>>>>>
>>>>>> This happens for the 1.5 and 1.6 JVMs on Linux and Mac, and the
>>>>>> IBM 1.6 JDK on Linux.
>>>>>>
>>>>>> So my question is, how on earth do I search for what is
>>>>>> referencing this uncollectable memory? Are there any other tools
>>>>>> that can help find why this memory is not collected? Can I query
>>>>>> the VM directly somehow?
>>>>>>
>>>>>> I fear this is a JVM GC bug as no known memory analysis tool can
>>>>>> find the heap root (i.e. according to "the rules" there is no
>>>>>> heap root). Are there any known GC memory leaks caused by
>>>>>> ClassLoaders being dropped for instance?
>>>>>>
>>>>>> The application is creating and disposing of a lot of
>>>>>> ClassLoaders via OSGi (Apache Felix) with Spring OSGi. It creates
>>>>>> a lot of java.lang.reflect.Proxy class instances.
>>>>>>
>>>>>> We have written this up and added an example heap dump here:
>>>>>> http://jira.atlassian.com/browse/JRA-16932
>>>>>>
>>>>>> Having come to the end of our tethers here, if anyone can help in
>>>>>> any way it would be massively appreciated.
>>>>>>
>>>>>> cheers,
>>>>>> Jed Wesley-Smith
>>>>>> JIRA Team @ Atlassian
>>
>>
>