Java 8 + Docker container - CMS collector leaves around instances that have no GC roots

Tue Dec 4 03:57:58 UTC 2018

Hello Leo and Jeremy,

Thank you both for pointing me to those docs. They did help - the
ElasticSearch doc, I hadn't seen before during my search and the docs at
redhat.com developer blogs, although I had read them before a while
back, it was still useful since I found some new JVM options that I
could experiment with to get more details for the issue at hand.

Apologies for the delayed response though. I wanted to run a bunch of
experiments with various different configs to figure out what's really
going on, instead of speculating what might be going on.

After using all available/relevant JVM options and tuning the heap max
sizes and the cgroup limits and using OS level commands (that I know
off) to track the resource usage, we still ended up seeing the docker
container hitting the limit and then being killed. All the relevant
tracking tools (jconsole, the native hotspot memory tracking, the
direct/mapped buffer usage, heap usage) all kept showing that the usage
was well below the allowed limits. Yet the cgroups memory.usage_in_bytes
kept increasing over days and ultimately kept hitting the limit set at
cgroups level. At this point, it looked like we were either looking at
the wrong info or we weren't really using the right tools to figure out
what's really consuming this memory. After a bit more searching, we
finally found these issues[1][2] that match exactly to what we are
seeing (right down to the exact version of OS and docker and the nature
of configuration). So it looks like it's a known issue with docker + the
kernel version in use and apparently no known workaround (other than
downgrading to a version of docker that doesn't hit this). There appears
to be a commit[3] that has been done upstream but isn't yet released. We
will evaluate how to either try and patch/test that change or figure out
some other way (may be not set a --memory limit for now) to get past this.

Thank you all again for the helpful replies.

[1] https://github.com/opencontainers/runc/issues/1725

[2] https://github.com/moby/moby/issues/37722
[3]
https://github.com/opencontainers/runc/commit/6a2c15596845f6ff5182e2022f38a65e5dfa88eb

-Jaikiran

On 26/11/18 3:56 PM, jwhiting at redhat.com wrote:
> Hi Jaikiran
>  Have a look at some blog posts by old friends :) These blog posts
> might be helpful (along with the other replies you received) to
> diagnose the root cause of the issue. In particular native memory
> tracking.
>
> https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
> https://developers.redhat.com/blog/2017/04/04/openjdk-and-containers/
>
> Regards,
> Jeremy
>
> On Fri, 2018-11-23 at 19:25 +0530, Jaikiran Pai wrote:
>> Hi,
>>
>> I'm looking for some inputs in debugging a high memory usage issue
>> (and
>> subsequently the process being killed) in one of the applications I
>> deal
>> with. Given that from what I have looked into this issue so far, this
>> appears to be something to do with the CMS collector, so I hope this
>> is
>> the right place to this question.
>>
>> A bit of a background - The application that I'm dealing with is
>> ElasticSearch server version 1.7.5. We use Java 8:
>>
>> java version "1.8.0_172"
>> Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)
>>
>> To add to the complexity in debugging this issue, this runs as a
>> docker
>> container on docker version 18.03.0-ce on a CentOS 7 host VM kernel
>> version 3.10.0-693.5.2.el7.x86_64.
>>
>> We have been noticing that this container/process keeps getting
>> killed
>> by the oom-killer every few days. The dmesg logs suggest that the
>> process has hit the "limits" set on the docker cgroups level. After
>> debugging this over past day or so, I've reached a point where I
>> can't
>> make much sense of the data I'm looking at. The JVM process is
>> started
>> using the following params (of relevance):
>>
>> java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....
>>
>> As you can see it uses CMS collector with 75% of tenured/old gen for
>> initiating the GC.
>>
>> After a few hours/days of running I notice that even though the CMS
>> collector does run almost every hour or so, there are huge number of
>> objects _with no GC roots_ that never get collected. These objects
>> internally seem to hold on to ByteBuffer(s) which (from what I see)
>> as a
>> result never get released and the non-heap memory keeps building up,
>> till the process gets killed. To give an example, here's the jmap
>> -histo
>> output (only relevant parts):
>>
>>    1:        861642      196271400  [B
>>    2:        198776       28623744 
>> org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
>>    3:        676722       21655104 
>> org.apache.lucene.store.ByteArrayDataInput
>>    4:        202398       19430208 
>> org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTerm
>> State
>>    5:        261819       18850968 
>> org.apache.lucene.util.fst.FST$Arc
>>    6:        178661       17018376  [C
>>    7:         31452       16856024  [I
>>    8:        203911        8049352  [J
>>    9:         85700        5484800  java.nio.DirectByteBufferR
>>   10:        168935        5405920 
>> java.util.concurrent.ConcurrentHashMap$Node
>>   11:         89948        5105328  [Ljava.lang.Object;
>>   12:        148514        4752448 
>> org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
>>
>> ....
>>
>> Total       5061244      418712248
>>
>> This above output is without the "live" option. Running jmap
>> -histo:live
>> returns something like (again only relevant parts):
>>
>>   13:         31753        1016096 
>> org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
>>   ...
>>   44:           887         127728 
>> org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
>>   ...
>>   50:          3054          97728 
>> org.apache.lucene.store.ByteArrayDataInput
>>   ...
>>   59:           888          85248 
>> org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTerm
>> State
>>
>>   Total       1177783      138938920
>>
>>
>> Notice the vast difference between the live and non-live instances of
>> the same class. This isn't just in one "snapshot". I have been
>> monitoring this for more than a day and this pattern continues. Even
>> taking heap dumps and using tools like visualvm shows that these
>> instances have "no GC root" and I have even checked the gc log files
>> to
>> see that the CMS collector does occasionally run. However these
>> objects
>> never seem to get collected.
>>
>> I realize this data may not be enough to narrow down the issue, but
>> what
>> I am looking for is some kind of help/input/hints/suggestions on what
>> I
>> should be trying to figure out why these instances aren't GCed. Is
>> this
>> something that's expected in certain situations?
>>
>> -Jaikiran
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use