Java 8 + Docker container - CMS collector leaves around instances that have no GC roots

Jaikiran Pai jai.forums2013 at gmail.com
Fri Nov 23 13:55:30 UTC 2018


Hi,

I'm looking for some inputs in debugging a high memory usage issue (and
subsequently the process being killed) in one of the applications I deal
with. Given that from what I have looked into this issue so far, this
appears to be something to do with the CMS collector, so I hope this is
the right place to this question.

A bit of a background - The application that I'm dealing with is
ElasticSearch server version 1.7.5. We use Java 8:

java version "1.8.0_172"
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)

To add to the complexity in debugging this issue, this runs as a docker
container on docker version 18.03.0-ce on a CentOS 7 host VM kernel
version 3.10.0-693.5.2.el7.x86_64.

We have been noticing that this container/process keeps getting killed
by the oom-killer every few days. The dmesg logs suggest that the
process has hit the "limits" set on the docker cgroups level. After
debugging this over past day or so, I've reached a point where I can't
make much sense of the data I'm looking at. The JVM process is started
using the following params (of relevance):

java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....

As you can see it uses CMS collector with 75% of tenured/old gen for
initiating the GC.

After a few hours/days of running I notice that even though the CMS
collector does run almost every hour or so, there are huge number of
objects _with no GC roots_ that never get collected. These objects
internally seem to hold on to ByteBuffer(s) which (from what I see) as a
result never get released and the non-heap memory keeps building up,
till the process gets killed. To give an example, here's the jmap -histo
output (only relevant parts):

   1:        861642      196271400  [B
   2:        198776       28623744 
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
   3:        676722       21655104 
org.apache.lucene.store.ByteArrayDataInput
   4:        202398       19430208 
org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:        261819       18850968  org.apache.lucene.util.fst.FST$Arc
   6:        178661       17018376  [C
   7:         31452       16856024  [I
   8:        203911        8049352  [J
   9:         85700        5484800  java.nio.DirectByteBufferR
  10:        168935        5405920 
java.util.concurrent.ConcurrentHashMap$Node
  11:         89948        5105328  [Ljava.lang.Object;
  12:        148514        4752448 
org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference

....

Total       5061244      418712248

This above output is without the "live" option. Running jmap -histo:live
returns something like (again only relevant parts):

  13:         31753        1016096 
org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
  ...
  44:           887         127728 
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
  ...
  50:          3054          97728 
org.apache.lucene.store.ByteArrayDataInput
  ...
  59:           888          85248 
org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState

  Total       1177783      138938920


Notice the vast difference between the live and non-live instances of
the same class. This isn't just in one "snapshot". I have been
monitoring this for more than a day and this pattern continues. Even
taking heap dumps and using tools like visualvm shows that these
instances have "no GC root" and I have even checked the gc log files to
see that the CMS collector does occasionally run. However these objects
never seem to get collected.

I realize this data may not be enough to narrow down the issue, but what
I am looking for is some kind of help/input/hints/suggestions on what I
should be trying to figure out why these instances aren't GCed. Is this
something that's expected in certain situations?

-Jaikiran









More information about the hotspot-gc-use mailing list