From java at elyograg.org  Fri Nov 2 21:23:07 2018
From: java at elyograg.org (Shawn Heisey)
Date: Fri, 2 Nov 2018 15:23:07 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
Message-ID: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>

I have a Solr install with the following info on the process list:

solr      4426     1 15 Sep27 ?        5-17:59:51 java -server -Xms7943m -Xmx7943m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/data/solr-home/logs/solr_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=3000 -Dcom.sun.management.jmxremote.rmi.port=3000 -Djava.rmi.server.hostname=RWVM-SolrDev03 -Dsolr.log.dir=/data/solr-home/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=RWVM-SolrDev03 -Duser.timezone=UTC -Djetty.home=/data/solr/server -Dsolr.solr.home=/data/solr-home/data -Dsolr.data.home= -Dsolr.install.dir=/data/solr -Dsolr.default.confdir=/data/solr/server/solr/configsets/_default/conf -Dlog4j.configuration=file:/data/solr-home/log4j.properties -Xss256k -Ddisable.configEdit=true -Xss256k -Dsolr.jetty.https.port=8983 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983 /data/solr-home/logs -jar start.jar --module=http

Process started Sep27, with G1 tuning, but no parameters setting specific generation sizes.

Here's the info at the top of a GC log written by that process:

2018-10-29 05:47:53 GC log file created /data/solr-home/logs/solr_gc.log.6
Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE (1.8.0_162-b12), built on Dec 19 2017 21:15:48 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 16268024k(163252k free), swap 2098172k(2056676k free)
CommandLine flags: -XX:+AggressiveOpts -XX:CICompilerCount=3 -XX:ConcGCThreads=1 -XX:G1HeapRegionSize=8388608 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=8329887744 -XX:InitiatingHeapOccupancyPercent=75 -XX:+ManagementServer -XX:MarkStackSize=4194304 -XX:MaxGCPauseMillis=250 -XX:MaxHeapSize=8329887744 -XX:MaxNewSize=4991221760 -XX:MinHeapDeltaBytes=8388608 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983 /data/solr-home/logs -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages

As you can see, the gc log says that -XX:MaxNewSize is among the parameters.  The problem came to my attention when I put the gc log into the gceasy.io website and it said "don't set your generation sizes explicitly, G1 works better when they aren't explicit."  Which I thought was weird, because I didn't know of anywhere that the sizes were being set.  I don't WANT to set the generation sizes explicitly.
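If it would help narrow things down, I can also query the running process directly with the standard JDK tools that ship with this JDK (jinfo and jcmd) to see what value the VM itself reports for the flag -- I have not captured that output here:

    jinfo -flag MaxNewSize 4426
    jcmd 4426 VM.flags -all | grep -i newsize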
I have confirmed by cross-referencing pids, checking which pid has the logfile open, and watching for new data written to the same log that it is indeed the same process, so I know the commandline arguments from the ps listing are from the right process.

Here's the version info for the java version that's running:

[solr at RWVM-SolrDev03 logs]$ java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

I cannot reproduce the behavior manually on a Windows 10 machine, either with the same version that's on the Linux machine or using 8u191.  The machine where I see this happening is running CentOS 7.  Hopefully I will be in a position to restart the service on the machine in the near future so I can check whether the issue is persistent.

I will attempt to reproduce on another Linux machine.  Does this seem like a bug?

Thanks,
Shawn

From java at elyograg.org  Fri Nov 2 21:48:20 2018
From: java at elyograg.org (Shawn Heisey)
Date: Fri, 2 Nov 2018 15:48:20 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
Message-ID: <7acd72cb-9022-f24d-2b47-e75b7220aad1@elyograg.org>

On 11/2/2018 3:23 PM, Shawn Heisey wrote:
> I will attempt to reproduce on another Linux machine.  Does this seem
> like a bug?

The problem is not reproducing on another Linux machine (Ubuntu 18) with a different version of Java:

Process listing:

solr     10709     1 20 15:44 ?        00:00:15 /usr/lib/jvm/java-8-oracle/bin/java -server -Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/storage0/solr740/logs/solr_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Dsolr.log.dir=/storage0/solr740/logs -Djetty.port=8900 -DSTOP.PORT=7900 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/storage0/solr7/server -Dsolr.solr.home=/storage0/solr740/data -Dsolr.data.home= -Dsolr.install.dir=/storage0/solr7 -Dsolr.default.confdir=/storage0/solr7/server/solr/configsets/_default/conf -Dlog4j.configurationFile=file:/storage0/solr740/log4j2.xml -Xss256k -Dsolr.jetty.https.port=8900 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/storage0/solr7/bin/oom_solr.sh 8900 /storage0/solr740/logs -jar start.jar --module=http

In the GC log:

Java HotSpot(TM) 64-Bit Server VM (25.171-b11) for linux-amd64 JRE (1.8.0_171-b11), built on Mar 28 2018 17:07:08 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 24628836k(9793172k free), swap 8000508k(8000508k free)
CommandLine flags: -XX:+AggressiveOpts -XX:G1HeapRegionSize=8388608 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=3221225472 -XX:InitiatingHeapOccupancyPercent=75 -XX:MaxGCPauseMillis=250 -XX:MaxHeapSize=3221225472 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/storage0/solr7/bin/oom_solr.sh 8900 /storage0/solr740/logs -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages

I don't have a lot of other combinations I can easily try.

Thanks,
Shawn

From java at elyograg.org  Fri Nov 2 22:35:50 2018
From: java at elyograg.org (Shawn Heisey)
Date: Fri, 2 Nov 2018 16:35:50 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
Message-ID: <549777c1-ee66-e362-3c04-09549d628471@elyograg.org>

On 11/2/2018 3:23 PM, Shawn Heisey wrote:
> Process started Sep27, with G1 tuning, but no parameters setting
> specific generation sizes.

Another followup.  Here's the GC logfiles currently found in the logs directory:

-rw-rw-r--. 1 solr solr 20973298 Oct  3 04:08 solr_gc.log.0
-rw-rw-r--. 1 solr solr 20974540 Oct  8 06:33 solr_gc.log.1
-rw-rw-r--. 1 solr solr 20971712 Oct 14 16:22 solr_gc.log.2
-rw-rw-r--. 1 solr solr 20971895 Oct 20 05:42 solr_gc.log.3
-rw-rw-r--. 1 solr solr 20971661 Oct 24 15:18 solr_gc.log.4
-rw-rw-r--. 1 solr solr 20974231 Oct 29 05:47 solr_gc.log.5
-rw-rw-r--. 1 solr solr 17246462 Nov  2 18:15 solr_gc.log.6.current

Only the files numbered 1 through 6 contain the -XX:MaxNewSize parameter.  The info in the previous message was obtained from the last logfile, the one with "current" in the name.

The first logfile, solr_gc.log.0, does NOT contain that parameter.  Here's the top of the logfile and the first log entry, showing consistency between the first log and the info output by the ps command -- but the -XX:MaxNewSize parameter is not present:

Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE (1.8.0_162-b12), built on Dec 19 2017 21:15:48 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 16268024k(8771152k free), swap 2098172k(2098096k free)
CommandLine flags: -XX:+AggressiveOpts -XX:G1HeapRegionSize=8388608 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=8328839168 -XX:InitiatingHeapOccupancyPercent=75 -XX:+ManagementServer -XX:MaxGCPauseMillis=250 -XX:MaxHeapSize=8328839168 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983 /data/solr-home/logs -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages

2018-09-27T01:11:19.136-0400: 0.518: Total time for which application threads were stopped: 0.0001694 seconds, Stopping threads took: 0.0000184 seconds

So that means that in the first GC logfile written by Java, that parameter wasn't there, but then it was added for the rest of the logfiles.  Which is REALLY odd.

My best guess is that G1GC eventually decided on the size of the young generation that it wanted to use, and then added the explicit commandline option for that size to the active arguments.  I had thought that G1 was capable of adjusting the generation sizes on the fly at any time ... but the logs that contain the -XX:MaxNewSize option all have it set to the same value, which is 4991221760 bytes.  It has been set to that value for at least a month.
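For what it's worth, the value itself looks like it could simply be G1's own ergonomic ceiling rather than anything from my configuration.  Assuming the JDK 8 default of G1MaxNewSizePercent=60 (an experimental flag) together with the 8m region size I'm using, the arithmetic lines up exactly:

    0.60 * 8329887744 bytes = 4997932646 bytes (roughly)
    4997932646 / 8388608    = 595 regions (rounded down)
    595 * 8388608           = 4991221760 bytes, the MaxNewSize value in the logs

If that reading is right, the log header is just echoing a maximum young generation size that the VM computed for itself.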
Thanks,
Shawn

From java at elyograg.org  Sat Nov 3 14:18:55 2018
From: java at elyograg.org (Shawn Heisey)
Date: Sat, 3 Nov 2018 08:18:55 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <549777c1-ee66-e362-3c04-09549d628471@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org> <549777c1-ee66-e362-3c04-09549d628471@elyograg.org>
Message-ID: 

On 11/2/2018 4:35 PM, Shawn Heisey wrote:
> The first logfile, solr_gc.log.0, does NOT contain that parameter.
> Here's the top of the logfile and the first log entry, showing
> consistency between the first log and the info output by the ps
> command

A followup to my followup.  I seem to be the only person speaking!

What I wrote here might be a little bit confusing.  The consistency I was talking about here was with the "Sep27" start time indicator on the ps command output.

Thanks,
Shawn

From thomas.schatzl at oracle.com  Mon Nov 5 12:58:56 2018
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Mon, 05 Nov 2018 13:58:56 +0100
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
Message-ID: <10d9f555e8345aafc10a0c6c99682a038d77af16.camel@oracle.com>

Hi Shawn,

On Fri, 2018-11-02 at 15:23 -0600, Shawn Heisey wrote:
> I have a Solr install with the following info on the process list:
>
> solr 4426 1 15 Sep27 ? 5-17:59:51 java -server -Xms7943m
> -Xmx7943m -XX:+UseG1GC -XX:+PerfDisableSharedMem
> -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m
> -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -Xloggc:/data/solr-home/logs/solr_gc.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
>
> [...]
>
> Process started Sep27, with G1 tuning, but no parameters setting
> specific generation sizes.
>
> Here's the info at the top of a GC log written by that process:
>
> 2018-10-29 05:47:53 GC log file created /data/solr-home/logs/solr_gc.log.6
> Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE
> (1.8.0_162-b12), built on Dec 19 2017 21:15:48 by "java_re" with gcc
> 4.3.0 20080428 (Red Hat 4.3.0-8)
> Memory: 4k page, physical 16268024k(163252k free), swap
> 2098172k(2056676k free)
> CommandLine flags: -XX:+AggressiveOpts -XX:CICompilerCount=3
> -XX:ConcGCThreads=1 -XX:G1HeapRegionSize=8388608
> -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=8329887744
> -XX:InitiatingHeapOccupancyPercent=75 -XX:+ManagementServer
> -XX:MarkStackSize=4194304 -XX:MaxGCPauseMillis=250
> -XX:MaxHeapSize=8329887744 -XX:MaxNewSize=4991221760
> -XX:MinHeapDeltaBytes=8388608 -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983
> /data/solr-home/logs -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem -XX:+PrintGC
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256
> -XX:+UseCompressedClassPointers -XX:+UseCompressedOops
> -XX:+UseFastUnorderedTimeStamps -XX:+UseG1GC -XX:+UseGCLogFileRotation
> -XX:+UseLargePages
>
> As you can see, the gc log says that -XX:MaxNewSize is among the
> parameters.  The problem came to my attention when I put the gc log
> into the gceasy.io website and it said "don't set your generation
> sizes explicitly, G1 works better when they aren't explicit."  Which
> I thought was weird, because I didn't know of anywhere that the sizes
> were being set.  I don't WANT to set the generation sizes explicitly.
>
> I have confirmed by cross-referencing pids, checking which pid has
> the logfile open, and watching for new data written to the same log
> that it is indeed the same process, so I know the commandline
> arguments from the ps listing are from the right process.
>
> Here's the version info for the java version that's running:
>
> [solr at RWVM-SolrDev03 logs]$ java -version
> java version "1.8.0_162"
> Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
>
> I cannot reproduce the behavior manually on a Windows 10 machine,
> either with the same version that's on the Linux machine or using
> 8u191.  The machine where I see this happening is running CentOS 7.
> Hopefully I will be in a position to restart the service on the
> machine in the near future so I can check whether the issue is
> persistent.
>
> I will attempt to reproduce on another Linux machine.  Does this
> seem like a bug?

I remember that this is a (known) bug in printing command line flags.  It may also occur the other way around, i.e. flags specified on the command line are not printed.  This has no impact on the flags actually used.

I spent some time trying to come up with an existing bug number for that, but failed.

Looking at the code that prints these flags, it actually prints any command line flags modified so far (including e.g. by ergonomics), which, depending on the time the method is called, adds or removes some flags in the output.

Thanks,
  Thomas

From jai.forums2013 at gmail.com  Fri Nov 23 13:55:30 2018
From: jai.forums2013 at gmail.com (Jaikiran Pai)
Date: Fri, 23 Nov 2018 19:25:30 +0530
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
Message-ID: 

Hi,

I'm looking for some inputs in debugging a high memory usage issue (and subsequently the process being killed) in one of the applications I deal with.  Given what I have looked into this issue so far, it appears to be something to do with the CMS collector, so I hope this is the right place for this question.

A bit of background - the application that I'm dealing with is ElasticSearch server version 1.7.5.
We use Java 8:

java version "1.8.0_172"
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)

To add to the complexity in debugging this issue, this runs as a docker container on docker version 18.03.0-ce on a CentOS 7 host VM, kernel version 3.10.0-693.5.2.el7.x86_64.

We have been noticing that this container/process keeps getting killed by the oom-killer every few days.  The dmesg logs suggest that the process has hit the "limits" set on the docker cgroups level.  After debugging this over the past day or so, I've reached a point where I can't make much sense of the data I'm looking at.  The JVM process is started using the following params (of relevance):

java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....

As you can see it uses the CMS collector with 75% of tenured/old gen for initiating the GC.

After a few hours/days of running I notice that even though the CMS collector does run almost every hour or so, there are a huge number of objects _with no GC roots_ that never get collected.  These objects internally seem to hold on to ByteBuffer(s) which (from what I see) as a result never get released, and the non-heap memory keeps building up, till the process gets killed.  To give an example, here's the jmap -histo output (only relevant parts):

   1:        861642      196271400  [B
   2:        198776       28623744  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
   3:        676722       21655104  org.apache.lucene.store.ByteArrayDataInput
   4:        202398       19430208  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:        261819       18850968  org.apache.lucene.util.fst.FST$Arc
   6:        178661       17018376  [C
   7:         31452       16856024  [I
   8:        203911        8049352  [J
   9:         85700        5484800  java.nio.DirectByteBufferR
  10:        168935        5405920  java.util.concurrent.ConcurrentHashMap$Node
  11:         89948        5105328  [Ljava.lang.Object;
  12:        148514        4752448  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference

....

Total      5061244      418712248

This above output is without the "live" option.  Running jmap -histo:live returns something like (again only relevant parts):

  13:         31753        1016096  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
  ...
  44:           887         127728  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
  ...
  50:          3054          97728  org.apache.lucene.store.ByteArrayDataInput
  ...
  59:           888          85248  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState

  Total      1177783      138938920

Notice the vast difference between the live and non-live instances of the same class.  This isn't just in one "snapshot".  I have been monitoring this for more than a day and this pattern continues.  Even taking heap dumps and using tools like visualvm shows that these instances have "no GC root", and I have even checked the gc log files to see that the CMS collector does occasionally run.  However these objects never seem to get collected.

I realize this data may not be enough to narrow down the issue, but what I am looking for is some kind of help/input/hints/suggestions on what I should be trying to figure out why these instances aren't GCed.  Is this something that's expected in certain situations?

-Jaikiran

From leo.korinth at oracle.com  Fri Nov 23 17:07:14 2018
From: leo.korinth at oracle.com (Leo Korinth)
Date: Fri, 23 Nov 2018 18:07:14 +0100
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
In-Reply-To: 
References: 
Message-ID: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com>

Hi!

On 23/11/2018 14:55, Jaikiran Pai wrote:
> Hi,
>
> I'm looking for some inputs in debugging a high memory usage issue (and
> subsequently the process being killed) in one of the applications I deal
> with. Given what I have looked into this issue so far, it appears to be
> something to do with the CMS collector, so I hope this is the right place
> for this question.
>
> A bit of background - the application that I'm dealing with is
> ElasticSearch server version 1.7.5. We use Java 8:
>
> java version "1.8.0_172"
> Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)
>
> To add to the complexity in debugging this issue, this runs as a docker
> container on docker version 18.03.0-ce on a CentOS 7 host VM, kernel
> version 3.10.0-693.5.2.el7.x86_64.
>
> We have been noticing that this container/process keeps getting killed
> by the oom-killer every few days. The dmesg logs suggest that the
> process has hit the "limits" set on the docker cgroups level. After
> debugging this over the past day or so, I've reached a point where I
> can't make much sense of the data I'm looking at. The JVM process is
> started using the following params (of relevance):
>
> java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....

What is your limit? You will need _more_ than 6 gig of memory; exactly how much is hard to say. You will probably find the limit faster if you use Xms==Xmx and maybe not need days of running the application.

> As you can see it uses the CMS collector with 75% of tenured/old gen for
> initiating the GC.
>
> After a few hours/days of running I notice that even though the CMS
> collector does run almost every hour or so, there are a huge number of
> objects _with no GC roots_ that never get collected. These objects
> internally seem to hold on to ByteBuffer(s) which (from what I see) as a
> result never get released, and the non-heap memory keeps building up,
> till the process gets killed. To give an example, here's the jmap -histo
> output (only relevant parts):
>
>    1:        861642      196271400  [B
>    2:        198776       28623744  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
>    3:        676722       21655104  org.apache.lucene.store.ByteArrayDataInput
>    4:        202398       19430208  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
>    5:        261819       18850968  org.apache.lucene.util.fst.FST$Arc
>    6:        178661       17018376  [C
>    7:         31452       16856024  [I
>    8:        203911        8049352  [J
>    9:         85700        5484800  java.nio.DirectByteBufferR
>   10:        168935        5405920  java.util.concurrent.ConcurrentHashMap$Node
>   11:         89948        5105328  [Ljava.lang.Object;
>   12:        148514        4752448  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
>
> ....
>
> Total      5061244      418712248
>
> This above output is without the "live" option. Running jmap -histo:live
> returns something like (again only relevant parts):
>
>   13:         31753        1016096  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
>   ...
>   44:           887         127728  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
>   ...
>   50:          3054          97728  org.apache.lucene.store.ByteArrayDataInput
>   ...
>   59:           888          85248  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
>
>   Total      1177783      138938920
>
> Notice the vast difference between the live and non-live instances of
> the same class. This isn't just in one "snapshot". I have been
> monitoring this for more than a day and this pattern continues. Even
> taking heap dumps and using tools like visualvm shows that these
> instances have "no GC root", and I have even checked the gc log files to
> see that the CMS collector does occasionally run. However these objects
> never seem to get collected.

What makes you believe they never get collected?

> I realize this data may not be enough to narrow down the issue, but what
> I am looking for is some kind of help/input/hints/suggestions on what I
> should be trying to figure out why these instances aren't GCed. Is this
> something that's expected in certain situations?

Being killed by the oom-killer suggests that your limit is too low and/or your -Xmx is too large. If an increasing number of objects does not get collected, you would get an exception (not being killed by the oom-killer). What likely is happening is that your java heap slowly grows (not unusual with CMS that does not do compaction of old objects) and that the memory consumed by your docker image exceeds your limit.

How big the limit should be is hard to tell, but it _must_ be larger than your "-Xmx" (the JVM is using more memory than the java heap, so this would be true even without the addition of docker).

Thanks, Leo

> -Jaikiran
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From jai.forums2013 at gmail.com  Sat Nov 24 01:35:31 2018
From: jai.forums2013 at gmail.com (Jaikiran Pai)
Date: Sat, 24 Nov 2018 07:05:31 +0530
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
In-Reply-To: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com>
References: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com>
Message-ID: <3b85eaab-5c31-2568-522b-59ec8bfcdfc2@gmail.com>

Hello Leo,

Thank you for responding. Replies inline.

On 23/11/18 10:37 PM, Leo Korinth wrote:
> Hi!
>
> On 23/11/2018 14:55, Jaikiran Pai wrote:
>>
>> java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....
>
> What is your limit?

The docker cgroups limit (set via --memory) is set to 8G.  Here's the docker stats output of that container right now (been running for around 16 hours now):

CONTAINER ID        NAME            CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
da5ee21b1d          elasticsearch   94.62%              3.657GiB / 8GiB     45.71%              0B / 0B             1.21MB / 18.2GB     227

> You will need _more_ than 6 gig of memory; exactly how much is hard to
> say.
> You will probably find the limit faster if you use Xms==Xmx and
> maybe not need days of running the application.

That's interesting and useful - I'll give that a try.  That will certainly help speed up the investigation.

>
>> Notice the vast difference between the live and non-live instances of
>> the same class. This isn't just in one "snapshot". I have been
>> monitoring this for more than a day and this pattern continues. Even
>> taking heap dumps and using tools like visualvm shows that these
>> instances have "no GC root" and I have even checked the gc log files to
>> see that the CMS collector does occasionally run. However these objects
>> never seem to get collected.
>
> What makes you believe they never get collected?

I have gc logs enabled on this setup.  I notice that the Full GC gets triggered once in a while.  However, even after that Full GC completes, I still see this vast amount of "non-live" objects staying around in the heap (via jmap -histo output as well as a heap dump using visualvm).

This is the "total" in heap via jmap -histo:

            #instances     #bytes
Total       22880189       1644503392 (around 1.5 G)

and this is the -histo:live

Total       1292631        102790184 (around 98M, not even 100M)

Some of these non-live objects hold on to the ByteBuffer(s) which keep filling up the non-heap memory too (right now the non-heap "mapped" ByteBuffer memory as shown in the JMX MBean is around 2.5G).  The Full GC log message looks like this:

2018-11-24T00:57:00.665+0000: 59655.295: [Full GC (Heap Inspection Initiated GC) 2018-11-24T00:57:00.665+0000: 59655.295: [CMS: 711842K->101527K(989632K), 0.53220752 secs] 1016723K->101527K(1986432K), [Metaspace: 48054K->48054K(1093632K)], 0.5325692 secs] [Times: user=0.53 sys=0.00, real=0.53 secs]

The jmap -heap output is this:

Server compiler detected.
JVM version is 25.172-b11

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 6442450944 (6144.0MB)
   NewSize                  = 1134100480 (1081.5625MB)
   MaxNewSize               = 1134100480 (1081.5625MB)
   OldSize                  = 1013383168 (966.4375MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1020723200 (973.4375MB)
   used     = 900988112 (859.2492218017578MB)
   free     = 119735088 (114.18827819824219MB)
   88.26958297802969% used
Eden Space:
   capacity = 907345920 (865.3125MB)
   used     = 879277776 (838.5446319580078MB)
   free     = 28068144 (26.767868041992188MB)
   96.9065663512324% used
From Space:
   capacity = 113377280 (108.125MB)
   used     = 21710336 (20.70458984375MB)
   free     = 91666944 (87.42041015625MB)
   19.148753612716764% used
To Space:
   capacity = 113377280 (108.125MB)
   used     = 0 (0.0MB)
   free     = 113377280 (108.125MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 1013383168 (966.4375MB)
   used     = 338488264 (322.8075637817383MB)
   free     = 674894904 (643.6299362182617MB)
   33.40180443968061% used

14231 interned Strings occupying 2258280 bytes.

So based on these various logs and data, I notice that they aren't being collected.  At the time when the process gets chosen to be killed, even then I see these vast non-live objects holding on.  I don't have too much knowledge in the GC area, so a genuine question - Is it just a wrong expectation that whenever a Full GC runs, it is supposed to clear out the non-live objects?  Or is it a usual thing that the collector doesn't choose them for collection for some reason?

>
>> I realize this data may not be enough to narrow down the issue, but what
>> I am looking for is some kind of help/input/hints/suggestions on what I
>> should be trying to figure out why these instances aren't GCed. Is this
>> something that's expected in certain situations?
>
> Being killed by the oom-killer suggests that your limit is too low
> and/or your -Xmx is too large. If an increasing number of objects does
> not get collected, you would get an exception (not being killed by the
> oom-killer). What likely is happening is that your java heap slowly
> grows (not unusual with CMS that does not do compaction of old
> objects) and that the memory consumed by your docker image exceeds
> your limit.

You are right.  The heap usage (taking into consideration the uncollected non-live objects) does indeed grow very slowly.  There however aren't too many (live) objects on the heap really (even after days, when this process is about to be killed).  I'll have to read up on the compaction that you mentioned about CMS.

Here's one example output from dmesg when this process was killed previously:

[Wed Aug  8 03:21:13 2018] java invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-600
[Wed Aug  8 03:21:13 2018] java cpuset=390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 mems_allowed=0
[Wed Aug  8 03:21:13 2018] CPU: 7 PID: 15827 Comm: java Not tainted 3.10.0-693.5.2.el7.x86_64 #1
...
[Wed Aug  8 03:21:13 2018]  ffff88081472dee0 00000000fe941297 ffff8808062cfcb8 ffffffff816a3e51
[Wed Aug  8 03:21:13 2018]  ffff8808062cfd48 ffffffff8169f246 ffff88080d6d3500 0000000000000001
[Wed Aug  8 03:21:13 2018]  0000000000000000 0000000000000000 ffff8808062cfcf8 0000000000000046
[Wed Aug  8 03:21:13 2018] Call Trace:
[Wed Aug  8 03:21:13 2018]  [] dump_stack+0x19/0x1b
[Wed Aug  8 03:21:13 2018]  [] dump_header+0x90/0x229
[Wed Aug  8 03:21:13 2018]  [] ? find_lock_task_mm+0x56/0xc0
[Wed Aug  8 03:21:13 2018]  [] ? try_get_mem_cgroup_from_mm+0x28/0x60
[Wed Aug  8 03:21:13 2018]  [] oom_kill_process+0x254/0x3d0
[Wed Aug  8 03:21:13 2018]  [] ? selinux_capable+0x1c/0x40
[Wed Aug  8 03:21:13 2018]  [] mem_cgroup_oom_synchronize+0x546/0x570
[Wed Aug  8 03:21:13 2018]  [] ? mem_cgroup_charge_common+0xc0/0xc0
[Wed Aug  8 03:21:13 2018]  [] pagefault_out_of_memory+0x14/0x90
[Wed Aug  8 03:21:13 2018]  [] mm_fault_error+0x68/0x12b
[Wed Aug  8 03:21:13 2018]  [] __do_page_fault+0x391/0x450
[Wed Aug  8 03:21:13 2018]  [] do_page_fault+0x35/0x90
[Wed Aug  8 03:21:13 2018]  [] page_fault+0x28/0x30
[Wed Aug  8 03:21:13 2018] Task in /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 killed as a result of limit of /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5
[Wed Aug  8 03:21:13 2018] memory: usage 8388608kB, limit 8388608kB, failcnt 426669
[Wed Aug  8 03:21:13 2018] memory+swap: usage 8388608kB, limit 8388608kB, failcnt 152
[Wed Aug  8 03:21:13 2018] kmem: usage 5773512kB, limit 9007199254740988kB, failcnt 0
[Wed Aug  8 03:21:13 2018] Memory cgroup stats for /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5: cache:14056KB rss:2601040KB rss_huge:2484224KB mapped_file:6428KB swap:0KB inactive_anon:0KB active_anon:142640KB inactive_file:4288KB active_file:3324KB unevictable:2464692KB
[Wed Aug  8 03:21:13 2018] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Wed Aug  8 03:21:13 2018] [15582]  1000 15582  3049827   654465    1446        0          -600 java
[Wed Aug  8 03:21:13 2018] Memory cgroup out of memory: Kill process 12625 (java) score 0 or sacrifice child
[Wed Aug  8 03:21:13 2018] Killed process 15582 (java) total-vm:12199308kB, anon-rss:2599320kB, file-rss:18540kB, shmem-rss:0kB

> How big the limit should be is hard to tell, but it _must_ be larger
> than your "-Xmx" (the JVM is using more memory than the java heap, so
> this would be true even without the addition of docker).

Agreed, and that's where the ByteBuffer(s) are coming into the picture (around 2.5 GB right now and growing steadily).

-Jaikiran

From leo.korinth at oracle.com  Mon Nov 26 10:03:26 2018
From: leo.korinth at oracle.com (Leo Korinth)
Date: Mon, 26 Nov 2018 11:03:26 +0100
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
In-Reply-To: <3b85eaab-5c31-2568-522b-59ec8bfcdfc2@gmail.com>
References: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com> <3b85eaab-5c31-2568-522b-59ec8bfcdfc2@gmail.com>
Message-ID: <6ebac3fe-2f69-2ed3-980d-6369a2151005@oracle.com>

Hi!

> The docker cgroups limit (set via --memory) is set to 8G. Here's the
> docker stats output of that container right now (been running for around
> 16 hours now):

Okay, so you have about 2 gig to use outside the java heap.  The docker image will use some space.  The JVM will use memory (off the java heap) for compiler caches, threads, class metadata, mmapped files (_mmap is used by elastic_) and other things.  I cannot give you a good estimate of how much this is, but a quick look at:

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

suggests that you should give at most half the memory to the java heap.  If you give 6 gig to the heap, they seem to suggest 12 gig of memory (instead of 8).  Now, I do not know enough about elastic search to say whether this is sufficient, and that guide does not seem to mention docker.

I would suggest that you try to find a configuration that also _limits_ the use of file cache for elastic.  If elastic does not understand that it is running under docker, it _might_ use huge file caches.  Try to limit the file caches and give the docker image x gig for the java heap, y gig for caches and some extra for slack.

Hope this helps.

Thanks, Leo

> CONTAINER ID        NAME            CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> da5ee21b1d          elasticsearch   94.62%              3.657GiB / 8GiB     45.71%              0B / 0B             1.21MB / 18.2GB     227
>> You will need _more_ than 6 gig of memory; exactly how much is hard to
>> say. You will probably find the limit faster if you use Xms==Xmx and
>> maybe not need days of running the application.
> That's interesting and useful - I'll give that a try. That will
> certainly help speed up the investigation.
>
>>> Notice the vast difference between the live and non-live instances of
>>> the same class. This isn't just in one "snapshot". I have been
>>> monitoring this for more than a day and this pattern continues.
Even >>> taking heap dumps and using tools like visualvm shows that these >>> instances have "no GC root" and I have even checked the gc log files to >>> see that the CMS collector does occasionally run. However these objects >>> never seem to get collected. >> >> What makes you believe they never get collected? > I have gc logs enabled on this setup. I notice that the Full GC gets > triggered once in a while. However, even after that Full GC completes, I > still see these vast amount of "non-live" objects staying around in the > heap (via jmap -histo output as well as a heap dump using visualvm). > This is the "total" in heap via jmap -histo: > > ??? ??? ?? #instances ? #bytes > Total????? 22880189???? 1644503392 (around 1.5 G) > > and this is the -histo:live > > Total?????? 1292631????? 102790184 (around 98M, not even 100M) > > > Some of these non-live objects hold on to the ByteBuffer(s) which keep > filling up then non-heap memory too (right now the non-heap "mapped" > ByteBuffer memory as shown in the JMX MBean is around 2.5G). The Full GC > log message looks like this: > > 2018-11-24T00:57:00.665+0000: 59655.295: [Full GC (Heap Inspection > Initiated GC)) > ?2018-11-24T00:57:00.665+0000: 59655.295: [CMS: > 711842K->101527K(989632K), 0.5322 > 0752 secs] 1016723K->101527K(1986432K), [Metaspace: > 48054K->48054K(1093632K)], 00 > .5325692 secs] [Times: user=0.53 sys=0.00, real=0.53 secs] > > The jmap -heap output is this: > > Server compiler detected. > JVM version is 25.172-b11 > > using parallel threads in the new generation. > using thread-local object allocation. > Concurrent Mark-Sweep GC > > Heap Configuration: > ?? MinHeapFreeRatio???????? = 40 > ?? MaxHeapFreeRatio???????? = 70 > ?? MaxHeapSize????????????? = 6442450944 (6144.0MB) > ?? NewSize????????????????? = 1134100480 (1081.5625MB) > ?? MaxNewSize?????????????? = 1134100480 (1081.5625MB) > ?? OldSize????????????????? = 1013383168 (966.4375MB) > ?? NewRatio???????????????? = 2 > ?? SurvivorRatio??????????? = 8 > ?? MetaspaceSize??????????? = 21807104 (20.796875MB) > ?? CompressedClassSpaceSize = 1073741824 (1024.0MB) > ?? MaxMetaspaceSize???????? = 17592186044415 MB > ?? G1HeapRegionSize???????? = 0 (0.0MB) > > Heap Usage: > New Generation (Eden + 1 Survivor Space): > ?? capacity = 1020723200 (973.4375MB) > ?? used???? = 900988112 (859.2492218017578MB) > ?? free???? = 119735088 (114.18827819824219MB) > ?? 88.26958297802969% used > Eden Space: > ?? capacity = 907345920 (865.3125MB) > ?? used???? = 879277776 (838.5446319580078MB) > ?? free???? = 28068144 (26.767868041992188MB) > ?? 96.9065663512324% used > From Space: > ?? capacity = 113377280 (108.125MB) > ?? used???? = 21710336 (20.70458984375MB) > ?? free???? = 91666944 (87.42041015625MB) > ?? 19.148753612716764% used > To Space: > ?? capacity = 113377280 (108.125MB) > ?? used???? = 0 (0.0MB) > ?? free???? = 113377280 (108.125MB) > ?? 0.0% used > concurrent mark-sweep generation: > ?? capacity = 1013383168 (966.4375MB) > ?? used???? = 338488264 (322.8075637817383MB) > ?? free???? = 674894904 (643.6299362182617MB) > ?? 33.40180443968061% used > > 14231 interned Strings occupying 2258280 bytes. > > So based on these various logs and data, I notice that they aren't being > collected. At the time when the process gets chosen to be killed, even > then I see these vast non-live objects holding on. 
I don't have too much > knowledge in the GC area, so a genuine question - Is it just a wrong > expectation that whenever a Full GC runs, it is supposed to clear out > the non-live objects? Or is it a usual thing that the collector doesn't > choose them for collection for some reason? >> >>> >>> I realize this data may not be enough to narrow down the issue, but what >>> I am looking for is some kind of help/input/hints/suggestions on what I >>> should be trying to figure out why these instances aren't GCed. Is this >>> something that's expected in certain situations? >> >> Being killed by the oom-killer suggests that your limit is too low >> and/or your -Xmx is too large. If an increasing number of objects does >> not get collected, you would get an exception (not being killed by the >> oom-killer). What likely is happening is that your java heap slowly >> grows (not unusual with CMS that does not do compaction of old >> objects) and that the memory consumed by your docker image exceeds >> your limit. > You are right. The heap usage (keeping into consideration the > uncollected non-live objects) does indeed grow very slowly. There > however aren't too many (live) objects on heap really (even after days > when this process is about to be killed). I'll have to read up on the > compaction that you mentioned about CMS. > > Here's one example output from dmesg when this process was killed > previously: > > [Wed Aug? 8 03:21:13 2018] java invoked oom-killer: gfp_mask=0xd0, > order=0, oom_score_adj=-600 > [Wed Aug? 8 03:21:13 2018] java > cpuset=390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 > mems_allowed=0 > [Wed Aug? 8 03:21:13 2018] CPU: 7 PID: 15827 Comm: java Not tainted > 3.10.0-693.5.2.el7.x86_64 #1 > ... > [Wed Aug? 8 03:21:13 2018]? ffff88081472dee0 00000000fe941297 > ffff8808062cfcb8 ffffffff816a3e51 > [Wed Aug? 8 03:21:13 2018]? ffff8808062cfd48 ffffffff8169f246 > ffff88080d6d3500 0000000000000001 > [Wed Aug? 8 03:21:13 2018]? 0000000000000000 0000000000000000 > ffff8808062cfcf8 0000000000000046 > [Wed Aug? 8 03:21:13 2018] Call Trace: > [Wed Aug? 8 03:21:13 2018]? [] dump_stack+0x19/0x1b > [Wed Aug? 8 03:21:13 2018]? [] dump_header+0x90/0x229 > [Wed Aug? 8 03:21:13 2018]? [] ? > find_lock_task_mm+0x56/0xc0 > [Wed Aug? 8 03:21:13 2018]? [] ? > try_get_mem_cgroup_from_mm+0x28/0x60 > [Wed Aug? 8 03:21:13 2018]? [] > oom_kill_process+0x254/0x3d0 > [Wed Aug? 8 03:21:13 2018]? [] ? selinux_capable+0x1c/0x40 > [Wed Aug? 8 03:21:13 2018]? [] > mem_cgroup_oom_synchronize+0x546/0x570 > [Wed Aug? 8 03:21:13 2018]? [] ? > mem_cgroup_charge_common+0xc0/0xc0 > [Wed Aug? 8 03:21:13 2018]? [] > pagefault_out_of_memory+0x14/0x90 > [Wed Aug? 8 03:21:13 2018]? [] mm_fault_error+0x68/0x12b > [Wed Aug? 8 03:21:13 2018]? [] __do_page_fault+0x391/0x450 > [Wed Aug? 8 03:21:13 2018]? [] do_page_fault+0x35/0x90 > [Wed Aug? 8 03:21:13 2018]? [] page_fault+0x28/0x30 > [Wed Aug? 8 03:21:13 2018] Task in > /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 > killed as a result of limit of > /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 > [Wed Aug? 8 03:21:13 2018] memory: usage 8388608kB, limit 8388608kB, > failcnt 426669 > [Wed Aug? 8 03:21:13 2018] memory+swap: usage 8388608kB, limit > 8388608kB, failcnt 152 > [Wed Aug? 8 03:21:13 2018] kmem: usage 5773512kB, limit > 9007199254740988kB, failcnt 0 > [Wed Aug? 
8 03:21:13 2018] Memory cgroup stats for > /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5: > cache:14056KB rss:2601040KB rss_huge:2484224KB mapped_file:6428KB > swap:0KB inactive_anon:0KB active_anon:142640KB inactive_file:4288KB > active_file:3324KB unevictable:2464692KB > [Wed Aug? 8 03:21:13 2018] [ pid ]?? uid? tgid total_vm????? rss nr_ptes > swapents oom_score_adj name > [Wed Aug? 8 03:21:13 2018] [15582]? 1000 15582? 3049827?? 654465 > 1446??????? 0????????? -600 java > [Wed Aug? 8 03:21:13 2018] Memory cgroup out of memory: Kill process > 12625 (java) score 0 or sacrifice child > [Wed Aug? 8 03:21:13 2018] Killed process 15582 (java) > total-vm:12199308kB, anon-rss:2599320kB, file-rss:18540kB, shmem-rss:0kB > > >> How big the limit should be is hard to tell, but it _must_ be larger >> than your "-Xmx" (the JVM is using more memory than the java heap, so >> this would be true even without the addition of docker). > Agreed and that's where the ByteBuffer(s) are coming into picture > (around 2.5 GB right now and growing steadily). > > -Jaikiran > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > From jwhiting at redhat.com Mon Nov 26 10:26:41 2018 From: jwhiting at redhat.com (jwhiting at redhat.com) Date: Mon, 26 Nov 2018 10:26:41 +0000 Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots In-Reply-To: References: Message-ID: <5604aaaf008ce26e313e5e7ad7fa1ae4844afbac.camel@redhat.com> Hi Jaikiran Have a look at some blog posts by old friends :) These blog posts might be helpful (along with the other replies you received) to diagnose the root cause of the issue. In particular native memory tracking. https://developers.redhat.com/blog/2017/03/14/java-inside-docker/ https://developers.redhat.com/blog/2017/04/04/openjdk-and-containers/ Regards, Jeremy On Fri, 2018-11-23 at 19:25 +0530, Jaikiran Pai wrote: > Hi, > > I'm looking for some inputs in debugging a high memory usage issue > (and > subsequently the process being killed) in one of the applications I > deal > with. Given that from what I have looked into this issue so far, this > appears to be something to do with the CMS collector, so I hope this > is > the right place to this question. > > A bit of a background - The application that I'm dealing with is > ElasticSearch server version 1.7.5. We use Java 8: > > java version "1.8.0_172" > Java(TM) SE Runtime Environment (build 1.8.0_172-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode) > > To add to the complexity in debugging this issue, this runs as a > docker > container on docker version 18.03.0-ce on a CentOS 7 host VM kernel > version 3.10.0-693.5.2.el7.x86_64. > > We have been noticing that this container/process keeps getting > killed > by the oom-killer every few days. The dmesg logs suggest that the > process has hit the "limits" set on the docker cgroups level. After > debugging this over past day or so, I've reached a point where I > can't > make much sense of the data I'm looking at. The JVM process is > started > using the following params (of relevance): > > java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC .... > > As you can see it uses CMS collector with 75% of tenured/old gen for > initiating the GC. 
> > After a few hours/days of running I notice that even though the CMS > collector does run almost every hour or so, there are huge number of > objects _with no GC roots_ that never get collected. These objects > internally seem to hold on to ByteBuffer(s) which (from what I see) > as a > result never get released and the non-heap memory keeps building up, > till the process gets killed. To give an example, here's the jmap > -histo > output (only relevant parts): > > 1: 861642 196271400 [B > 2: 198776 28623744 > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame > 3: 676722 21655104 > org.apache.lucene.store.ByteArrayDataInput > 4: 202398 19430208 > org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTerm > State > 5: 261819 18850968 > org.apache.lucene.util.fst.FST$Arc > 6: 178661 17018376 [C > 7: 31452 16856024 [I > 8: 203911 8049352 [J > 9: 85700 5484800 java.nio.DirectByteBufferR > 10: 168935 5405920 > java.util.concurrent.ConcurrentHashMap$Node > 11: 89948 5105328 [Ljava.lang.Object; > 12: 148514 4752448 > org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference > > .... > > Total 5061244 418712248 > > This above output is without the "live" option. Running jmap > -histo:live > returns something like (again only relevant parts): > > 13: 31753 1016096 > org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference > ... > 44: 887 127728 > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame > ... > 50: 3054 97728 > org.apache.lucene.store.ByteArrayDataInput > ... > 59: 888 85248 > org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTerm > State > > Total 1177783 138938920 > > > Notice the vast difference between the live and non-live instances of > the same class. This isn't just in one "snapshot". I have been > monitoring this for more than a day and this pattern continues. Even > taking heap dumps and using tools like visualvm shows that these > instances have "no GC root" and I have even checked the gc log files > to > see that the CMS collector does occasionally run. However these > objects > never seem to get collected. > > I realize this data may not be enough to narrow down the issue, but > what > I am looking for is some kind of help/input/hints/suggestions on what > I > should be trying to figure out why these instances aren't GCed. Is > this > something that's expected in certain situations? > > -Jaikiran > > > > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -- -- Jeremy Whiting Senior Software Engineer, Middleware Performance Team Red Hat ------------------------------------------------------------ Registered Address: Red Hat UK Ltd, Peninsular House, 30 Monument Street, London. United Kingdom. Registered in England and Wales under Company Registration No. 03798903. Directors: Directors:Michael Cunningham (US), Michael O'Neill(Ireland), Eric Shander (US)