From java at elyograg.org  Fri Nov 2 21:23:07 2018
From: java at elyograg.org (Shawn Heisey)
Date: Fri, 2 Nov 2018 15:23:07 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
Message-ID: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>

I have a Solr install with the following info on the process list:

solr      4426     1 15 Sep27 ?        5-17:59:51 java -server -Xms7943m -Xmx7943m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/data/solr-home/logs/solr_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=3000 -Dcom.sun.management.jmxremote.rmi.port=3000 -Djava.rmi.server.hostname=RWVM-SolrDev03 -Dsolr.log.dir=/data/solr-home/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=RWVM-SolrDev03 -Duser.timezone=UTC -Djetty.home=/data/solr/server -Dsolr.solr.home=/data/solr-home/data -Dsolr.data.home= -Dsolr.install.dir=/data/solr -Dsolr.default.confdir=/data/solr/server/solr/configsets/_default/conf -Dlog4j.configuration=file:/data/solr-home/log4j.properties -Xss256k -Ddisable.configEdit=true -Xss256k -Dsolr.jetty.https.port=8983 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983 /data/solr-home/logs -jar start.jar --module=http

Process started Sep27, with G1 tuning, but no parameters setting specific generation sizes.

Here's the info at the top of a GC log written by that process:

2018-10-29 05:47:53 GC log file created /data/solr-home/logs/solr_gc.log.6
Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE (1.8.0_162-b12), built on Dec 19 2017 21:15:48 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 16268024k(163252k free), swap 2098172k(2056676k free)
CommandLine flags: -XX:+AggressiveOpts -XX:CICompilerCount=3 -XX:ConcGCThreads=1 -XX:G1HeapRegionSize=8388608 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=8329887744 -XX:InitiatingHeapOccupancyPercent=75 -XX:+ManagementServer -XX:MarkStackSize=4194304 -XX:MaxGCPauseMillis=250 -XX:MaxHeapSize=8329887744 -XX:MaxNewSize=4991221760 -XX:MinHeapDeltaBytes=8388608 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983 /data/solr-home/logs -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages

As you can see, the gc log says that -XX:MaxNewSize is among the parameters.  The problem came to my attention when I put the gc log into the gceasy.io website and it said "don't set your generation sizes explicitly, G1 works better when they aren't explicit."  Which I thought was weird, because I didn't know of anywhere that the sizes were being set.  I don't WANT to set the generation sizes explicitly.
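If it would help narrow things down, I can also query the running process directly with the standard JDK tools that ship with this JDK (jinfo and jcmd) to see what value the VM itself reports for the flag -- I have not captured that output here:

    jinfo -flag MaxNewSize 4426
    jcmd 4426 VM.flags -all | grep -i newsize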
I have confirmed by cross-referencing pids, checking which pid has the logfile open, and watching for new data written to the same log that it is indeed the same process, so I know the commandline arguments from the ps listing are from the right process.

Here's the version info for the java version that's running:

[solr at RWVM-SolrDev03 logs]$ java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

I cannot reproduce the behavior manually on a Windows 10 machine, either with the same version that's on the Linux machine or using 8u191.  The machine where I see this happening is running CentOS 7.  Hopefully I will be in a position to restart the service on the machine in the near future so I can check whether the issue is persistent.

I will attempt to reproduce on another Linux machine.  Does this seem like a bug?

Thanks,
Shawn

From java at elyograg.org  Fri Nov 2 21:48:20 2018
From: java at elyograg.org (Shawn Heisey)
Date: Fri, 2 Nov 2018 15:48:20 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
Message-ID: <7acd72cb-9022-f24d-2b47-e75b7220aad1@elyograg.org>

On 11/2/2018 3:23 PM, Shawn Heisey wrote:
> I will attempt to reproduce on another Linux machine.  Does this seem
> like a bug?

The problem is not reproducing on another Linux machine (Ubuntu 18) with a different version of Java:

Process listing:

solr     10709     1 20 15:44 ?        00:00:15 /usr/lib/jvm/java-8-oracle/bin/java -server -Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/storage0/solr740/logs/solr_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Dsolr.log.dir=/storage0/solr740/logs -Djetty.port=8900 -DSTOP.PORT=7900 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/storage0/solr7/server -Dsolr.solr.home=/storage0/solr740/data -Dsolr.data.home= -Dsolr.install.dir=/storage0/solr7 -Dsolr.default.confdir=/storage0/solr7/server/solr/configsets/_default/conf -Dlog4j.configurationFile=file:/storage0/solr740/log4j2.xml -Xss256k -Dsolr.jetty.https.port=8900 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/storage0/solr7/bin/oom_solr.sh 8900 /storage0/solr740/logs -jar start.jar --module=http

In the GC log:

Java HotSpot(TM) 64-Bit Server VM (25.171-b11) for linux-amd64 JRE (1.8.0_171-b11), built on Mar 28 2018 17:07:08 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 24628836k(9793172k free), swap 8000508k(8000508k free)
CommandLine flags: -XX:+AggressiveOpts -XX:G1HeapRegionSize=8388608 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=3221225472 -XX:InitiatingHeapOccupancyPercent=75 -XX:MaxGCPauseMillis=250 -XX:MaxHeapSize=3221225472 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/storage0/solr7/bin/oom_solr.sh 8900 /storage0/solr740/logs -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages

I don't have a lot of other combinations I can easily try.

Thanks,
Shawn

From java at elyograg.org  Fri Nov 2 22:35:50 2018
From: java at elyograg.org (Shawn Heisey)
Date: Fri, 2 Nov 2018 16:35:50 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
Message-ID: <549777c1-ee66-e362-3c04-09549d628471@elyograg.org>

On 11/2/2018 3:23 PM, Shawn Heisey wrote:
> Process started Sep27, with G1 tuning, but no parameters setting
> specific generation sizes.

Another followup.  Here's the GC logfiles currently found in the logs directory:

-rw-rw-r--. 1 solr solr 20973298 Oct  3 04:08 solr_gc.log.0
-rw-rw-r--. 1 solr solr 20974540 Oct  8 06:33 solr_gc.log.1
-rw-rw-r--. 1 solr solr 20971712 Oct 14 16:22 solr_gc.log.2
-rw-rw-r--. 1 solr solr 20971895 Oct 20 05:42 solr_gc.log.3
-rw-rw-r--. 1 solr solr 20971661 Oct 24 15:18 solr_gc.log.4
-rw-rw-r--. 1 solr solr 20974231 Oct 29 05:47 solr_gc.log.5
-rw-rw-r--. 1 solr solr 17246462 Nov  2 18:15 solr_gc.log.6.current

Only the files numbered 1 through 6 contain the -XX:MaxNewSize parameter.  The info in the previous message was obtained from the last logfile, the one with "current" in the name.

The first logfile, solr_gc.log.0, does NOT contain that parameter.  Here's the top of the logfile and the first log entry, showing consistency between the first log and the info output by the ps command -- but the -XX:MaxNewSize parameter is not present:

Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE (1.8.0_162-b12), built on Dec 19 2017 21:15:48 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 16268024k(8771152k free), swap 2098172k(2098096k free)
CommandLine flags: -XX:+AggressiveOpts -XX:G1HeapRegionSize=8388608 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=8328839168 -XX:InitiatingHeapOccupancyPercent=75 -XX:+ManagementServer -XX:MaxGCPauseMillis=250 -XX:MaxHeapSize=8328839168 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983 /data/solr-home/logs -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages

2018-09-27T01:11:19.136-0400: 0.518: Total time for which application threads were stopped: 0.0001694 seconds, Stopping threads took: 0.0000184 seconds

So that means that in the first GC logfile written by Java, that parameter wasn't there, but then it was added for the rest of the logfiles.  Which is REALLY odd.

My best guess is that G1GC eventually decided on the size of the young generation that it wanted to use, and then added the explicit commandline option for that size to the active arguments.  I had thought that G1 was capable of adjusting the generation sizes on the fly at any time ... but the logs that contain the -XX:MaxNewSize option all have it set to the same value, which is 4991221760 bytes.  It has been set to that value for at least a month.
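For what it's worth, the value itself looks like it could simply be G1's own ergonomic ceiling rather than anything from my configuration.  Assuming the JDK 8 default of G1MaxNewSizePercent=60 (an experimental flag) together with the 8m region size I'm using, the arithmetic lines up exactly:

    0.60 * 8329887744 bytes = 4997932646 bytes (roughly)
    4997932646 / 8388608    = 595 regions (rounded down)
    595 * 8388608           = 4991221760 bytes, the MaxNewSize value in the logs

If that reading is right, the log header is just echoing a maximum young generation size that the VM computed for itself.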
Thanks,
Shawn

From java at elyograg.org  Sat Nov 3 14:18:55 2018
From: java at elyograg.org (Shawn Heisey)
Date: Sat, 3 Nov 2018 08:18:55 -0600
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <549777c1-ee66-e362-3c04-09549d628471@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org> <549777c1-ee66-e362-3c04-09549d628471@elyograg.org>
Message-ID: 

On 11/2/2018 4:35 PM, Shawn Heisey wrote:
> The first logfile, solr_gc.log.0, does NOT contain that parameter.
> Here's the top of the logfile and the first log entry, showing
> consistency between the first log and the info output by the ps
> command

A followup to my followup.  I seem to be the only person speaking!

What I wrote here might be a little bit confusing.  The consistency I was talking about here was with the "Sep27" start time indicator on the ps command output.

Thanks,
Shawn

From thomas.schatzl at oracle.com  Mon Nov 5 12:58:56 2018
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Mon, 05 Nov 2018 13:58:56 +0100
Subject: Java 8 adding -XX:MaxNewSize to parameters
In-Reply-To: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
References: <9a8b97cb-6abb-5355-2e5c-a76ab3916917@elyograg.org>
Message-ID: <10d9f555e8345aafc10a0c6c99682a038d77af16.camel@oracle.com>

Hi Shawn,

On Fri, 2018-11-02 at 15:23 -0600, Shawn Heisey wrote:
> I have a Solr install with the following info on the process list:
>
> solr 4426 1 15 Sep27 ? 5-17:59:51 java -server -Xms7943m
> -Xmx7943m -XX:+UseG1GC -XX:+PerfDisableSharedMem
> -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m
> -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -Xloggc:/data/solr-home/logs/solr_gc.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
>
> [...]
>
> Process started Sep27, with G1 tuning, but no parameters setting
> specific generation sizes.
>
> Here's the info at the top of a GC log written by that process:
>
> 2018-10-29 05:47:53 GC log file created /data/solr-home/logs/solr_gc.log.6
> Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE
> (1.8.0_162-b12), built on Dec 19 2017 21:15:48 by "java_re" with gcc
> 4.3.0 20080428 (Red Hat 4.3.0-8)
> Memory: 4k page, physical 16268024k(163252k free), swap
> 2098172k(2056676k free)
> CommandLine flags: -XX:+AggressiveOpts -XX:CICompilerCount=3
> -XX:ConcGCThreads=1 -XX:G1HeapRegionSize=8388608
> -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=8329887744
> -XX:InitiatingHeapOccupancyPercent=75 -XX:+ManagementServer
> -XX:MarkStackSize=4194304 -XX:MaxGCPauseMillis=250
> -XX:MaxHeapSize=8329887744 -XX:MaxNewSize=4991221760
> -XX:MinHeapDeltaBytes=8388608 -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/data/solr/bin/oom_solr.sh 8983
> /data/solr-home/logs -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem -XX:+PrintGC
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution -XX:ThreadStackSize=256
> -XX:+UseCompressedClassPointers -XX:+UseCompressedOops
> -XX:+UseFastUnorderedTimeStamps -XX:+UseG1GC -XX:+UseGCLogFileRotation
> -XX:+UseLargePages
>
> As you can see, the gc log says that -XX:MaxNewSize is among the
> parameters.  The problem came to my attention when I put the gc log
> into the gceasy.io website and it said "don't set your generation
> sizes explicitly, G1 works better when they aren't explicit."  Which
> I thought was weird, because I didn't know of anywhere that the sizes
> were being set.  I don't WANT to set the generation sizes explicitly.
>
> I have confirmed by cross-referencing pids, checking which pid has
> the logfile open, and watching for new data written to the same log
> that it is indeed the same process, so I know the commandline
> arguments from the ps listing are from the right process.
>
> Here's the version info for the java version that's running:
>
> [solr at RWVM-SolrDev03 logs]$ java -version
> java version "1.8.0_162"
> Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
>
> I cannot reproduce the behavior manually on a Windows 10 machine,
> either with the same version that's on the Linux machine or using
> 8u191.  The machine where I see this happening is running CentOS 7.
> Hopefully I will be in a position to restart the service on the
> machine in the near future so I can check whether the issue is
> persistent.
>
> I will attempt to reproduce on another Linux machine.  Does this
> seem like a bug?

I remember that this is a (known) bug in printing command line flags.  It may also occur the other way around, i.e. flags specified on the command line are not printed.  This has no impact on the flags actually used.

I spent some time trying to come up with an existing bug number for that, but failed.

Looking at the code that prints these flags, it actually prints any command line flags modified so far (including e.g. by ergonomics), which, depending on the time the method is called, adds or removes some flags in the output.

Thanks,
  Thomas

From jai.forums2013 at gmail.com  Fri Nov 23 13:55:30 2018
From: jai.forums2013 at gmail.com (Jaikiran Pai)
Date: Fri, 23 Nov 2018 19:25:30 +0530
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
Message-ID: 

Hi,

I'm looking for some inputs in debugging a high memory usage issue (and subsequently the process being killed) in one of the applications I deal with.  Given what I have looked into this issue so far, it appears to be something to do with the CMS collector, so I hope this is the right place for this question.

A bit of background - the application that I'm dealing with is ElasticSearch server version 1.7.5.
We use Java 8:

java version "1.8.0_172"
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)

To add to the complexity in debugging this issue, this runs as a docker container on docker version 18.03.0-ce on a CentOS 7 host VM, kernel version 3.10.0-693.5.2.el7.x86_64.

We have been noticing that this container/process keeps getting killed by the oom-killer every few days.  The dmesg logs suggest that the process has hit the "limits" set on the docker cgroups level.  After debugging this over the past day or so, I've reached a point where I can't make much sense of the data I'm looking at.  The JVM process is started using the following params (of relevance):

java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....

As you can see it uses the CMS collector with 75% of tenured/old gen for initiating the GC.

After a few hours/days of running I notice that even though the CMS collector does run almost every hour or so, there are a huge number of objects _with no GC roots_ that never get collected.  These objects internally seem to hold on to ByteBuffer(s) which (from what I see) as a result never get released, and the non-heap memory keeps building up, till the process gets killed.  To give an example, here's the jmap -histo output (only relevant parts):

   1:        861642      196271400  [B
   2:        198776       28623744  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
   3:        676722       21655104  org.apache.lucene.store.ByteArrayDataInput
   4:        202398       19430208  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:        261819       18850968  org.apache.lucene.util.fst.FST$Arc
   6:        178661       17018376  [C
   7:         31452       16856024  [I
   8:        203911        8049352  [J
   9:         85700        5484800  java.nio.DirectByteBufferR
  10:        168935        5405920  java.util.concurrent.ConcurrentHashMap$Node
  11:         89948        5105328  [Ljava.lang.Object;
  12:        148514        4752448  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference

....

Total      5061244      418712248

This above output is without the "live" option.  Running jmap -histo:live returns something like (again only relevant parts):

  13:         31753        1016096  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
  ...
  44:           887         127728  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
  ...
  50:          3054          97728  org.apache.lucene.store.ByteArrayDataInput
  ...
  59:           888          85248  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState

  Total      1177783      138938920

Notice the vast difference between the live and non-live instances of the same class.  This isn't just in one "snapshot".  I have been monitoring this for more than a day and this pattern continues.  Even taking heap dumps and using tools like visualvm shows that these instances have "no GC root", and I have even checked the gc log files to see that the CMS collector does occasionally run.  However these objects never seem to get collected.

I realize this data may not be enough to narrow down the issue, but what I am looking for is some kind of help/input/hints/suggestions on what I should be trying to figure out why these instances aren't GCed.  Is this something that's expected in certain situations?

-Jaikiran

From leo.korinth at oracle.com  Fri Nov 23 17:07:14 2018
From: leo.korinth at oracle.com (Leo Korinth)
Date: Fri, 23 Nov 2018 18:07:14 +0100
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
In-Reply-To: 
References: 
Message-ID: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com>

Hi!

On 23/11/2018 14:55, Jaikiran Pai wrote:
> Hi,
>
> I'm looking for some inputs in debugging a high memory usage issue (and
> subsequently the process being killed) in one of the applications I deal
> with. Given what I have looked into this issue so far, it appears to be
> something to do with the CMS collector, so I hope this is the right place
> for this question.
>
> A bit of background - the application that I'm dealing with is
> ElasticSearch server version 1.7.5. We use Java 8:
>
> java version "1.8.0_172"
> Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)
>
> To add to the complexity in debugging this issue, this runs as a docker
> container on docker version 18.03.0-ce on a CentOS 7 host VM, kernel
> version 3.10.0-693.5.2.el7.x86_64.
>
> We have been noticing that this container/process keeps getting killed
> by the oom-killer every few days. The dmesg logs suggest that the
> process has hit the "limits" set on the docker cgroups level. After
> debugging this over the past day or so, I've reached a point where I
> can't make much sense of the data I'm looking at. The JVM process is
> started using the following params (of relevance):
>
> java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....

What is your limit? You will need _more_ than 6 gig of memory; exactly how much is hard to say. You will probably find the limit faster if you use Xms==Xmx and maybe not need days of running the application.

> As you can see it uses the CMS collector with 75% of tenured/old gen for
> initiating the GC.
>
> After a few hours/days of running I notice that even though the CMS
> collector does run almost every hour or so, there are a huge number of
> objects _with no GC roots_ that never get collected. These objects
> internally seem to hold on to ByteBuffer(s) which (from what I see) as a
> result never get released, and the non-heap memory keeps building up,
> till the process gets killed. To give an example, here's the jmap -histo
> output (only relevant parts):
>
>    1:        861642      196271400  [B
>    2:        198776       28623744  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
>    3:        676722       21655104  org.apache.lucene.store.ByteArrayDataInput
>    4:        202398       19430208  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
>    5:        261819       18850968  org.apache.lucene.util.fst.FST$Arc
>    6:        178661       17018376  [C
>    7:         31452       16856024  [I
>    8:        203911        8049352  [J
>    9:         85700        5484800  java.nio.DirectByteBufferR
>   10:        168935        5405920  java.util.concurrent.ConcurrentHashMap$Node
>   11:         89948        5105328  [Ljava.lang.Object;
>   12:        148514        4752448  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
>
> ....
>
> Total      5061244      418712248
>
> This above output is without the "live" option. Running jmap -histo:live
> returns something like (again only relevant parts):
>
>   13:         31753        1016096  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
>   ...
>   44:           887         127728  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
>   ...
>   50:          3054          97728  org.apache.lucene.store.ByteArrayDataInput
>   ...
>   59:           888          85248  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
>
>   Total      1177783      138938920
>
> Notice the vast difference between the live and non-live instances of
> the same class. This isn't just in one "snapshot". I have been
> monitoring this for more than a day and this pattern continues. Even
> taking heap dumps and using tools like visualvm shows that these
> instances have "no GC root", and I have even checked the gc log files to
> see that the CMS collector does occasionally run. However these objects
> never seem to get collected.

What makes you believe they never get collected?

> I realize this data may not be enough to narrow down the issue, but what
> I am looking for is some kind of help/input/hints/suggestions on what I
> should be trying to figure out why these instances aren't GCed. Is this
> something that's expected in certain situations?

Being killed by the oom-killer suggests that your limit is too low and/or your -Xmx is too large. If an increasing number of objects does not get collected, you would get an exception (not being killed by the oom-killer). What likely is happening is that your java heap slowly grows (not unusual with CMS that does not do compaction of old objects) and that the memory consumed by your docker image exceeds your limit.

How big the limit should be is hard to tell, but it _must_ be larger than your "-Xmx" (the JVM is using more memory than the java heap, so this would be true even without the addition of docker).

Thanks, Leo

> -Jaikiran
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From jai.forums2013 at gmail.com  Sat Nov 24 01:35:31 2018
From: jai.forums2013 at gmail.com (Jaikiran Pai)
Date: Sat, 24 Nov 2018 07:05:31 +0530
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
In-Reply-To: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com>
References: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com>
Message-ID: <3b85eaab-5c31-2568-522b-59ec8bfcdfc2@gmail.com>

Hello Leo,

Thank you for responding. Replies inline.

On 23/11/18 10:37 PM, Leo Korinth wrote:
> Hi!
>
> On 23/11/2018 14:55, Jaikiran Pai wrote:
>>
>> java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....
>
> What is your limit?

The docker cgroups limit (set via --memory) is set to 8G.  Here's the docker stats output of that container right now (been running for around 16 hours now):

CONTAINER ID        NAME            CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
da5ee21b1d          elasticsearch   94.62%              3.657GiB / 8GiB     45.71%              0B / 0B             1.21MB / 18.2GB     227

> You will need _more_ than 6 gig of memory; exactly how much is hard to
> say.
> You will probably find the limit faster if you use Xms==Xmx and
> maybe not need days of running the application.

That's interesting and useful - I'll give that a try.  That will certainly help speed up the investigation.

>
>> Notice the vast difference between the live and non-live instances of
>> the same class. This isn't just in one "snapshot". I have been
>> monitoring this for more than a day and this pattern continues. Even
>> taking heap dumps and using tools like visualvm shows that these
>> instances have "no GC root" and I have even checked the gc log files to
>> see that the CMS collector does occasionally run. However these objects
>> never seem to get collected.
>
> What makes you believe they never get collected?

I have gc logs enabled on this setup.  I notice that the Full GC gets triggered once in a while.  However, even after that Full GC completes, I still see this vast amount of "non-live" objects staying around in the heap (via jmap -histo output as well as a heap dump using visualvm).

This is the "total" in heap via jmap -histo:

            #instances     #bytes
Total       22880189       1644503392 (around 1.5 G)

and this is the -histo:live

Total       1292631        102790184 (around 98M, not even 100M)

Some of these non-live objects hold on to the ByteBuffer(s) which keep filling up the non-heap memory too (right now the non-heap "mapped" ByteBuffer memory as shown in the JMX MBean is around 2.5G).  The Full GC log message looks like this:

2018-11-24T00:57:00.665+0000: 59655.295: [Full GC (Heap Inspection Initiated GC) 2018-11-24T00:57:00.665+0000: 59655.295: [CMS: 711842K->101527K(989632K), 0.53220752 secs] 1016723K->101527K(1986432K), [Metaspace: 48054K->48054K(1093632K)], 0.5325692 secs] [Times: user=0.53 sys=0.00, real=0.53 secs]

The jmap -heap output is this:

Server compiler detected.
JVM version is 25.172-b11

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 6442450944 (6144.0MB)
   NewSize                  = 1134100480 (1081.5625MB)
   MaxNewSize               = 1134100480 (1081.5625MB)
   OldSize                  = 1013383168 (966.4375MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1020723200 (973.4375MB)
   used     = 900988112 (859.2492218017578MB)
   free     = 119735088 (114.18827819824219MB)
   88.26958297802969% used
Eden Space:
   capacity = 907345920 (865.3125MB)
   used     = 879277776 (838.5446319580078MB)
   free     = 28068144 (26.767868041992188MB)
   96.9065663512324% used
From Space:
   capacity = 113377280 (108.125MB)
   used     = 21710336 (20.70458984375MB)
   free     = 91666944 (87.42041015625MB)
   19.148753612716764% used
To Space:
   capacity = 113377280 (108.125MB)
   used     = 0 (0.0MB)
   free     = 113377280 (108.125MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 1013383168 (966.4375MB)
   used     = 338488264 (322.8075637817383MB)
   free     = 674894904 (643.6299362182617MB)
   33.40180443968061% used

14231 interned Strings occupying 2258280 bytes.

So based on these various logs and data, I notice that they aren't being collected.  At the time when the process gets chosen to be killed, even then I see these vast non-live objects holding on.  I don't have too much knowledge in the GC area, so a genuine question - Is it just a wrong expectation that whenever a Full GC runs, it is supposed to clear out the non-live objects?  Or is it a usual thing that the collector doesn't choose them for collection for some reason?

>
>> I realize this data may not be enough to narrow down the issue, but what
>> I am looking for is some kind of help/input/hints/suggestions on what I
>> should be trying to figure out why these instances aren't GCed. Is this
>> something that's expected in certain situations?
>
> Being killed by the oom-killer suggests that your limit is too low
> and/or your -Xmx is too large. If an increasing number of objects does
> not get collected, you would get an exception (not being killed by the
> oom-killer). What likely is happening is that your java heap slowly
> grows (not unusual with CMS that does not do compaction of old
> objects) and that the memory consumed by your docker image exceeds
> your limit.

You are right.  The heap usage (taking into consideration the uncollected non-live objects) does indeed grow very slowly.  There however aren't too many (live) objects on the heap really (even after days, when this process is about to be killed).  I'll have to read up on the compaction that you mentioned about CMS.

Here's one example output from dmesg when this process was killed previously:

[Wed Aug  8 03:21:13 2018] java invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-600
[Wed Aug  8 03:21:13 2018] java cpuset=390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 mems_allowed=0
[Wed Aug  8 03:21:13 2018] CPU: 7 PID: 15827 Comm: java Not tainted 3.10.0-693.5.2.el7.x86_64 #1
...
[Wed Aug  8 03:21:13 2018]  ffff88081472dee0 00000000fe941297 ffff8808062cfcb8 ffffffff816a3e51
[Wed Aug  8 03:21:13 2018]  ffff8808062cfd48 ffffffff8169f246 ffff88080d6d3500 0000000000000001
[Wed Aug  8 03:21:13 2018]  0000000000000000 0000000000000000 ffff8808062cfcf8 0000000000000046
[Wed Aug  8 03:21:13 2018] Call Trace:
[Wed Aug  8 03:21:13 2018]  [] dump_stack+0x19/0x1b
[Wed Aug  8 03:21:13 2018]  [] dump_header+0x90/0x229
[Wed Aug  8 03:21:13 2018]  [] ? find_lock_task_mm+0x56/0xc0
[Wed Aug  8 03:21:13 2018]  [] ? try_get_mem_cgroup_from_mm+0x28/0x60
[Wed Aug  8 03:21:13 2018]  [] oom_kill_process+0x254/0x3d0
[Wed Aug  8 03:21:13 2018]  [] ? selinux_capable+0x1c/0x40
[Wed Aug  8 03:21:13 2018]  [] mem_cgroup_oom_synchronize+0x546/0x570
[Wed Aug  8 03:21:13 2018]  [] ? mem_cgroup_charge_common+0xc0/0xc0
[Wed Aug  8 03:21:13 2018]  [] pagefault_out_of_memory+0x14/0x90
[Wed Aug  8 03:21:13 2018]  [] mm_fault_error+0x68/0x12b
[Wed Aug  8 03:21:13 2018]  [] __do_page_fault+0x391/0x450
[Wed Aug  8 03:21:13 2018]  [] do_page_fault+0x35/0x90
[Wed Aug  8 03:21:13 2018]  [] page_fault+0x28/0x30
[Wed Aug  8 03:21:13 2018] Task in /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 killed as a result of limit of /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5
[Wed Aug  8 03:21:13 2018] memory: usage 8388608kB, limit 8388608kB, failcnt 426669
[Wed Aug  8 03:21:13 2018] memory+swap: usage 8388608kB, limit 8388608kB, failcnt 152
[Wed Aug  8 03:21:13 2018] kmem: usage 5773512kB, limit 9007199254740988kB, failcnt 0
[Wed Aug  8 03:21:13 2018] Memory cgroup stats for /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5: cache:14056KB rss:2601040KB rss_huge:2484224KB mapped_file:6428KB swap:0KB inactive_anon:0KB active_anon:142640KB inactive_file:4288KB active_file:3324KB unevictable:2464692KB
[Wed Aug  8 03:21:13 2018] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Wed Aug  8 03:21:13 2018] [15582]  1000 15582  3049827   654465    1446        0          -600 java
[Wed Aug  8 03:21:13 2018] Memory cgroup out of memory: Kill process 12625 (java) score 0 or sacrifice child
[Wed Aug  8 03:21:13 2018] Killed process 15582 (java) total-vm:12199308kB, anon-rss:2599320kB, file-rss:18540kB, shmem-rss:0kB

> How big the limit should be is hard to tell, but it _must_ be larger
> than your "-Xmx" (the JVM is using more memory than the java heap, so
> this would be true even without the addition of docker).

Agreed, and that's where the ByteBuffer(s) are coming into the picture (around 2.5 GB right now and growing steadily).

-Jaikiran

From leo.korinth at oracle.com  Mon Nov 26 10:03:26 2018
From: leo.korinth at oracle.com (Leo Korinth)
Date: Mon, 26 Nov 2018 11:03:26 +0100
Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots
In-Reply-To: <3b85eaab-5c31-2568-522b-59ec8bfcdfc2@gmail.com>
References: <02dabb9d-a87f-3d4a-cfbc-9ef1eff0b11f@oracle.com> <3b85eaab-5c31-2568-522b-59ec8bfcdfc2@gmail.com>
Message-ID: <6ebac3fe-2f69-2ed3-980d-6369a2151005@oracle.com>

Hi!

> The docker cgroups limit (set via --memory) is set to 8G. Here's the
> docker stats output of that container right now (been running for around
> 16 hours now):

Okay, so you have about 2 gig to use outside the java heap.  The docker image will use some space.  The JVM will use memory (off the java heap) for compiler caches, threads, class metadata, mmapped files (_mmap is used by elastic_) and other things.  I cannot give you a good estimate of how much this is, but a quick look at:

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

suggests that you should give at most half the memory to the java heap.  If you give 6 gig to the heap, they seem to suggest 12 gig of memory (instead of 8).  Now, I do not know enough about elastic search to say whether this is sufficient, and that guide does not seem to mention docker.

I would suggest that you try to find a configuration that also _limits_ the use of file cache for elastic.  If elastic does not understand that it is running under docker, it _might_ use huge file caches.  Try to limit the file caches and give the docker image x gig for the java heap, y gig for caches and some extra for slack.

Hope this helps.

Thanks, Leo

> CONTAINER ID        NAME            CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> da5ee21b1d          elasticsearch   94.62%              3.657GiB / 8GiB     45.71%              0B / 0B             1.21MB / 18.2GB     227
>> You will need _more_ than 6 gig of memory; exactly how much is hard to
>> say. You will probably find the limit faster if you use Xms==Xmx and
>> maybe not need days of running the application.
> That's interesting and useful - I'll give that a try. That will
> certainly help speed up the investigation.
>
>>> Notice the vast difference between the live and non-live instances of
>>> the same class. This isn't just in one "snapshot". I have been
>>> monitoring this for more than a day and this pattern continues.
Even >>> taking heap dumps and using tools like visualvm shows that these >>> instances have "no GC root" and I have even checked the gc log files to >>> see that the CMS collector does occasionally run. However these objects >>> never seem to get collected. >> >> What makes you believe they never get collected? > I have gc logs enabled on this setup. I notice that the Full GC gets > triggered once in a while. However, even after that Full GC completes, I > still see these vast amount of "non-live" objects staying around in the > heap (via jmap -histo output as well as a heap dump using visualvm). > This is the "total" in heap via jmap -histo: > > ??? ??? ?? #instances ? #bytes > Total????? 22880189???? 1644503392 (around 1.5 G) > > and this is the -histo:live > > Total?????? 1292631????? 102790184 (around 98M, not even 100M) > > > Some of these non-live objects hold on to the ByteBuffer(s) which keep > filling up then non-heap memory too (right now the non-heap "mapped" > ByteBuffer memory as shown in the JMX MBean is around 2.5G). The Full GC > log message looks like this: > > 2018-11-24T00:57:00.665+0000: 59655.295: [Full GC (Heap Inspection > Initiated GC)) > ?2018-11-24T00:57:00.665+0000: 59655.295: [CMS: > 711842K->101527K(989632K), 0.5322 > 0752 secs] 1016723K->101527K(1986432K), [Metaspace: > 48054K->48054K(1093632K)], 00 > .5325692 secs] [Times: user=0.53 sys=0.00, real=0.53 secs] > > The jmap -heap output is this: > > Server compiler detected. > JVM version is 25.172-b11 > > using parallel threads in the new generation. > using thread-local object allocation. > Concurrent Mark-Sweep GC > > Heap Configuration: > ?? MinHeapFreeRatio???????? = 40 > ?? MaxHeapFreeRatio???????? = 70 > ?? MaxHeapSize????????????? = 6442450944 (6144.0MB) > ?? NewSize????????????????? = 1134100480 (1081.5625MB) > ?? MaxNewSize?????????????? = 1134100480 (1081.5625MB) > ?? OldSize????????????????? = 1013383168 (966.4375MB) > ?? NewRatio???????????????? = 2 > ?? SurvivorRatio??????????? = 8 > ?? MetaspaceSize??????????? = 21807104 (20.796875MB) > ?? CompressedClassSpaceSize = 1073741824 (1024.0MB) > ?? MaxMetaspaceSize???????? = 17592186044415 MB > ?? G1HeapRegionSize???????? = 0 (0.0MB) > > Heap Usage: > New Generation (Eden + 1 Survivor Space): > ?? capacity = 1020723200 (973.4375MB) > ?? used???? = 900988112 (859.2492218017578MB) > ?? free???? = 119735088 (114.18827819824219MB) > ?? 88.26958297802969% used > Eden Space: > ?? capacity = 907345920 (865.3125MB) > ?? used???? = 879277776 (838.5446319580078MB) > ?? free???? = 28068144 (26.767868041992188MB) > ?? 96.9065663512324% used > From Space: > ?? capacity = 113377280 (108.125MB) > ?? used???? = 21710336 (20.70458984375MB) > ?? free???? = 91666944 (87.42041015625MB) > ?? 19.148753612716764% used > To Space: > ?? capacity = 113377280 (108.125MB) > ?? used???? = 0 (0.0MB) > ?? free???? = 113377280 (108.125MB) > ?? 0.0% used > concurrent mark-sweep generation: > ?? capacity = 1013383168 (966.4375MB) > ?? used???? = 338488264 (322.8075637817383MB) > ?? free???? = 674894904 (643.6299362182617MB) > ?? 33.40180443968061% used > > 14231 interned Strings occupying 2258280 bytes. > > So based on these various logs and data, I notice that they aren't being > collected. At the time when the process gets chosen to be killed, even > then I see these vast non-live objects holding on. 
I don't have too much > knowledge in the GC area, so a genuine question - Is it just a wrong > expectation that whenever a Full GC runs, it is supposed to clear out > the non-live objects? Or is it a usual thing that the collector doesn't > choose them for collection for some reason? >> >>> >>> I realize this data may not be enough to narrow down the issue, but what >>> I am looking for is some kind of help/input/hints/suggestions on what I >>> should be trying to figure out why these instances aren't GCed. Is this >>> something that's expected in certain situations? >> >> Being killed by the oom-killer suggests that your limit is too low >> and/or your -Xmx is too large. If an increasing number of objects does >> not get collected, you would get an exception (not being killed by the >> oom-killer). What likely is happening is that your java heap slowly >> grows (not unusual with CMS that does not do compaction of old >> objects) and that the memory consumed by your docker image exceeds >> your limit. > You are right. The heap usage (keeping into consideration the > uncollected non-live objects) does indeed grow very slowly. There > however aren't too many (live) objects on heap really (even after days > when this process is about to be killed). I'll have to read up on the > compaction that you mentioned about CMS. > > Here's one example output from dmesg when this process was killed > previously: > > [Wed Aug? 8 03:21:13 2018] java invoked oom-killer: gfp_mask=0xd0, > order=0, oom_score_adj=-600 > [Wed Aug? 8 03:21:13 2018] java > cpuset=390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 > mems_allowed=0 > [Wed Aug? 8 03:21:13 2018] CPU: 7 PID: 15827 Comm: java Not tainted > 3.10.0-693.5.2.el7.x86_64 #1 > ... > [Wed Aug? 8 03:21:13 2018]? ffff88081472dee0 00000000fe941297 > ffff8808062cfcb8 ffffffff816a3e51 > [Wed Aug? 8 03:21:13 2018]? ffff8808062cfd48 ffffffff8169f246 > ffff88080d6d3500 0000000000000001 > [Wed Aug? 8 03:21:13 2018]? 0000000000000000 0000000000000000 > ffff8808062cfcf8 0000000000000046 > [Wed Aug? 8 03:21:13 2018] Call Trace: > [Wed Aug? 8 03:21:13 2018]? [] dump_stack+0x19/0x1b > [Wed Aug? 8 03:21:13 2018]? [] dump_header+0x90/0x229 > [Wed Aug? 8 03:21:13 2018]? [] ? > find_lock_task_mm+0x56/0xc0 > [Wed Aug? 8 03:21:13 2018]? [] ? > try_get_mem_cgroup_from_mm+0x28/0x60 > [Wed Aug? 8 03:21:13 2018]? [] > oom_kill_process+0x254/0x3d0 > [Wed Aug? 8 03:21:13 2018]? [] ? selinux_capable+0x1c/0x40 > [Wed Aug? 8 03:21:13 2018]? [] > mem_cgroup_oom_synchronize+0x546/0x570 > [Wed Aug? 8 03:21:13 2018]? [] ? > mem_cgroup_charge_common+0xc0/0xc0 > [Wed Aug? 8 03:21:13 2018]? [] > pagefault_out_of_memory+0x14/0x90 > [Wed Aug? 8 03:21:13 2018]? [] mm_fault_error+0x68/0x12b > [Wed Aug? 8 03:21:13 2018]? [] __do_page_fault+0x391/0x450 > [Wed Aug? 8 03:21:13 2018]? [] do_page_fault+0x35/0x90 > [Wed Aug? 8 03:21:13 2018]? [] page_fault+0x28/0x30 > [Wed Aug? 8 03:21:13 2018] Task in > /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 > killed as a result of limit of > /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5 > [Wed Aug? 8 03:21:13 2018] memory: usage 8388608kB, limit 8388608kB, > failcnt 426669 > [Wed Aug? 8 03:21:13 2018] memory+swap: usage 8388608kB, limit > 8388608kB, failcnt 152 > [Wed Aug? 8 03:21:13 2018] kmem: usage 5773512kB, limit > 9007199254740988kB, failcnt 0 > [Wed Aug? 
8 03:21:13 2018] Memory cgroup stats for > /docker/390dfe142d30d73b43a35996ae11b9647fc0598651c8959e95788edbbc9916b5: > cache:14056KB rss:2601040KB rss_huge:2484224KB mapped_file:6428KB > swap:0KB inactive_anon:0KB active_anon:142640KB inactive_file:4288KB > active_file:3324KB unevictable:2464692KB > [Wed Aug? 8 03:21:13 2018] [ pid ]?? uid? tgid total_vm????? rss nr_ptes > swapents oom_score_adj name > [Wed Aug? 8 03:21:13 2018] [15582]? 1000 15582? 3049827?? 654465 > 1446??????? 0????????? -600 java > [Wed Aug? 8 03:21:13 2018] Memory cgroup out of memory: Kill process > 12625 (java) score 0 or sacrifice child > [Wed Aug? 8 03:21:13 2018] Killed process 15582 (java) > total-vm:12199308kB, anon-rss:2599320kB, file-rss:18540kB, shmem-rss:0kB > > >> How big the limit should be is hard to tell, but it _must_ be larger >> than your "-Xmx" (the JVM is using more memory than the java heap, so >> this would be true even without the addition of docker). > Agreed and that's where the ByteBuffer(s) are coming into picture > (around 2.5 GB right now and growing steadily). > > -Jaikiran > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > From jwhiting at redhat.com Mon Nov 26 10:26:41 2018 From: jwhiting at redhat.com (jwhiting at redhat.com) Date: Mon, 26 Nov 2018 10:26:41 +0000 Subject: Java 8 + Docker container - CMS collector leaves around instances that have no GC roots In-Reply-To: References: Message-ID: <5604aaaf008ce26e313e5e7ad7fa1ae4844afbac.camel@redhat.com> Hi Jaikiran Have a look at some blog posts by old friends :) These blog posts might be helpful (along with the other replies you received) to diagnose the root cause of the issue. In particular native memory tracking. https://developers.redhat.com/blog/2017/03/14/java-inside-docker/ https://developers.redhat.com/blog/2017/04/04/openjdk-and-containers/ Regards, Jeremy On Fri, 2018-11-23 at 19:25 +0530, Jaikiran Pai wrote: > Hi, > > I'm looking for some inputs in debugging a high memory usage issue > (and > subsequently the process being killed) in one of the applications I > deal > with. Given that from what I have looked into this issue so far, this > appears to be something to do with the CMS collector, so I hope this > is > the right place to this question. > > A bit of a background - The application that I'm dealing with is > ElasticSearch server version 1.7.5. We use Java 8: > > java version "1.8.0_172" > Java(TM) SE Runtime Environment (build 1.8.0_172-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode) > > To add to the complexity in debugging this issue, this runs as a > docker > container on docker version 18.03.0-ce on a CentOS 7 host VM kernel > version 3.10.0-693.5.2.el7.x86_64. > > We have been noticing that this container/process keeps getting > killed > by the oom-killer every few days. The dmesg logs suggest that the > process has hit the "limits" set on the docker cgroups level. After > debugging this over past day or so, I've reached a point where I > can't > make much sense of the data I'm looking at. The JVM process is > started > using the following params (of relevance): > > java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC .... > > As you can see it uses CMS collector with 75% of tenured/old gen for > initiating the GC. 
> > After a few hours/days of running I notice that even though the CMS > collector does run almost every hour or so, there are huge number of > objects _with no GC roots_ that never get collected. These objects > internally seem to hold on to ByteBuffer(s) which (from what I see) > as a > result never get released and the non-heap memory keeps building up, > till the process gets killed. To give an example, here's the jmap > -histo > output (only relevant parts): > > 1: 861642 196271400 [B > 2: 198776 28623744 > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame > 3: 676722 21655104 > org.apache.lucene.store.ByteArrayDataInput > 4: 202398 19430208 > org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTerm > State > 5: 261819 18850968 > org.apache.lucene.util.fst.FST$Arc > 6: 178661 17018376 [C > 7: 31452 16856024 [I > 8: 203911 8049352 [J > 9: 85700 5484800 java.nio.DirectByteBufferR > 10: 168935 5405920 > java.util.concurrent.ConcurrentHashMap$Node > 11: 89948 5105328 [Ljava.lang.Object; > 12: 148514 4752448 > org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference > > .... > > Total 5061244 418712248 > > This above output is without the "live" option. Running jmap > -histo:live > returns something like (again only relevant parts): > > 13: 31753 1016096 > org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference > ... > 44: 887 127728 > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame > ... > 50: 3054 97728 > org.apache.lucene.store.ByteArrayDataInput > ... > 59: 888 85248 > org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTerm > State > > Total 1177783 138938920 > > > Notice the vast difference between the live and non-live instances of > the same class. This isn't just in one "snapshot". I have been > monitoring this for more than a day and this pattern continues. Even > taking heap dumps and using tools like visualvm shows that these > instances have "no GC root" and I have even checked the gc log files > to > see that the CMS collector does occasionally run. However these > objects > never seem to get collected. > > I realize this data may not be enough to narrow down the issue, but > what > I am looking for is some kind of help/input/hints/suggestions on what > I > should be trying to figure out why these instances aren't GCed. Is > this > something that's expected in certain situations? > > -Jaikiran > > > > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -- -- Jeremy Whiting Senior Software Engineer, Middleware Performance Team Red Hat ------------------------------------------------------------ Registered Address: Red Hat UK Ltd, Peninsular House, 30 Monument Street, London. United Kingdom. Registered in England and Wales under Company Registration No. 03798903. Directors: Directors:Michael Cunningham (US), Michael O'Neill(Ireland), Eric Shander (US)