CMS major GC doesn't clean up heap enough

Fri Jun 14 21:32:28 UTC 2019

Thanks for the quick response Bernd!

The app is Hadoop HDFS Namenode (not sure if you know something about it)

For your points:

- try to check heapdump or object histogram to get a feeling for the memory usage 
  We are doing it over the weekend and hopefully we can get some idea about untraceable objects. Any other things we need to pay attention to?

- if you have some active caching which reacts on settings, memory usage or available ram check if those settings are ok
- check the finalizer queue

  Not sure about the above too, I don’t think we have active caching based on our JVM options. I need to check the finalizer queues if possible.

- there might be not enough FullGCs to free finalizer, soft references, native buffers or class(loaders)

 There is no Full GC in the GC log and TBH we don’t want Full GC since this will cause long delay which is not acceptable.

I evaluated G1GC but it doesn’t work well especially it needs another 50% off heap memory size for RS and CS.

Thanks,
Fengnan

> On Jun 14, 2019, at 2:25 PM, Bernd Eckenfels <ecki at zusammenkunft.net> wrote:
> 
> Hello,
> 
> Hard to say  without knowing the app, but I can give you a few hints:
> 
> - try to check heapdump or object histogram to get a feeling for the memory usage
> - if you have some active caching which reacts on settings, memory usage or available ram check if those settings are ok
> - check the finalizer queue
> - there might be not enough FullGCs to free finalizer, soft references, native buffers or class(loaders)
> 
> There might be some pathological cases where GC falls flat, but it is rather unlikely compared to application errors.
> 
> Did you tried G1 yet?
> 
> Gruss
> Bernd
> 
> 
> --
> http://bernd.eckenfels.net
>  
> Von: hotspot-gc-use <hotspot-gc-use-bounces at openjdk.java.net> im Auftrag von Fengnan Li <lfengnan at uber.com>
> Gesendet: Freitag, Juni 14, 2019 11:12 PM
> An: hotspot-gc-use at openjdk.java.net
> Betreff: CMS major GC doesn't clean up heap enough
>  
> Hi, 
> 
> I am using CMS in in the app with very large heap size (200G), and I found that when the major GC was triggered, not much objects were cleaned up, ending with almost the same heap usage. Can someone help? 
> 
> Our JVM option: 
> 
> -XX:+CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=40 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:GCLogFileSize=104857600 -XX:InitialHeapSize=214748364800 -XX:+ManagementServer -XX:MaxHeapSize=214748364800 -XX:MaxNewSize=17179869184 -XX:MaxTenuringThreshold=6 -XX:NewSize=17179869184 -XX:OldPLABSize=16 -XX:ParGCCardsPerStrideChunk=32768 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseCondCardMark -XX:+UseGCLogFileRotation -XX:+UseParNewGC 
> 
> Some part of the Major GC log: 
> 
> 2019-05-06T17:32:46.822+0000: 485.965: [GC (CMS Initial Mark) [1 CMS-initial-mark: 77254857K(192937984K)] 79190292K(208037504K), 0.1421191 secs] [Times: user=1.96 sys=0.04, real=0.14 secs] 
> 2019-05-06T17:32:46.964+0000: 486.108: [CMS-concurrent-mark-start] 
> 2019-05-06T17:32:52.757+0000: 491.900: [GC (Allocation Failure) 2019-05-06T17:32:52.757+0000: 491.900: [ParNew: 15099520K->1677696K(15099520K), 2.5151990 secs] 92354377K->80544859K(208037504K), 2.5153318 secs] [Times: user=50.76 sys=1.14, real=2.51 secs] 
> 2019-05-06T17:33:01.126+0000: 500.269: [GC (Allocation Failure) 2019-05-06T17:33:01.126+0000: 500.270: [ParNew: 15099520K->1677696K(15099520K), 2.2934176 secs] 93966683K->81938923K(208037504K), 2.2935316 secs] [Times: user=46.49 sys=1.09, real=2.29 secs] 
> 2019-05-06T17:33:09.430+0000: 508.573: [GC (Allocation Failure) 2019-05-06T17:33:09.430+0000: 508.574: [ParNew: 15099520K->1677696K(15099520K), 1.8824589 secs] 95360747K->83297959K(208037504K), 1.8825875 secs] [Times: user=33.19 sys=1.00, real=1.89 secs] 
> 2019-05-06T17:33:13.116+0000: 512.259: [CMS-concurrent-mark: 19.456/26.152 secs] [Times: user=265.04 sys=5.29, real=26.15 secs] 
> 2019-05-06T17:33:13.116+0000: 512.260: [CMS-concurrent-preclean-start] 
> 2019-05-06T17:33:14.382+0000: 513.526: [GC (Allocation Failure) 2019-05-06T17:33:14.382+0000: 513.526: [ParNew: 15099520K->1345724K(15099520K), 2.2104188 secs] 96719783K->84244360K(208037504K), 2.2105118 secs] [Times: user=42.36 sys=0.88, real=2.21 secs] 
> 2019-05-06T17:33:18.620+0000: 517.764: [CMS-concurrent-preclean: 3.284/5.504 secs] [Times: user=48.81 sys=1.03, real=5.51 secs] 
> 2019-05-06T17:33:18.620+0000: 517.764: [CMS-concurrent-abortable-preclean-start] 
> 2019-05-06T17:33:21.094+0000: 520.238: [GC (Allocation Failure) 2019-05-06T17:33:21.094+0000: 520.238: [ParNew: 14767548K->1677696K(15099520K), 1.7077210 secs] 97666184K->84979595K(208037504K), 1.7078275 secs] [Times: user=36.05 sys=0.39, real=1.71 secs] 
> 2019-05-06T17:33:23.280+0000: 522.424: [CMS-concurrent-abortable-preclean: 2.951/4.660 secs] [Times: user=41.72 sys=0.63, real=4.65 secs] 
> 2019-05-06T17:33:23.281+0000: 522.424: [GC (CMS Final Remark) [YG occupancy: 3134013 K (15099520 K)]2019-05-06T17:33:23.281+0000: 522.424: [Rescan (parallel) , 0.1699867 secs]2019-05-06T17:33:23.451+0000: 522.594: [weak refs processing, 0.0001797 secs]2019-05-06T17:33:23.451+0000: 522.594: [class unloading, 0.0121369 secs]2019-05-06T17:33:23.463+0000: 522.607: [scrub symbol table, 0.0050611 secs]2019-05-06T17:33:23.468+0000: 522.612: [scrub string table, 0.0007207 secs][1 CMS-remark: 83301899K(192937984K)] 86435913K(208037504K), 0.1900326 secs] [Times: user=3.61 sys=0.03, real=0.18 secs] 
> 2019-05-06T17:33:23.471+0000: 522.614: [CMS-concurrent-sweep-start] 
> 2019-05-06T17:33:27.826+0000: 526.969: [GC (Allocation Failure) 2019-05-06T17:33:27.826+0000: 526.969: [ParNew: 15099520K->1677696K(15099520K), 2.5180474 secs] 98390054K->86741207K(208037504K), 2.5181662 secs] [Times: user=45.81 sys=1.33, real=2.52 secs] 
> 2019-05-06T17:33:34.762+0000: 533.906: [GC (Allocation Failure) 2019-05-06T17:33:34.762+0000: 533.906: [ParNew: 15099520K->1641794K(15099520K), 1.9212927 secs] 100160159K->87911207K(208037504K), 1.9214017 secs] [Times: user=36.08 sys=0.88, real=1.92 secs] 
> 2019-05-06T17:33:41.271+0000: 540.414: [GC (Allocation Failure) 2019-05-06T17:33:41.271+0000: 540.414: [ParNew: 15063618K->1677696K(15099520K), 2.3496494 secs] 101330218K->89115463K(208037504K), 2.3497471 secs] [Times: user=46.58 sys=0.87, real=2.36 secs] 
> 2019-05-06T17:33:47.963+0000: 547.107: [GC (Allocation Failure) 2019-05-06T17:33:47.963+0000: 547.107: [ParNew: 15099520K->1677696K(15099520K), 2.0292300 secs] 102536721K->90319393K(208037504K), 2.0293108 secs] [Times: user=38.49 sys=0.84, real=2.03 secs] 
> 2019-05-06T17:33:50.641+0000: 549.784: [CMS-concurrent-sweep: 18.349/27.170 secs] [Times: user=202.69 sys=4.94, real=27.18 secs] 
> 2019-05-06T17:33:50.641+0000: 549.784: [CMS-concurrent-reset-start] 
> 2019-05-06T17:33:51.305+0000: 550.448: [CMS-concurrent-reset: 0.663/0.664 secs] [Times: user=1.28 sys=0.04, real=0.66 secs] 
> 2019-05-06T17:33:53.304+0000: 552.448: [GC (CMS Initial Mark) [1 CMS-initial-mark: 88641697K(192937984K)] 100346893K(208037504K), 0.2599963 secs] [Times: user=4.83 sys=0.03, real=0.26 secs] 
> 
> I am not sure if this is indicating some memory leak or GC is just not working properly. I have the same app in another machine running with the same JVM option works fine under 1/10th of the traffic as this one. 
> 
> Thanks, 
> Fengnan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20190614/e8e52fcb/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2710 bytes
Desc: not available
URL: <https://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20190614/e8e52fcb/smime-0001.p7s>