From gnormington at pivotal.io Thu Nov 9 14:52:36 2017 From: gnormington at pivotal.io (Glyn Normington) Date: Thu, 9 Nov 2017 14:52:36 +0000 Subject: Determining "GC" memory area size Message-ID: tl;dr: How is the size of the "GC" memory area (reported by NMT) determined? The open source project I work on is running Java applications in Linux containers which result in processes being killed when the container's defined memory size, essentially in terms of pages of RAM, is exceeded. When this happens, users don't get any reasonable feedback to know whether the heap, metaspace, etc. is the problem and what to do about it. We have two components which attempt to help with this situation: 1. Java memory calculator ( https://github.com/cloudfoundry/java-buildpack-memory-calculator) This takes the container memory size together with an estimate of the number of loaded classes and threads consumed by the application and sets various JVM memory settings such as heap, metaspace, etc. The goal is to prevent the JVM from using more memory than the container has available, so that container OOM does not occur and if the JVM runs out of memory, it does so in a diagnosable way. 2. jvmkill JVMTI agent (https://github.com/cloudfoundry/jvmkill) When the JVM hits a resource exhaustion event, due either to lack of memory or threads, this agent prints various diagnostics to help the user decide what needs to be done to avoid the problem in future. If a threshold is exceeded, the agent then kills the JVM, otherwise the agent returns to the JVM and allows OutOfMemoryError to be thrown. One of our users recently found (see [1] below for details) that the memory calculator is not taking the "GC" memory area into account. Consequently, a JVM can exceed the container's memory size which means the user doesn't get any helpful diagnostics from either jvmkill or an OutOfMemoryError. Using NMT, the user observed that "GC" memory seems to be about 5% of the heap size for heaps of a few GB in size. Can anyone here tell me how the GC memory area size is determined? If there is documentation, so much the better as we'd prefer not to depend on code details that might flux arbitrarily. -- Regards, Glyn [1] https://github.com/cloudfoundry/java-buildpack/issues/494 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tequilaron at gmail.com Mon Nov 13 19:12:34 2017 From: tequilaron at gmail.com (Ron Reynolds) Date: Mon, 13 Nov 2017 11:12:34 -0800 Subject: definition of a "TLAB refill"? Message-ID: sorry for the n00b or repeat question (if there is a good way to search the list archives please send me a URL) but i am trying to find a solid definition of a "TLAB refill" - most TLAB docs refer to the "refill" but don't actually define what it means. thanks. ....................ron. -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas.schatzl at oracle.com Mon Nov 13 20:16:31 2017 From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Mon, 13 Nov 2017 21:16:31 +0100 Subject: definition of a "TLAB refill"? In-Reply-To: References: Message-ID: <1510604191.7132.1.camel@oracle.com> Hi, On Mon, 2017-11-13 at 11:12 -0800, Ron Reynolds wrote: > sorry for the n00b or repeat question (if there is a good way to > search the list archives please send me a URL) but i am trying to > find a solid definition of a "TLAB refill" - most TLAB docs refer to > the "refill" but don't actually define what it means. > thanks. > ....................ron. 
a TLAB refill is the process of a (Java) thread, having filled its current TLAB with objects (or having some reason to give up the current one), getting a new, empty one for further allocation.

Where do you think in the (JDK) docs could be an appropriate place to add a few sentences of explanation?

Thanks,
Thomas

From tequilaron at gmail.com Mon Nov 13 20:51:16 2017
From: tequilaron at gmail.com (Ron Reynolds)
Date: Mon, 13 Nov 2017 12:51:16 -0800
Subject: definition of a "TLAB refill"?
In-Reply-To: <1510604191.7132.1.camel@oracle.com>
References: <1510604191.7132.1.camel@oracle.com>
Message-ID:

that's what i suspected it was but wasn't sure, even after reading thru one of the best docs i could find on TLABs - a 2006 blog posting (obviously pre-G1) at https://blogs.oracle.com/jonthecollector/the-real-thing
i think it would be a great addition to http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html
the rest of the best coverage of TLABs i've found is "Java Perf" by Oaks and "Java Perf Companion" by Hunt et al.
the rest of the online docs i've hunted thru (for both G1 in general and TLAB in particular) are all over the place; anywhere where someone is explaining the output of -XX:+PrintTLAB would make sense to me.

On Mon, Nov 13, 2017 at 12:16 PM, Thomas Schatzl wrote:
> Hi,
>
> On Mon, 2017-11-13 at 11:12 -0800, Ron Reynolds wrote:
> > sorry for the n00b or repeat question (if there is a good way to
> > search the list archives please send me a URL) but i am trying to
> > find a solid definition of a "TLAB refill" - most TLAB docs refer to
> > the "refill" but don't actually define what it means.
> > thanks.
> > ....................ron.
>
> a TLAB refill is the process of a (Java) thread, having filled its
> current TLAB with objects (or having some reason to give up the current
> one), getting a new, empty one for further allocation.
>
> Where do you think in the (JDK) docs could be an appropriate place to
> add a few sentences of explanation?
>
> Thanks,
> Thomas
>
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From jarkko.miettinen at relex.fi Tue Nov 14 13:08:07 2017
From: jarkko.miettinen at relex.fi (Jarkko Miettinen)
Date: Tue, 14 Nov 2017 15:08:07 +0200
Subject: Tracking potential GC bugs
In-Reply-To: <9194a30e-b1b1-c6fd-8942-2fc0580c2283@relexsolutions.com>
References: <9194a30e-b1b1-c6fd-8942-2fc0580c2283@relexsolutions.com>
Message-ID:

Any hints for some better forums to solicit advice?

On 20/09/2017 11.12, Jarkko Miettinen wrote:
> Hello all,
>
> I am not sure if this is the best forum for soliciting advice on how
> to hunt potential GC bugs, but this was the best I could come up with.
>
> Ideas about better forums are welcome.
>
> This post is about bugs
> https://bugs.openjdk.java.net/browse/JDK-8172756 and
> https://bugs.openjdk.java.net/browse/JDK-8143310
>
> both of which we're seeing when using G1 GC. We've seen this problem on
> 92/112/131/141 releases of JDK 8.
>
> Currently, we have a situation where we're usually able to reproduce
> the crash maybe once in three days by running the whole application
> and mimicking actual usage with scripts, with no hope in sight for any
> shorter / simpler reproduction.
>
> As the crash was in oopDesc::size(), we tried back-porting JDK-8168914
> even though our crash was elsewhere, adding a memory fence to
> reads/writes of the class and then trying to identify whether the actual
> pointed-to class was invalid (with Metaspace::contains(obj->klass())).
> These changes can be seen in this changeset:
> https://gist.github.com/jmiettinen/3ae14b2cfa509a0f17efb35e5503c17b
>
> If I've understood the JDK code correctly, the OOPs for which the size-call
> crashes are from situations where GC goes through some set of
> objects (let's call them BadObjects), marking all objects they refer to grey
> / copying them to survivor space.
>
> So we'll end up with something like this:
>
> class BadObject {
>     char* ptr;
> }
>
> where bad_object.ptr points to some garbled value.
>
> This raises at least the following hypotheses:
>
> 1. Some stage of garbage collection misses updating references in a
> BadObject. I don't know if G1 does that kind of pointer updating.
>
> 2. Some part of the software (native code, anything using Unsafe,
> miscompiled Java-code) garbles the pointer.
>
> For the first hypothesis, we've so far tried turning
> _hrm.verify_optional() and verify_region_sets_optional() in
> G1CollectedHeap::do_collection_pause_at_safepoint on in production,
> but they have not caught any irregularities.
>
> Could there be other causes? Are there any suggestions for next steps
> given how hard the reproduction is?
>
> We're unable to move to JDK9 and try reproduction there as we're
> running JRuby and it's not working at the moment with JDK9.
>
> Used JVM parameters are:
>
> -Xms3000G -Xmx3000G -XX:MaxPermSize=512m
> -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing
> -XX:MaxDirectMemorySize=20G -XX:AutoBoxCacheMax=8192
> -XX:MetaspaceSize=512M -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions
> -XX:G1NewSizePercent=1 -XX:G1MaxNewSizePercent=80
> -XX:G1MixedGCLiveThresholdPercent=90 -XX:G1HeapWastePercent=5
> -XX:G1MixedGCCountTarget=4 -XX:MaxGCPauseMillis=3000 -verbose:gc
> -XX:-PrintGCTimeStamps -XX:+PrintGCDateStamps
> -XX:+PrintTenuringDistribution -XX:G1ReservePercent=20
> -XX:SurvivorRatio=1 -XX:+UseGCOverheadLimit
> -XX:SoftRefLRUPolicyMSPerMB=10
> -Xloggc:/opt/apps/customer/shared/log/gc.log
> -XX:-HeapDumpOnOutOfMemoryError -Djruby.compile.invokedynamic=false
> -Djruby.ji.objectProxyCache=false
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From tequilaron at gmail.com Thu Nov 16 17:15:18 2017
From: tequilaron at gmail.com (Ron Reynolds)
Date: Thu, 16 Nov 2017 09:15:18 -0800
Subject: "--" at the beginning of an end-collection line and best region count when it's not 2048
Message-ID:

-- 59G->59G(59G), 2.0932801 secs]
^^ this "--" is entirely undocumented as far as i can tell; i believe it indicates that the young collection (or any collection?) was cancelled (i've always seen these in the last-ditch-attempt young-collections before a full-GC). is that correct?

also the various docs recommend 2048 regions but if you MUST have more or less which should you choose and what are reasonable limits? e.g., i have a 59G heap with default 16MB regions (which means there should be 3832 of them); in what cases (if any) would it make sense to use 32MB regions instead?

most docs say to size your regions based on max-heap size rather than starting-heap size, which would seem to indicate <2048 regions is better than >2048, since at the start that hypothetical JVM would certainly have fewer than 2048 regions until it grew into its max-heap size.

thanks :)
.....................ron.
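Ron's region arithmetic is easy to check mechanically. The following standalone sketch (an illustration, not code from the thread) reproduces the documented region-size ergonomic, roughly the heap divided by a target of 2048 regions, rounded down to a power of two and clamped to 1-32MB, together with G1's humongous threshold of half a region; the exact rounding in HotSpot's source may differ slightly:

    public class G1RegionMath {
        static final long MB = 1024 * 1024;

        // Documented ergonomic: aim for about 2048 regions, round the
        // result down to a power of two, clamp to [1MB, 32MB].
        static long defaultRegionSize(long heapBytes) {
            long size = Long.highestOneBit(heapBytes / 2048);
            return Math.max(1 * MB, Math.min(32 * MB, size));
        }

        public static void main(String[] args) {
            long heap = 64298680320L; // Ron's max heap size, about 59.9G
            for (long region : new long[] { defaultRegionSize(heap), 32 * MB }) {
                // Objects of at least half a region are humongous in G1.
                System.out.printf("region=%dM regions=%d humongous>=%dM%n",
                        region / MB, heap / region, region / (2 * MB));
            }
            // Prints: region=16M regions=3832 humongous>=8M
            //         region=32M regions=1916 humongous>=16M
            // Either way, the occasional 161M arrays mentioned later in
            // this thread remain humongous.
        }
    }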
From jamessun at fb.com Thu Nov 16 20:02:52 2017
From: jamessun at fb.com (James Sun)
Date: Thu, 16 Nov 2017 20:02:52 +0000
Subject: Large heap size, slow concurrent marking causing frequent full GC
Message-ID: <5ED18A0A-03D8-400F-957C-56BB9446FC4B@fb.com>

Dear

We observed frequent full GCs due to a long concurrent marking phase (about 30 seconds to a minute). The GC log with heap histogram during full GC is attached. The Java version we use is 8_144 with G1 GC. The machines have 56 cores and a heap size around 180-210GB.

Example concurrent mark duration:
2017-11-16T09:32:04.565-0800: 167543.159: [GC concurrent-mark-end, 45.7020802 secs]
2017-11-16T09:33:16.314-0800: 167614.908: [GC concurrent-mark-end, 51.0809053 secs]
2017-11-16T09:34:28.343-0800: 167686.938: [GC concurrent-mark-end, 48.7335047 secs]

Wonder if anyone could help in terms of:
1. How in general we can make concurrent marking faster. We bumped up ConcGCThreads to 20 but it didn't help that much.
2. We also turned on -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark but nothing related to marking shows up.
3. General advice on tuning GC in other aspects

Thanks in advance

James

Here is the JVM config we have

-Xss2048k
-XX:MaxMetaspaceSize=4G
-XX:+PreserveFramePointer
-XX:-UseBiasedLocking
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+UnlockExperimentalVMOptions
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseGCOverheadLimit
-XX:+ExitOnOutOfMemoryError
-agentpath:/packages/presto.presto/bin/libjvmkill.so
-agentpath:/packages/presto.presto/bin/libperfagent.so
-XX:+PrintReferenceGC
-XX:+PrintGCCause
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintClassHistogramAfterFullGC
-XX:+PrintClassHistogramBeforeFullGC
-XX:PrintFLSStatistics=2
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+PrintJNIGCStalls
-XX:+UnlockDiagnosticVMOptions
-XX:+AlwaysPreTouch
-XX:+G1SummarizeRSetStats
-XX:G1SummarizeRSetStatsPeriod=100
-Dorg.eclipse.jetty.io.SelectorManager.submitKeyUpdates=true
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=1G
-Djdk.nio.maxCachedBufferSize=30000000
-XX:G1MaxNewSizePercent=20
-XX:G1HeapRegionSize=32M
-Xms180G
-Xmx180G
-XX:MarkStackSize=64M
-XX:G1HeapWastePercent=2
-XX:ConcGCThreads=20
-XX:MaxGCPauseMillis=500
-XX:GCLockerRetryAllocationCount=5
-XX:MarkStackSizeMax=256M
-XX:G1OldCSetRegionThresholdPercent=20
-XX:InitiatingHeapOccupancyPercent=40
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: log.zip Type: application/zip Size: 48297 bytes Desc: log.zip URL:

From thomas.schatzl at oracle.com Thu Nov 16 21:38:55 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Thu, 16 Nov 2017 22:38:55 +0100
Subject: "--" at the beginning of an end-collection line and best region count when it's not 2048
In-Reply-To: References: Message-ID: <1510868335.2982.1.camel@oracle.com>

Hi,

On Thu, 2017-11-16 at 09:15 -0800, Ron Reynolds wrote:
> -- 59G->59G(59G), 2.0932801 secs]
> ^^ this "--" is entirely undocumented as far as i can tell; i believe
> it indicates that the young collection (or any collection?) was
> cancelled (i've always seen these in the last-ditch-attempt young-
> collections before a full-GC). is that correct?

I do not remember ever seeing that output, can you tell which VM and which GC you are using?
And maybe just a bit more of your log?

Potentially this is some leftover from some other log message printed out concurrently, i.e. the log messages got intermixed.

> also the various docs recommend 2048 regions but if you MUST have
> more or less which should you choose and what are reasonable limits?
> e.g., i have a 59G heap with default 16MB regions (which means there
> should be 3832 of them); in what cases (if any) would it make sense
> to use 32MB regions instead?

If you have lots of humongous (large) objects, or to potentially decrease remembered set size (memory usage) and Scan RS time, use larger regions. Another reason to increase heap region size could be that 32M regions allow larger TLABs (for applications that really burn through memory).

In most cases it does not matter.

See https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm for some discussion. While this is for jdk9, almost all of it applies to earlier versions too.

> most docs say to size your regions based on max-heap size rather than
> starting-heap size, which would seem to indicate <2048 regions is

This changed in some 8 update or 9. Now defaults base the number of regions on both initial and max size.

> better than >2048, since at the start that hypothetical JVM would

+-10% (or even more) generally do not make a difference. There are two boundaries: since regions are units of evacuation, if the region is too large the selection might impact the flexibility in selecting which regions to evacuate (i.e. if any single region G1 may evacuate blows the pause time goal).

I *think* 2048 is just a random number far larger than maybe a (low) few hundred where G1 might get into trouble with finding consecutive free regions for large objects easily. However at the same time with larger regions there will be fewer objects that qualify as large, so it depends.

G1 needs at least one eden, one survivor and one old region to operate.

> certainly have fewer than 2048 regions until it grew into its max-
> heap size.

Thanks,
Thomas

From tequilaron at gmail.com Thu Nov 16 22:08:38 2017
From: tequilaron at gmail.com (Ron Reynolds)
Date: Thu, 16 Nov 2017 14:08:38 -0800
Subject: "--" at the beginning of an end-collection line and best region count when it's not 2048
In-Reply-To: <1510868335.2982.1.camel@oracle.com>
References: <1510868335.2982.1.camel@oracle.com>
Message-ID:

JVM - Java HotSpot(TM) 64-Bit Server VM (25.131-b11) for linux-amd64 JRE (1.8.0_131-b11)

the lines are intermixed because we have -XX:+PrintAdaptiveSizePolicy and the G1Ergo lines get logged between the start-pause and end-pause lines. e.g.
2017-11-15T22:46:45.141+0000: [GC pause (G1 Evacuation Pause) (young) 167392.984: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 1876696, predicted base time: 435.30 ms, remaining time: 0.00 ms, target pause time: 200.00 ms] 167392.984: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 41 regions, survivors: 10 regions, predicted young region time: 64.16 ms] 167392.984: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 41 regions, survivors: 10 regions, old: 0 regions, predicted pause time: 499.46 ms, target pause time: 200.00 ms] 167392.985: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: region allocation request failed, allocation request: 8388600 bytes] 167392.985: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 8388600 bytes, attempted expansion amount: 16777216 bytes] 167392.985: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded] -- 59G->59G(59G), 2.0932801 secs] 167395.085: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 8216 bytes] 167395.085: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 16777216 bytes, attempted expansion amount: 16777216 bytes] 167395.085: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded] 2017-11-15T22:46:47.242+0000: [Full GC (Allocation Failure) ...class-histogram here... 59G->5071M(59G), 32.4098427 secs] we do have a few humongous objects (typically one every few hours; rarely more often than once per hour). some of them are quite humongous (the largest i've seen so far is 161M) so increasing the region size won't solve them all but it will reduce the count a bit. the amazing thing (to me) is how much heap is recovered when the Full-GC occurs; still working out how things could have gotten so sideways (i.e., so many dead objects laying around not being cleaned up). wondering if 200ms still isn't enough time (it was 100ms before)... thanks for all your help thus far. :) On Thu, Nov 16, 2017 at 1:38 PM, Thomas Schatzl wrote: > Hi, > > On Thu, 2017-11-16 at 09:15 -0800, Ron Reynolds wrote: > > -- 59G->59G(59G), 2.0932801 secs] > > ^^ this "--" is entirely undocumented as far as i can tell; i believe > > it indicates that the young collection (or any collection?) was > > cancelled (i've always seen these in the last-ditch-attempt young- > > collections before a full-GC). is that correct? > > I do not remember ever seeing that output, can you tell which VM and > which GC you are using? And maybe just a bit more of your log? > > Potentially this is some leftover from some other log message printed > out concurrently, i.e. the log messages got intermixed. > > > also the various docs recommend 2048 regions but if you MUST have > > more or less which should you choose and what are reasonable limits? > > e.g., i have a 59G heap with default 16MB regions (which means there > > should be 3832 of them); in what cases (if any) would it make sense > > to use 32MB regions instead? > > If you have lots of humongous (large) objects, or as an option to > potentially decrease remembered set size (memory usage) and scan rs > time use larger regions. Another reason to increase heap region size > could be that 32M regions allow larger TLABs (for applications that > really burn through memory). > > In most cases it does not matter. 
> See https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm
> for some discussion. While this is for jdk9, almost
> all of it applies to earlier versions too.
>
> > most docs say to size your regions based on max-heap size rather than
> > starting-heap size, which would seem to indicate <2048 regions is
>
> This changed in some 8 update or 9. Now defaults base the number of
> regions on both initial and max size.
>
> > better than >2048, since at the start that hypothetical JVM would
>
> +-10% (or even more) generally do not make a difference. There are two
> boundaries: since regions are units of evacuation, if the region is too
> large the selection might impact the flexibility in selecting which
> regions to evacuate (i.e. if any single region G1 may evacuate blows
> the pause time goal).
>
> I *think* 2048 is just a random number far larger than maybe a (low)
> few hundred where G1 might get into trouble with finding consecutive
> free regions for large objects easily. However at the same time with
> larger regions there will be fewer objects that qualify as large, so it
> depends.
>
> G1 needs at least one eden, one survivor and one old region to operate.
>
> > certainly have fewer than 2048 regions until it grew into its max-
> > heap size.
>
> Thanks,
> Thomas
>
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From thomas.schatzl at oracle.com Thu Nov 16 23:12:41 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Fri, 17 Nov 2017 00:12:41 +0100
Subject: "--" at the beginning of an end-collection line and best region count when it's not 2048
In-Reply-To: References: <1510868335.2982.1.camel@oracle.com>
Message-ID: <1510873961.2982.3.camel@oracle.com>

Hi,

On Thu, 2017-11-16 at 14:08 -0800, Ron Reynolds wrote:
> JVM - Java HotSpot(TM) 64-Bit Server VM (25.131-b11) for linux-amd64
> JRE (1.8.0_131-b11)

please at least try to update to latest jdk8u to have all fixes in. Strongly consider trying jdk9, because...

> the lines are intermixed because we have -XX:+PrintAdaptiveSizePolicy
> and the G1Ergo lines get logged between the start-pause and end-pause
> lines. e.g.
> 2017-11-15T22:46:45.141+0000: [GC pause (G1 Evacuation Pause) (young)
> 167392.984: [G1Ergonomics (CSet Construction) start choosing CSet,
> _pending_cards: 1876696, predicted base time: 435.30 ms, remaining
> time: 0.00 ms, target pause time: 200.00 ms]
> 167392.984: [G1Ergonomics (CSet Construction) add young regions to
> CSet, eden: 41 regions, survivors: 10 regions, predicted young region
> time: 64.16 ms]
> 167392.984: [G1Ergonomics (CSet Construction) finish choosing CSet,
> eden: 41 regions, survivors: 10 regions, old: 0 regions, predicted
> pause time: 499.46 ms, target pause time: 200.00 ms]
> 167392.985: [G1Ergonomics (Heap Sizing) attempt heap expansion,
> reason: region allocation request failed, allocation request: 8388600
> bytes]
> 167392.985: [G1Ergonomics (Heap Sizing) expand the heap, requested
> expansion amount: 8388600 bytes, attempted expansion amount: 16777216
> bytes]
> 167392.985: [G1Ergonomics (Heap Sizing) did not expand the heap,
> reason: heap already fully expanded]
> -- 59G->59G(59G), 2.0932801 secs]
>
> 167395.085: [G1Ergonomics (Heap Sizing) attempt heap expansion,
> reason: allocation request failed, allocation request: 8216 bytes]
> 167395.085: [G1Ergonomics (Heap Sizing) expand the heap, requested
> expansion amount: 16777216 bytes, attempted expansion amount:
> 16777216 bytes]
> 167395.085: [G1Ergonomics (Heap Sizing) did not expand the heap,
> reason: heap already fully expanded]
>
> 2017-11-15T22:46:47.242+0000: [Full GC (Allocation Failure)
> ...class-histogram here...
> 59G->5071M(59G), 32.4098427 secs]
>
> we do have a few humongous objects (typically one every few hours;
> rarely more often than once per hour). some of them are quite
> humongous (the largest i've seen so far is 161M) so increasing the
> region size won't solve them all but it will reduce the count a
> bit.

Don't bother if you did not see an issue and these objects are few.

> the amazing thing (to me) is how much heap is recovered when the
> Full-GC occurs; still working out how things could have gotten so
> sideways (i.e., so many dead objects laying around not being cleaned
> up). wondering if 200ms still isn't enough time (it was 100ms
> before)...

... because marking apparently could not keep up with your allocation rate, which at first has nothing to do with the pause time. Because if G1 can't finish determining liveness, it won't start recovering space in the old gen.

Can you check for "GC concurrent-mark-*" messages, maybe you are seeing "GC concurrent-mark-reset-for-overflow" ones. If so, increase MarkStackSize (default 4M to something much higher, like 512M; there is no way to really determine what value you need in jdk8, just that these reset-for-overflow hopefully go away) - or update to jdk9. It often needs much less memory.

It would also be nice to see the time marking takes (the value printed with the "GC concurrent-mark-end" message) to see whether the times are long and/or vary a lot. JDK9 would also be much better with marking, faster and much more scalable; specifically see the presentation "JDK9 The quest for very large heaps" from JavaOne 2016 about changes to marking that allow really good performance [0].

Other than that hunch about MarkStackSize it is hard to tell what's going on without some larger part of the logs and options used (maybe starting marking too late because of a too high InitiatingHeapOccupancyPercent setting).
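A quick way to check a log for the overflow symptom described above is a throwaway scanner along these lines (a sketch for illustration, not a tool from this thread; the two log strings are copied verbatim from the messages above):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MarkLogScan {
        // Matches lines like "[GC concurrent-mark-end, 45.7020802 secs]".
        private static final Pattern MARK_END =
                Pattern.compile("GC concurrent-mark-end, ([0-9.]+) secs");

        public static void main(String[] args) throws IOException {
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                if (line.contains("concurrent-mark-reset-for-overflow")) {
                    // Marking had to restart; MarkStackSize was too small.
                    System.out.println("overflow: " + line);
                }
                Matcher m = MARK_END.matcher(line);
                if (m.find()) {
                    System.out.println("mark duration: " + m.group(1) + " secs");
                }
            }
        }
    }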
Thanks,
Thomas

[0] https://www.youtube.com/watch?v=LppgqvKOUKs at 32:04; quoting from the video: "marking runs 50x faster". It also shows other jdk9 improvements particularly applicable to running larger heaps.

From tequilaron at gmail.com Thu Nov 16 23:40:24 2017
From: tequilaron at gmail.com (Ron Reynolds)
Date: Thu, 16 Nov 2017 15:40:24 -0800
Subject: "--" at the beginning of an end-collection line and best region count when it's not 2048
In-Reply-To: <1510873961.2982.3.camel@oracle.com>
References: <1510868335.2982.1.camel@oracle.com> <1510873961.2982.3.camel@oracle.com>
Message-ID:

yes, 2 of them (apparently it was so bad it had to be logged twice):

2017-11-14T12:15:44.868+0000: [GC concurrent-root-region-scan-start]
2017-11-14T12:15:44.976+0000: [GC concurrent-root-region-scan-end, 0.1078258 secs]
2017-11-14T12:15:44.976+0000: [GC concurrent-mark-start]
2017-11-14T12:15:46.812+0000: [GC concurrent-mark-reset-for-overflow]
2017-11-14T12:15:51.555+0000: [GC concurrent-mark-reset-for-overflow]

getting Java9 into production will take a lot of effort but i'll start putting that idea into the proper heads...

during the almost 2 days in the GC log i'm currently working with the GC concurrent-mark-end times start at ~2ms then, after running for 6 hours it starts jumping around a lot - range is roughly 0.5 seconds to 10.1 seconds.

we have minimum cmd-line args (well, "minimum" is relative):
-XX:InitialHeapSize=64298680320 -XX:MaxHeapSize=64298680320 -XX:+UseG1GC
-XX:MaxGCPauseMillis=200 -XX:-OmitStackTraceInFastThrow
-XX:+PrintAdaptiveSizePolicy -XX:+PrintClassHistogramAfterFullGC
-XX:+PrintClassHistogramBeforeFullGC
-XX:GCLogFileSize=10485760 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC
-XX:+PrintGCDateStamps -XX:-PrintGCDetails -XX:-PrintGCTimeStamps
-XX:+UseGCLogFileRotation

these boxes only have 8 cores, btw.

i'll get the -XX:MarkStackSize arg out there ASAP. it's not even mentioned in the "Java Perf Companion" or Scott Oaks book - and i thought Java had an insane number of undocumented options before... thanks! (do you really think it needs to be 512M? that appears to be the default max possible value...)

On Thu, Nov 16, 2017 at 3:12 PM, Thomas Schatzl wrote:
> Hi,
>
> On Thu, 2017-11-16 at 14:08 -0800, Ron Reynolds wrote:
> > JVM - Java HotSpot(TM) 64-Bit Server VM (25.131-b11) for linux-amd64
> > JRE (1.8.0_131-b11)
>
> please at least try to update to latest jdk8u to have all fixes in.
> Strongly consider trying jdk9, because...
>
> > the lines are intermixed because we have -XX:+PrintAdaptiveSizePolicy
> > and the G1Ergo lines get logged between the start-pause and end-pause
> > lines. e.g.
> > > > 2017-11-15T22:46:45.141+0000: [GC pause (G1 Evacuation Pause) (young) > > 167392.984: [G1Ergonomics (CSet Construction) start choosing CSet, > > _pending_cards: 1876696, predicted base time: 435.30 ms, remaining > > time: 0.00 ms, target pause time: 200.00 ms] > > 167392.984: [G1Ergonomics (CSet Construction) add young regions to > > CSet, eden: 41 regions, survivors: 10 regions, predicted young region > > time: 64.16 ms] > > 167392.984: [G1Ergonomics (CSet Construction) finish choosing CSet, > > eden: 41 regions, survivors: 10 regions, old: 0 regions, predicted > > pause time: 499.46 ms, target pause time: 200.00 ms] > > 167392.985: [G1Ergonomics (Heap Sizing) attempt heap expansion, > > reason: region allocation request failed, allocation request: 8388600 > > bytes] > > 167392.985: [G1Ergonomics (Heap Sizing) expand the heap, requested > > expansion amount: 8388600 bytes, attempted expansion amount: 16777216 > > bytes] > > 167392.985: [G1Ergonomics (Heap Sizing) did not expand the heap, > > reason: heap already fully expanded] > > -- 59G->59G(59G), 2.0932801 secs] > > > > 167395.085: [G1Ergonomics (Heap Sizing) attempt heap expansion, > > reason: allocation request failed, allocation request: 8216 bytes] > > 167395.085: [G1Ergonomics (Heap Sizing) expand the heap, requested > > expansion amount: 16777216 bytes, attempted expansion amount: > > 16777216 bytes] > > 167395.085: [G1Ergonomics (Heap Sizing) did not expand the heap, > > reason: heap already fully expanded] > > > > 2017-11-15T22:46:47.242+0000: [Full GC (Allocation Failure) > > ...class-histogram here... > > 59G->5071M(59G), 32.4098427 secs] > > > > we do have a few humongous objects (typically one every few hours; > > rarely more often than once per hour). some of them are quite > > humongous (the largest i've seen so far is 161M) so increasing the > > region size won't solve them all but it will reduce the count a > > bit. > > Don't bother if you did not see an issue and these objects are few. > > > the amazing thing (to me) is how much heap is recovered when the > > Full-GC occurs; still working out how things could have gotten so > > sideways (i.e., so many dead objects laying around not being cleaned > > up). wondering if 200ms still isn't enough time (it was 100ms > > before)... > > ... because marking apparently could not keep up with your allocation > rate, which at first has nothing to do with the pause time. Because if > G1 can't finish determining liveness, it won't start recovering space > in the old gen. > > Can you check for "GC concurrent-mark-*" messages, maybe you are seeing > "GC concurrent-mark-reset-for-overflow" ones. If so, increase > MarkStackSize (default 4M to something much higher, like 512M; there is > no way to really determine what value you need in jdk8, just that these > reset-for-overflow hopefully go away) - or update to jdk9. It often > needs much less memory. > > It would also be nice to see the time marking takes (the value printed > with the "GC concurrent-mark-end" message) to see whether they are > long/or there are large variations. JDK9 would also be much better with > marking, faster and much more scalable; specifically see the > presentation "JDK9 The quest for very large heaps" from JavaOne 2016 > about changes to marking that allow really good performance [0]. 
> Other than that hunch about MarkStackSize it is hard to tell what's
> going on without some larger part of the logs and options used (maybe
> starting marking too late because of a too high
> InitiatingHeapOccupancyPercent setting).
>
> Thanks,
> Thomas
>
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From thomas.schatzl at oracle.com Fri Nov 17 07:18:29 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Fri, 17 Nov 2017 08:18:29 +0100
Subject: Concurrent-mark-reset-for-overflow [Was: Re: "--" at the beginning of an end-collection line and best region count when it's not 2048]
In-Reply-To: References: <1510868335.2982.1.camel@oracle.com> <1510873961.2982.3.camel@oracle.com>
Message-ID: <1510903109.2671.2.camel@oracle.com>

Hi,

On Thu, 2017-11-16 at 15:40 -0800, Ron Reynolds wrote:
> yes, 2 of them (apparently it was so bad it had to be logged twice):
>
> 2017-11-14T12:15:44.868+0000: [GC concurrent-root-region-scan-start]
> 2017-11-14T12:15:44.976+0000: [GC concurrent-root-region-scan-end,
> 0.1078258 secs]
> 2017-11-14T12:15:44.976+0000: [GC concurrent-mark-start]
> 2017-11-14T12:15:46.812+0000: [GC concurrent-mark-reset-for-overflow]
> 2017-11-14T12:15:51.555+0000: [GC concurrent-mark-reset-for-overflow]
>
> getting Java9 into production will take a lot of effort but i'll
> start putting that idea into the proper heads...

Unfortunately beyond suggesting JDK9 (or later) I do not see a good fix with jdk8u.

> during the almost 2 days in the GC log i'm currently working with
> the GC concurrent-mark-end times start at ~2ms then, after running
> for 6 hours it starts jumping around a lot - range is roughly 0.5
> seconds to 10.1 seconds.
>
> we have minimum cmd-line args (well, "minimum" is relative):
> -XX:InitialHeapSize=64298680320 -XX:MaxHeapSize=64298680320
> -XX:+UseG1GC
> -XX:MaxGCPauseMillis=200 -XX:-OmitStackTraceInFastThrow
> -XX:+PrintAdaptiveSizePolicy -XX:+PrintClassHistogramAfterFullGC
> -XX:+PrintClassHistogramBeforeFullGC
> -XX:GCLogFileSize=10485760 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC
> -XX:+PrintGCDateStamps -XX:-PrintGCDetails -XX:-PrintGCTimeStamps
> -XX:+UseGCLogFileRotation
>
> these boxes only have 8 cores, btw.

Which means that marking uses 2 threads by default. Let's see if that works out first, otherwise increase a bit with -XX:ConcGCThreads=; that may not really help with jdk8 though, see that video link posted earlier. And of course there are only 8 cores available anyway.

> i'll get the -XX:MarkStackSize arg out there ASAP. it's not even
> mentioned in the "Java Perf Companion" or Scott Oaks book - and i
> thought Java had an insane number of undocumented options before...
> thanks! (do you really think it needs to be 512M? that appears to
> be the default max possible value...)

I just proposed a value that should work. As mentioned, you do not know whether it worked until it is too late. I tend to provide somewhat conservative options too, to see if it works. You can always decrease it later. Note that nowadays operating systems only really use memory for areas that have been in use, so automatically only the minimum value should be occupied in physical memory.
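The "2 threads by default" figure above follows from the VM's ergonomics. As a small sketch of the commonly cited jdk8-era formulas (an approximation for illustration; the exact rounding lives in HotSpot's source and can differ between versions):

    public class ConcThreadDefaults {
        // ParallelGCThreads heuristic: all CPUs up to 8, then 8 plus
        // five eighths of the remaining CPUs.
        static int parallelGCThreads(int cpus) {
            return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
        }

        // G1's ConcGCThreads default: (ParallelGCThreads + 2) / 4, at least 1.
        static int concGCThreads(int cpus) {
            return Math.max(1, (parallelGCThreads(cpus) + 2) / 4);
        }

        public static void main(String[] args) {
            for (int cpus : new int[] { 8, 56 }) {
                System.out.printf("cpus=%d parallel=%d conc=%d%n",
                        cpus, parallelGCThreads(cpus), concGCThreads(cpus));
            }
            // cpus=8  -> parallel=8  conc=2  (Ron's boxes)
            // cpus=56 -> parallel=38 conc=10 (James's default before
            //            he set -XX:ConcGCThreads=20)
        }
    }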
Thanks,
Thomas

From thomas.schatzl at oracle.com Fri Nov 17 07:44:58 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Fri, 17 Nov 2017 08:44:58 +0100
Subject: definition of a "TLAB refill"?
In-Reply-To: References: <1510604191.7132.1.camel@oracle.com>
Message-ID: <1510904698.2671.7.camel@oracle.com>

Hi,

On Mon, 2017-11-13 at 12:51 -0800, Ron Reynolds wrote:
> that's what i suspected it was but wasn't sure, even after reading
> thru one of the best docs i could find on TLABs - a 2006 blog posting
> (obviously pre-G1) at https://blogs.oracle.com/jonthecollector/the-real-thing
> i think it would be a great addition to
> http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html
> the rest of the best cover of TLABs i've found is "Java Perf" by Oaks
> and "Java Perf Companion" by Hunt et al.
> the rest of the online docs i've hunted thru (for both G1 in general
> and TLAB in particular) are all over the place; anywhere where
> someone is explaining the output of -XX:+PrintTLAB would make sense
> to me.

I filed JDK-8191472 [0] to update the GC documentation [1]. The referenced blog entry [2] with example description is still valid.

Thanks,
Thomas

[0] https://bugs.openjdk.java.net/browse/JDK-8191472
[1] https://docs.oracle.com/javase/9/gctuning/toc.htm
[2] https://blogs.oracle.com/jonthecollector/the-real-thing

> On Mon, Nov 13, 2017 at 12:16 PM, Thomas Schatzl wrote:
> > Hi,
> >
> > On Mon, 2017-11-13 at 11:12 -0800, Ron Reynolds wrote:
> > > sorry for the n00b or repeat question (if there is a good way to
> > > search the list archives please send me a URL) but i am trying to
> > > find a solid definition of a "TLAB refill" - most TLAB docs refer
> > > to the "refill" but don't actually define what it means.
> > > thanks.
> > > ....................ron.
> >
> > a TLAB refill is the process of a (Java) thread, having filled
> > its current TLAB with objects (or having some reason to give up the
> > current one), getting a new, empty one for further allocation.
> >
> > Where do you think in the (JDK) docs could be an appropriate place
> > to add a few sentences of explanation?
> >
> > Thanks,
> > Thomas

From thomas.schatzl at oracle.com Fri Nov 17 07:56:26 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Fri, 17 Nov 2017 08:56:26 +0100
Subject: Large heap size, slow concurrent marking causing frequent full GC
In-Reply-To: <5ED18A0A-03D8-400F-957C-56BB9446FC4B@fb.com>
References: <5ED18A0A-03D8-400F-957C-56BB9446FC4B@fb.com>
Message-ID: <1510905386.2671.8.camel@oracle.com>

Hi James,

On Thu, 2017-11-16 at 20:02 +0000, James Sun wrote:
> Dear
>
> We observed frequent full GCs due to a long concurrent marking phase
> (about 30 seconds to a minute). The GC log with heap histogram during
> full GC is attached.
> The Java version we use is 8_144 with G1 GC. The machines have
> 56 cores and a heap size around 180-210GB.
>
> Example concurrent mark duration:
> 2017-11-16T09:32:04.565-0800: 167543.159: [GC concurrent-mark-end,
> 45.7020802 secs]
> 2017-11-16T09:33:16.314-0800: 167614.908: [GC concurrent-mark-end,
> 51.0809053 secs]
> 2017-11-16T09:34:28.343-0800: 167686.938: [GC concurrent-mark-end,
> 48.7335047 secs]
>
> Wonder if anyone could help in terms of:
> How in general we can make concurrent marking faster. We bumped up
> ConcGCThreads to 20 but it didn't help that much.

There are known performance and scalability issues with JDK8(u).
See the reason in some JavaOne 2016 talk about scaling G1 for huge heaps [0]. Unfortunately it seems you covered all other options already (bumping mark stack size to avoid overflows, increasing the number of threads).

> We also turned on -XX:+UnlockDiagnosticVMOptions
> -XX:+G1SummarizeConcMark but nothing related to marking shows up.
> General advice in tuning GC in other aspects

Apart from the full gc, and the mixed gcs I am discussing below, is there anything else of concern?

> Thanks in advance
>
> James
>
> Here is the JVM config we have
>
> -Xss2048k
> -XX:MaxMetaspaceSize=4G
> -XX:+PreserveFramePointer
> -XX:-UseBiasedLocking
> -XX:+PrintGCApplicationConcurrentTime
> -XX:+PrintGCApplicationStoppedTime
> -XX:+UnlockExperimentalVMOptions
> -XX:+UseG1GC
> -XX:+ExplicitGCInvokesConcurrent
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+UseGCOverheadLimit
> -XX:+ExitOnOutOfMemoryError
> -agentpath:/packages/presto.presto/bin/libjvmkill.so
> -agentpath:/packages/presto.presto/bin/libperfagent.so
> -XX:+PrintReferenceGC
> -XX:+PrintGCCause
> -XX:+PrintGCDateStamps
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDetails
> -XX:+PrintClassHistogramAfterFullGC
> -XX:+PrintClassHistogramBeforeFullGC
> -XX:PrintFLSStatistics=2

This one is CMS specific. Can remove.

> -XX:+PrintAdaptiveSizePolicy
> -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1
> -XX:+PrintJNIGCStalls
> -XX:+UnlockDiagnosticVMOptions
> -XX:+AlwaysPreTouch
> -XX:+G1SummarizeRSetStats
> -XX:G1SummarizeRSetStatsPeriod=100
> -Dorg.eclipse.jetty.io.SelectorManager.submitKeyUpdates=true
> -XX:-OmitStackTraceInFastThrow
> -XX:ReservedCodeCacheSize=1G
> -Djdk.nio.maxCachedBufferSize=30000000
> -XX:G1MaxNewSizePercent=20
> -XX:G1HeapRegionSize=32M

At that heap size, the region size would be 32M anyway, may remove.

> -Xms180G
> -Xmx180G
> -XX:MarkStackSize=64M
> -XX:G1HeapWastePercent=2

That is very aggressive. That causes the long mixed gcs at the end of an old gen space reclamation phase. Either increase that to 5 to cut off the "long" mixed gcs (still mostly within your 500ms pause time goal), or increase G1MixedGCCountTarget to something like 16 to spread out the work (you are observing the "added expensive regions to CSet" message). See the documentation [1] for more information.

> -XX:ConcGCThreads=20
> -XX:MaxGCPauseMillis=500
> -XX:GCLockerRetryAllocationCount=5
> -XX:MarkStackSizeMax=256M
> -XX:G1OldCSetRegionThresholdPercent=20
> -XX:InitiatingHeapOccupancyPercent=40

Thanks,
Thomas

[0] https://www.youtube.com/watch?v=LppgqvKOUKs at 32:04; quoting from the video: "marking runs 50x faster". It also shows other jdk9 improvements particularly applicable to running larger heaps.
[1] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#GUID-D2B6ADCE-6766-4FF8-AA9D-B7F4F3D0F469

From jamessun at fb.com Fri Nov 17 16:08:05 2017
From: jamessun at fb.com (James Sun)
Date: Fri, 17 Nov 2017 16:08:05 +0000
Subject: Large heap size, slow concurrent marking causing frequent full GC
In-Reply-To: <1510905386.2671.8.camel@oracle.com>
References: <5ED18A0A-03D8-400F-957C-56BB9446FC4B@fb.com> <1510905386.2671.8.camel@oracle.com>
Message-ID:

Hi Thomas

Thanks for the help! The video is super helpful. JDK9 is definitely something we are going to gradually adopt. Also, it is good to know we didn't miss much in terms of tuning.
Thanks

James

On 11/16/17, 11:56 PM, "Thomas Schatzl" wrote:

Hi James,

On Thu, 2017-11-16 at 20:02 +0000, James Sun wrote:
> Dear
>
> We observed frequent full GCs due to a long concurrent marking phase
> (about 30 seconds to a minute). The GC log with heap histogram during
> full GC is attached.
> The Java version we use is 8_144 with G1 GC. The machines have
> 56 cores and a heap size around 180-210GB.
>
> Example concurrent mark duration:
> 2017-11-16T09:32:04.565-0800: 167543.159: [GC concurrent-mark-end,
> 45.7020802 secs]
> 2017-11-16T09:33:16.314-0800: 167614.908: [GC concurrent-mark-end,
> 51.0809053 secs]
> 2017-11-16T09:34:28.343-0800: 167686.938: [GC concurrent-mark-end,
> 48.7335047 secs]
>
> Wonder if anyone could help in terms of:
> How in general we can make concurrent marking faster. We bumped up
> ConcGCThreads to 20 but it didn't help that much.

There are known performance and scalability issues with JDK8(u). See the reason in some JavaOne 2016 talk about scaling G1 for huge heaps [0]. Unfortunately it seems you covered all other options already (bumping mark stack size to avoid overflows, increasing the number of threads).

> We also turned on -XX:+UnlockDiagnosticVMOptions
> -XX:+G1SummarizeConcMark but nothing related to marking shows up.
> General advice in tuning GC in other aspects

Apart from the full gc, and the mixed gcs I am discussing below, is there anything else of concern?

> Thanks in advance
>
> James
>
> Here is the JVM config we have
>
> -Xss2048k
> -XX:MaxMetaspaceSize=4G
> -XX:+PreserveFramePointer
> -XX:-UseBiasedLocking
> -XX:+PrintGCApplicationConcurrentTime
> -XX:+PrintGCApplicationStoppedTime
> -XX:+UnlockExperimentalVMOptions
> -XX:+UseG1GC
> -XX:+ExplicitGCInvokesConcurrent
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+UseGCOverheadLimit
> -XX:+ExitOnOutOfMemoryError
> -agentpath:/packages/presto.presto/bin/libjvmkill.so
> -agentpath:/packages/presto.presto/bin/libperfagent.so
> -XX:+PrintReferenceGC
> -XX:+PrintGCCause
> -XX:+PrintGCDateStamps
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDetails
> -XX:+PrintClassHistogramAfterFullGC
> -XX:+PrintClassHistogramBeforeFullGC
> -XX:PrintFLSStatistics=2

This one is CMS specific. Can remove.

> -XX:+PrintAdaptiveSizePolicy
> -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1
> -XX:+PrintJNIGCStalls
> -XX:+UnlockDiagnosticVMOptions
> -XX:+AlwaysPreTouch
> -XX:+G1SummarizeRSetStats
> -XX:G1SummarizeRSetStatsPeriod=100
> -Dorg.eclipse.jetty.io.SelectorManager.submitKeyUpdates=true
> -XX:-OmitStackTraceInFastThrow
> -XX:ReservedCodeCacheSize=1G
> -Djdk.nio.maxCachedBufferSize=30000000
> -XX:G1MaxNewSizePercent=20
> -XX:G1HeapRegionSize=32M

At that heap size, the region size would be 32M anyway, may remove.

> -Xms180G
> -Xmx180G
> -XX:MarkStackSize=64M
> -XX:G1HeapWastePercent=2

That is very aggressive. That causes the long mixed gcs at the end of an old gen space reclamation phase. Either increase that to 5 to cut off the "long" mixed gcs (still mostly within your 500ms pause time goal), or increase G1MixedGCCountTarget to something like 16 to spread out the work (you are observing the "added expensive regions to CSet" message). See the documentation [1] for more information.

> -XX:ConcGCThreads=20
> -XX:MaxGCPauseMillis=500
> -XX:GCLockerRetryAllocationCount=5
> -XX:MarkStackSizeMax=256M
> -XX:G1OldCSetRegionThresholdPercent=20
> -XX:InitiatingHeapOccupancyPercent=40

Thanks,
Thomas

[0] https://www.youtube.com/watch?v=LppgqvKOUKs at 32:04; quoting from the video: "marking runs 50x faster".
It also shows other jdk9 improvements particularly applicable to running larger heaps.

[1] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#GUID-D2B6ADCE-6766-4FF8-AA9D-B7F4F3D0F469

From timur.akhmadeev at gmail.com Sat Nov 18 05:49:58 2017
From: timur.akhmadeev at gmail.com (Timur Akhmadeev)
Date: Sat, 18 Nov 2017 05:49:58 +0000
Subject: Large heap size, slow concurrent marking causing frequent full GC
In-Reply-To: References: <5ED18A0A-03D8-400F-957C-56BB9446FC4B@fb.com> <1510905386.2671.8.camel@oracle.com>
Message-ID:

Hi,

You should also try using huge pages. With large heaps it's a must thing to do I think.

On Fri, 17 Nov 2017 at 21:24, James Sun wrote:
> Hi Thomas
>
> Thanks for the help! The video is super helpful. JDK9 is definitely
> something we are going to gradually adopt. Also, it is good to know we didn't
> miss much in terms of tuning.
>
> Thanks
>
> James
>
> On 11/16/17, 11:56 PM, "Thomas Schatzl" wrote:
>
> Hi James,
>
> On Thu, 2017-11-16 at 20:02 +0000, James Sun wrote:
> > Dear
> >
> > We observed frequent full GCs due to a long concurrent marking phase
> > (about 30 seconds to a minute). The GC log with heap histogram during
> > full GC is attached.
> > The Java version we use is 8_144 with G1 GC. The machines have
> > 56 cores and a heap size around 180-210GB.
> >
> > Example concurrent mark duration:
> > 2017-11-16T09:32:04.565-0800: 167543.159: [GC concurrent-mark-end,
> > 45.7020802 secs]
> > 2017-11-16T09:33:16.314-0800: 167614.908: [GC concurrent-mark-end,
> > 51.0809053 secs]
> > 2017-11-16T09:34:28.343-0800: 167686.938: [GC concurrent-mark-end,
> > 48.7335047 secs]
> >
> > Wonder if anyone could help in terms of:
> > How in general we can make concurrent marking faster. We bumped up
> > ConcGCThreads to 20 but it didn't help that much.
>
> There are known performance and scalability issues with JDK8(u). See
> the reason in some JavaOne 2016 talk about scaling G1 for huge heaps
> [0]. Unfortunately it seems you covered all other options already
> (bumping mark stack size to avoid overflows, increasing the number of
> threads).
>
> > We also turned on -XX:+UnlockDiagnosticVMOptions
> > -XX:+G1SummarizeConcMark but nothing related to marking shows up.
> > General advice in tuning GC in other aspects
>
> Apart from the full gc, and the mixed gcs I am discussing below, is
> there anything else of concern?
>
> > Thanks in advance
> >
> > James
> >
> > Here is the JVM config we have
> >
> > -Xss2048k
> > -XX:MaxMetaspaceSize=4G
> > -XX:+PreserveFramePointer
> > -XX:-UseBiasedLocking
> > -XX:+PrintGCApplicationConcurrentTime
> > -XX:+PrintGCApplicationStoppedTime
> > -XX:+UnlockExperimentalVMOptions
> > -XX:+UseG1GC
> > -XX:+ExplicitGCInvokesConcurrent
> > -XX:+HeapDumpOnOutOfMemoryError
> > -XX:+UseGCOverheadLimit
> > -XX:+ExitOnOutOfMemoryError
> > -agentpath:/packages/presto.presto/bin/libjvmkill.so
> > -agentpath:/packages/presto.presto/bin/libperfagent.so
> > -XX:+PrintReferenceGC
> > -XX:+PrintGCCause
> > -XX:+PrintGCDateStamps
> > -XX:+PrintGCTimeStamps
> > -XX:+PrintGCDetails
> > -XX:+PrintClassHistogramAfterFullGC
> > -XX:+PrintClassHistogramBeforeFullGC
> > -XX:PrintFLSStatistics=2
>
> This one is CMS specific. Can remove.
> > -XX:+PrintAdaptiveSizePolicy
> > -XX:+PrintSafepointStatistics
> > -XX:PrintSafepointStatisticsCount=1
> > -XX:+PrintJNIGCStalls
> > -XX:+UnlockDiagnosticVMOptions
> > -XX:+AlwaysPreTouch
> > -XX:+G1SummarizeRSetStats
> > -XX:G1SummarizeRSetStatsPeriod=100
> > -Dorg.eclipse.jetty.io.SelectorManager.submitKeyUpdates=true
> > -XX:-OmitStackTraceInFastThrow
> > -XX:ReservedCodeCacheSize=1G
> > -Djdk.nio.maxCachedBufferSize=30000000
> > -XX:G1MaxNewSizePercent=20
> > -XX:G1HeapRegionSize=32M
>
> At that heap size, the region size would be 32M anyway, may remove.
>
> > -Xms180G
> > -Xmx180G
> > -XX:MarkStackSize=64M
> > -XX:G1HeapWastePercent=2
>
> That is very aggressive. That causes the long mixed gcs at the end of an
> old gen space reclamation phase. Either increase that to 5 to cut off
> the "long" mixed gcs (still mostly within your 500ms pause time goal),
> or increase G1MixedGCCountTarget to something like 16 to spread out the
> work (you are observing the "added expensive regions to CSet" message).
> See the documentation [1] for more information.
>
> > -XX:ConcGCThreads=20
> > -XX:MaxGCPauseMillis=500
> > -XX:GCLockerRetryAllocationCount=5
> > -XX:MarkStackSizeMax=256M
> > -XX:G1OldCSetRegionThresholdPercent=20
> > -XX:InitiatingHeapOccupancyPercent=40
>
> Thanks,
> Thomas
>
> [0] https://www.youtube.com/watch?v=LppgqvKOUKs at 32:04; quoting from
> the video: "marking runs 50x faster". It also shows other jdk9
> improvements particularly applicable to running larger heaps.
>
> [1] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#GUID-D2B6ADCE-6766-4FF8-AA9D-B7F4F3D0F469
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

--
Regards
Timur Akhmadeev
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From csxulijie at gmail.com Sat Nov 18 12:53:55 2017
From: csxulijie at gmail.com (Lijie Xu)
Date: Sat, 18 Nov 2017 20:53:55 +0800
Subject: OOM error caused by large array allocation in G1
Message-ID:

Hi All,

I recently encountered an OOM error in a Spark application using the G1 collector. This application launches multiple JVM instances to process the large data. Each JVM has a 6.5GB heap size and uses the G1 collector. A JVM instance throws an OOM error during allocating a large (570MB) array. However, this JVM has about 3GB free heap space at that time. After analyzing the application logic, heap usage, and GC log, I guess the root cause may be the lack of consecutive space for holding this large array in G1. I want to know whether my guess is right and why G1 has this defect. In the following sections, I will detail the JVM info, application, OOM phase, and heap usage. Any suggestions will be appreciated.

*[JVM info]*
java version "1.8.0_121"
Oracle Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

*[Application]*
This is an SVM machine learning Spark application available at https://github.com/JerryLead/SparkGC/blob/master/src/main/scala/applications/ml/SVMWithSGDExample.scala.
*[OOM phase]*
This application generates large task results (including a 400MB *nio.ByteBuffer*) and uses *JavaSerializer* to serialize the task results. During serializing, the JavaSerializer tries to allocate (expand) a large *SerializeBuffer* (570MB) to hold this task result. The following GC log snippet shows the expansion details.

318.800: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 573046800 bytes]
318.800: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 573046800 bytes, attempted expansion amount: 573571072 bytes]
2017-11-17T09:58:17.362+0800: 318.802: [Full GC (Allocation Failure) 3895M->3855M(6656M), 1.7516132 secs]

Although this JVM has about 3GB free space, it throws an OOM error during the expansion as follows.

17/11/17 09:59:07 ERROR Executor: Exception in task 1.0 in stage 10.0 (TID 1048)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:238)
at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:66)
at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:62)
at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:62)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:62)
at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/11/17 09:59:07 DEBUG BlockManagerSlaveEndpoint: removing broadcast 14

*[G1 configuration]*
-Xmx6.5G -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCCause -XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+HeapDumpOnOutOfMemoryError

*[Heap size and usage]*
https://github.com/JerryLead/Misc/blob/master/OOM-SVM-G1-E1/OOM-SVM-G1-E1.pdf

*[GC log]*
https://github.com/JerryLead/Misc/blob/master/OOM-SVM-G1-E1/GClog
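As an editorial illustration of why roughly 3GB of free heap can still fail a 570MB humongous allocation, the sketch below computes how many contiguous free regions such a request needs; the candidate region sizes are assumptions, since the log above does not state which size G1 chose for this 6.5GB heap:

    public class HumongousMath {
        static final long MB = 1024 * 1024;

        public static void main(String[] args) {
            long request = 573046800L; // allocation request from the GC log above
            // Candidate region sizes; the real value depends on how G1
            // sized regions for this heap (likely a few MB by default).
            for (long region : new long[] { 2 * MB, 4 * MB, 16 * MB }) {
                long needed = (request + region - 1) / region;
                System.out.printf("region=%2dM -> %d contiguous free regions needed%n",
                        region / MB, needed);
            }
            // A humongous allocation needs that many adjacent free regions,
            // so 3GB of scattered free regions does not help; that is the
            // fragmentation discussed in the reply below.
        }
    }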
From thomas.schatzl at oracle.com Sat Nov 18 15:37:04 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Sat, 18 Nov 2017 16:37:04 +0100
Subject: OOM error caused by large array allocation in G1
In-Reply-To: References: Message-ID: <1511019424.3046.29.camel@oracle.com>

Hi,

On Sat, 2017-11-18 at 20:53 +0800, Lijie Xu wrote:
> Hi All,
> I recently encountered an OOM error in a Spark application using the G1
> collector. This application launches multiple JVM instances to
> process the large data. Each JVM has a 6.5GB heap size and uses the G1
> collector. A JVM instance throws an OOM error during allocating a
> large (570MB) array. However, this JVM has about 3GB free heap space
> at that time. After analyzing the application logic, heap usage, and
> GC log, I guess the root cause may be the lack of consecutive space
> for holding this large array in G1. I want to know whether my guess
> is right ...

Very likely. This is a long-standing issue (actually I once investigated it like 10 years ago on a different regional collector), and given your findings it is very likely you are correct. The issue also has an extra section in the tuning guide [0].

> ... and why G1 has this defect.

Nobody fixed it yet. :)

Reasons:
- the workaround is easy and typically "just works".
- no "real world" test setups available where fixes could be tested. People tend to disappear after getting to know the workaround. Unfortunately, Apache SPARK is probably one of the more frequent environments it happens in, but it still does not work on jdk9/10 and soon 11, where development happens.
- it's not very interesting work for many. Not sure why, probably because it involves implementing and evaluating longer term strategies in the collector to minimize the impact of fragmentation, which is a complex topic (at least if you are not satisfied with the last-ditch brute force approach).
- there are more problematic issues to deal with that affect more installations, have test setups, and have no (or no good) workaround.

Actually I have been discussing this with colleagues just last week again in the context of work for students/interns. :) If you want to look into this there are a bunch of CRs open that you might want to start with (e.g. [1][2][3]) to get an idea of possibilities - these CRs do not even mention the one brute force solution other VMs probably apply in that situation: have the full gc move large arrays too.

Feel free to start a discussion about this topic either here or preferably in the hotspot-gc-dev mailing list.

> In the following sections, I will detail the JVM info, application,
> OOM phase, and heap usage. Any suggestions will be appreciated.

Simply either increase the heap size or increase the region size via -XX:G1HeapRegionSize. I think 16m regions will fix the issue in your case without any other performance impact, and reduce the number of humongous objects significantly.

> [JVM info]
> java version "1.8.0_121"
> Oracle Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

While it won't impact this issue, I recommend updating at least to the latest 8u release. Not suggesting jdk 9 here because we know that SPARK does not work there yet.
Thanks,
  Thomas

[0] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#GUID-2428DA90-B93D-48E6-B336-A849ADF1C552
[1] https://bugs.openjdk.java.net/browse/JDK-8172713
[2] https://bugs.openjdk.java.net/browse/JDK-8038487
[3] https://bugs.openjdk.java.net/browse/JDK-8173627
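Some context on the suggested workaround: per the tuning guide referenced above, G1 treats any allocation of at least half a region as "humongous" and places it directly into contiguous free regions, so the region size controls both which objects become humongous and how many contiguous regions a large array needs. The following Java sketch is purely illustrative arithmetic, not VM code; only the half-region rule and the 573046800-byte request come from this thread.

    public class HumongousMath {
        public static void main(String[] args) {
            long request = 573_046_800L;  // the failed allocation request from the GC log
            for (long mb : new long[] {4, 8, 16, 32}) {
                long regionSize = mb * 1024 * 1024;  // -XX:HeapRegionSize=<mb>m
                long threshold = regionSize / 2;     // objects >= half a region are humongous
                // humongous objects need this many *contiguous* regions
                long regions = (request + regionSize - 1) / regionSize;
                System.out.printf("region=%dm threshold=%dKB contiguous-regions=%d%n",
                        mb, threshold / 1024, regions);
            }
        }
    }

With larger regions the humongous threshold rises, so fewer medium-sized objects land in the humongous space and fragment it - which is consistent with the report later in this thread that the application fails with 8m regions but passes with 16m and 32m.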
From thomas.schatzl at oracle.com  Sat Nov 18 15:43:33 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Sat, 18 Nov 2017 16:43:33 +0100
Subject: Large heap size, slow concurrent marking causing frequent full GC
References: <5ED18A0A-03D8-400F-957C-56BB9446FC4B@fb.com>
        <1510905386.2671.8.camel@oracle.com>
Message-ID: <1511019813.6199.3.camel@oracle.com>

Hi,

On Sat, 2017-11-18 at 05:49 +0000, Timur Akhmadeev wrote:
> Hi,
>
> You should also try using huge pages. With large heaps it's a must
> thing to do, I think.

We've seen both improvements and regressions when using huge pages on apparently suitable applications, so please check whether performance actually improves. Also, enabling huge pages does not fix the underlying problem presented in the previously linked video.

Thanks,
  Thomas
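For anyone who wants to try this, enabling huge pages involves both an OS step and a JVM flag. A typical (illustrative, measure-first) Linux setup reserves pages in the OS, for example via the vm.nr_hugepages sysctl, and then starts the JVM with flags along the lines of:

  -XX:+UseLargePages -Xms<heap size> -Xmx<heap size>

Setting -Xms equal to -Xmx is commonly recommended here so the whole heap is committed up front onto the reserved pages; the sizing is a placeholder, and as noted above, results vary by application, so measure before adopting this.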
From csxulijie at gmail.com  Tue Nov 21 13:48:44 2017
From: csxulijie at gmail.com (Lijie Xu)
Date: Tue, 21 Nov 2017 21:48:44 +0800
Subject: OOM error caused by large array allocation in G1
In-Reply-To: <1511019424.3046.29.camel@oracle.com>
References: <1511019424.3046.29.camel@oracle.com>

Hi Thomas,

Sorry for the late reply, and thanks for your nice explanation. I have added some comments and questions inline.

Thanks,
Lijie

On Sat, Nov 18, 2017 at 11:37 PM, Thomas Schatzl wrote:
> Very likely. This is a long-standing issue (I actually investigated
> it once, about 10 years ago, on a different regional collector), and
> given your findings it is very likely you are correct. The issue also
> has an extra section in the tuning guide [0].

==> This reference is very helpful for me. Another question: do the Parallel and CMS collectors have this defect too?

> [...]
>
> Simply either increase the heap size or increase the region size via
> -XX:HeapRegionSize. I think 16m regions will fix the issue in your
> case without any other performance impact, and reduce the number of
> humongous objects significantly.

==> Your guess is quite right. I changed the region size to 8m, 16m, and 32m. The application still throws an OOM error with 8m regions, but finishes successfully with 16m and 32m.

> [...]

From thomas.schatzl at oracle.com  Tue Nov 21 13:59:43 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Tue, 21 Nov 2017 14:59:43 +0100
Subject: OOM error caused by large array allocation in G1
References: <1511019424.3046.29.camel@oracle.com>
Message-ID: <1511272783.9073.13.camel@oracle.com>

Hi,

On Tue, 2017-11-21 at 21:48 +0800, Lijie Xu wrote:
> Hi Thomas,
[...]
> ==> This reference is very helpful for me. Another question: do the
> Parallel and CMS collectors have this defect too?

No. The Parallel and CMS full GCs always move all objects. I filed JDK-8191565 [0] to at least avoid the OOME. Maybe it can be fixed by jdk 11.

Thanks,
  Thomas

[0] https://bugs.openjdk.java.net/browse/JDK-8191565
From mailravi at gmail.com  Tue Nov 21 14:18:09 2017
From: mailravi at gmail.com (Ravi)
Date: Tue, 21 Nov 2017 19:48:09 +0530
Subject: OOM error caused by large array allocation in G1
In-Reply-To: <1511272783.9073.13.camel@oracle.com>
References: <1511019424.3046.29.camel@oracle.com>
        <1511272783.9073.13.camel@oracle.com>

Did you go through this blog?
https://performancetestexpert.wordpress.com/2017/03/16/important-configuration-parameters-for-tuning-apache-spark-job/

If you have limitations on hardware availability, especially RAM, one important suggestion from it is: "Storage level has been changed to 'Disk_Only': Before the change, we were getting OOM when processing 250K messages during the aggregation window of 300 seconds. After the change, we could process 540K messages in the aggregation window without getting OOM. Even though In-Memory gives better performance, due to the limitation of the hardware availability I had to implement Disk-Only."

Thanks
Ravi

On Tue, Nov 21, 2017 at 7:29 PM, Thomas Schatzl wrote:
> [...]
> No. The Parallel and CMS full GCs always move all objects. I filed
> JDK-8191565 [0] to at least avoid the OOME. Maybe it can be fixed by
> jdk 11.
>
> [0] https://bugs.openjdk.java.net/browse/JDK-8191565
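For reference, the storage-level change mentioned in the blog is a one-line change in Spark. A hedged Java sketch follows (Spark 2.x Java API; the class, method, and RDD names are placeholders for whatever dataset the application caches):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    public class DiskOnlyCache {
        // Persist a large dataset on local disk instead of the JVM heap,
        // trading some throughput for less GC pressure and fewer large
        // in-heap buffers.
        static <T> JavaRDD<T> cacheOnDisk(JavaRDD<T> rdd) {
            return rdd.persist(StorageLevel.DISK_ONLY());
        }
    }

Note this only moves cached RDD data out of the heap; the 570MB serialized task result discussed earlier in the thread is allocated during result serialization and would still be a single large in-heap array regardless of the storage level of cached data.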
From csxulijie at gmail.com  Tue Nov 21 15:47:38 2017
From: csxulijie at gmail.com (Lijie Xu)
Date: Tue, 21 Nov 2017 23:47:38 +0800
Subject: OOM error caused by large array allocation in G1
References: <1511019424.3046.29.camel@oracle.com>
        <1511272783.9073.13.camel@oracle.com>

Hi Ravi,

The suggestions in the blog are quite useful. This SVM application really does cache some training data in memory. To lower the memory consumption, many methods can be used, including changing the storage level, lowering the execution memory threshold, and improving the parallelism. I'm doing research work on garbage collection, and want to find some defects in memory management through experiments.

Thanks,
Lijie

On Tue, Nov 21, 2017 at 10:18 PM, Ravi wrote:
> Did you go through this blog?
> https://performancetestexpert.wordpress.com/2017/03/16/important-configuration-parameters-for-tuning-apache-spark-job/
> [...]

From thomas.schatzl at oracle.com  Tue Nov 21 16:30:30 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Tue, 21 Nov 2017 17:30:30 +0100
Subject: OOM error caused by large array allocation in G1
References: <1511019424.3046.29.camel@oracle.com>
        <1511272783.9073.13.camel@oracle.com>
Message-ID: <1511281830.2427.5.camel@oracle.com>

Hi,

On Tue, 2017-11-21 at 23:47 +0800, Lijie Xu wrote:
> [...] I'm doing research work on garbage collection, and want to
> find some defects in memory management through experiments.

If by memory management you mean VM memory management, please keep us posted. You may also be able to filter out known issues by looking through the bug tracker: a query like this in the "advanced" search yields quite a few issues:

    project = JDK AND issuetype in (Bug, Enhancement, JEP)
    AND status in (Open, "In Progress", New)
    AND component = hotspot AND Subcomponent = gc

Many of those are probably overly specific, though.

Thanks,
  Thomas

From yangang at ec.com.cn  Sat Nov 25 20:43:58 2017
From: yangang at ec.com.cn (Yan Gang)
Date: Sat, 25 Nov 2017 20:43:58 +0800 (CST)
Subject: Question about ClassSoft.collect()
Message-ID: <20171125124359.517FD1C0E69@mx3.ec.com.cn>

An HTML attachment was scrubbed...