From per.liden at oracle.com Mon May 18 21:45:59 2020
From: per.liden at oracle.com (Per Liden)
Date: Mon, 18 May 2020 23:45:59 +0200
Subject: JVM stalls around uncommitting
In-Reply-To: <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com>
References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com>
Message-ID: <68eabfdf-b897-ad91-2b5c-597bbcfb7533@oracle.com>

Hi,

On 4/1/20 9:14 AM, Per Liden wrote:
> Hi,
>
> On 3/31/20 9:59 PM, Zoltan Baranyi wrote:
>> Hi ZGC Team,
>>
>> I run benchmarks against our application using ZGC on heaps on the
>> scale of a few hundred GB. In the beginning everything goes smoothly, but
>> eventually I experience very long JVM stalls, sometimes longer than one
>> minute. According to the JVM log, reaching safepoints occasionally takes
>> a very long time, matching the duration of the stalls I experience.
>>
>> After a few iterations, I started looking at uncommitting and learned
>> that the way ZGC performs uncommitting - flushing the pages, punching
>> holes, removing blocks from the backing file - can be expensive [1] when
>> uncommitting tens or more than a hundred GB of memory. The trace-level
>> heap logs confirmed that uncommitting blocks of this size takes many
>> seconds. After disabling uncommitting, my benchmark runs without the huge
>> stalls and the overall experience with ZGC is quite good.
>>
>> Since uncommitting is done asynchronously to the mutators, I expected it
>> not to interfere with them. My understanding is that flushing,
>> bookkeeping and uncommitting are done under a mutex [2], and contention
>> on that can be the source of the stalls I see, such as when there is a
>> demand to commit memory while uncommitting is taking place. Can you
>> confirm whether the above explanation makes sense to you? If so, is
>> there a cure for this that I couldn't find? Like a time bound or a cap
>> on the amount of memory that can be uncommitted in one go.
>
> Yes, uncommitting is relatively expensive. And it's also true that there
> is a potential for lock contention affecting mutators. That can be
> improved in various ways. Like you say, uncommitting in smaller chunks,
> or possibly by releasing the lock while doing the actual syscall.
>
> If you still want uncommit to happen, one thing to try is using large
> pages (-XX:+UseLargePages), since committing/uncommitting large pages is
> typically less expensive.
>
> This issue is on our radar, so we intend to improve this going forward.

Just a follow up to let you know that a fix for the uncommit issue is
now out for review here:

http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-May/029754.html

With these patches the interference you experienced from uncommitting
memory should be gone. As a bonus, the normal allocation/commit path is
now also a lot more streamlined and does all expensive work (e.g.
committing memory) outside of the allocation lock.

cheers,
Per

>
> cheers,
> Per
>
>>
>> This is an example log captured during a stall:
>>
>> [1778,704s][info ][safepoint] Safepoint "ZMarkStart", Time since last:
>> 34394880194 ns, Reaching safepoint: 247308 ns, At safepoint: 339634 ns,
>> Total: 586942 ns
>> [1833,707s][trace][gc,heap ] Uncommitting memory: 459560M-459562M (2M)
>> [...]
>> [... zillions of continuous uncommitting log lines ...]
>> [...]
>> [1846,076s][trace][gc,heap ] Uncommitting memory: 84M-86M (2M)
>> [1846,076s][info ][gc,heap ] Capacity: 528596M(86%)->386072M(63%),
>> Uncommitted: 142524M
>> [1846,076s][trace][gc,heap ] Uncommit Timeout: 1s
>> [1846,078s][info ][safepoint] Safepoint "Cleanup", Time since last:
>> 18001682918 ns, Reaching safepoint: 49371131055 ns, At safepoint: 252559
>> ns, Total: 49371383614 ns
>>
>> In the above case TTSP is 49s, while the uncommitting lines cover only
>> 13s. The TTSP would indicate that the safepoint request was signaled
>> at 1797s, but the log is empty between 1778s and 1833s. If my
>> understanding above is correct, could it be that waiting for the
>> mutex, flushing etc. takes that much time and is just not visible in
>> the log?
>>
>> If needed, I can dig out more details, since I can reliably reproduce
>> the stalls.
>>
>> My environment is OpenJDK 14 running on Linux 5.2.9 with these
>> arguments: "-Xmx600G -XX:+HeapDumpOnOutOfMemoryError
>> -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseNUMA
>> -XX:+AlwaysPreTouch -Xlog:gc,safepoint,gc+heap=trace:jvm.log".
>>
>> Best regards,
>> Zoltan
>>
>> [1]
>> https://github.com/openjdk/zgc/blob/d90d2b1097a9de06d8b6e3e6f2f6bd4075471fa0/src/hotspot/os/linux/gc/z/zPhysicalMemoryBacking_linux.cpp#L566-L573
>>
>> [2]
>> https://github.com/openjdk/zgc/blob/d90d2b1097a9de06d8b6e3e6f2f6bd4075471fa0/src/hotspot/share/gc/z/zPageAllocator.cpp#L685-L711
>>
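[A minimal sketch of the workarounds discussed in the thread above, until a build with the uncommit fix is available. It assumes a JDK 13 or later ZGC build, where uncommit can be switched off with -XX:-ZUncommit or its frequency reduced with -XX:ZUncommitDelay=<seconds>; "MyApp" is a placeholder for the actual application, and on JDK 14 ZGC still requires -XX:+UnlockExperimentalVMOptions.]

  # Option 1: disable uncommit entirely, so committed memory stays stable:
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx600G \
       -XX:-ZUncommit \
       -Xlog:gc,safepoint,gc+heap=trace:jvm.log MyApp

  # Option 2: keep uncommit but use large pages, as suggested above
  # (requires huge pages to be configured on the host):
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx600G \
       -XX:+UseLargePages \
       -Xlog:gc,safepoint,gc+heap=trace:jvm.log MyApp
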
From raell at web.de Mon May 25 13:13:41 2020
From: raell at web.de (raell at web.de)
Date: Mon, 25 May 2020 15:13:41 +0200
Subject: Stacks, safepoints, snapshotting and GC
Message-ID: 

Hi,

I just wanted to ask if there are concrete release plans for
concurrent thread scanning/root processing in ZGC?

Regards
Ralph

On 7 January 2020 at 12:58:48, Per Liden wrote:

> While we're on this topic I thought I could mention that part of the
> plans to make ZGC a true sub-millisecond max pause time GC includes
> removing thread stacks completely from the GC root set. I.e. in such a
> world ZGC will not scan any thread stacks (virtual or not) during STW,
> instead they will be scanned concurrently.

> But we're not quite there yet...

> cheers,
> Per

> On 12/19/19 1:52 PM, Ron Pressler wrote:
>>
>> This is a very good question. Virtual thread stacks (which are actually
>> continuation stacks from the VM's perspective) are not GC roots, and so are
>> not scanned as part of the STW root-scanning. How and when they are scanned
>> is one of the core differences between the default implementation and the
>> new one, enabled with -XX:+UseContinuationChunks.
>>
>> Virtual threads shouldn't make any impact on time-to-safepoint, and,
>> depending on the implementation, they may or may not make an impact
>> on STW young-generation collection. How the different implementations
>> impact ZGC/Shenandoah, the non-generational low-pause collectors, is yet
>> to be explored and addressed. I would assume that their current impact
>> is that they simply crash them :)
>>
>> - Ron
>>
>>
>> On 19 December 2019 at 11:40:03, Holger Hoffstätte (holger at applied-asynchrony.com) wrote:
>>>
>>> Hi,
>>>
>>> Quick question - not sure if this is an actual issue or something that has
>>> been addressed yet; pointers to docs welcome.
>>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>>> There have been some amazing advances in GC with Shenandoah and ZGC recently,
>>> but their low pause times directly depend on the ability to quickly reach
>>> safepoints and take stack snapshots for liveness analysis.
>>> How will this work with potentially one or two orders of magnitude more
>>> virtual thread stacks? If I understand correctly, TTSP should only depend
>>> on the number of carrier threads (which fortunately should be much lower
>>> than in legacy designs), but somehow the virtual stacks still need to be
>>> scraped, right?
>>>
>>> thanks,
>>> Holger
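[For anyone wanting to put numbers on the time-to-safepoint questions raised above: the safepoint log tag used earlier in this thread already reports TTSP per safepoint. A small sketch, with the log file name and "MyApp" as placeholders:]

  # Record safepoint statistics while the application runs:
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
       -Xlog:safepoint:safepoint.log MyApp

  # Each safepoint line contains "Reaching safepoint: <ns>", i.e. the TTSP:
  grep 'Reaching safepoint' safepoint.log
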
From per.liden at oracle.com Mon May 25 18:09:09 2020
From: per.liden at oracle.com (Per Liden)
Date: Mon, 25 May 2020 20:09:09 +0200
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: 
Message-ID: <22ab8134-3ef4-2008-4b51-ffc823c1c3fa@oracle.com>

This work is tracked by JEP 376 (https://openjdk.java.net/jeps/376).
This JEP is not yet targeted to a specific release. However, you can
already try it out today by building a JDK from
https://github.com/openjdk/zgc, which has the latest patches for
concurrent thread stack scanning.

cheers,
Per

On 5/25/20 3:13 PM, raell at web.de wrote:
> Hi,
>
> I just wanted to ask if there are concrete release plans for
> concurrent thread scanning/root processing in ZGC?
>
> Regards
> Ralph
>
>
> On 7 January 2020 at 12:58:48, Per Liden wrote:
>
>> While we're on this topic I thought I could mention that part of the
>> plans to make ZGC a true sub-millisecond max pause time GC includes
>> removing thread stacks completely from the GC root set. I.e. in such a
>> world ZGC will not scan any thread stacks (virtual or not) during STW,
>> instead they will be scanned concurrently.
>
>> But we're not quite there yet...
>
>> cheers,
>> Per
>
>> On 12/19/19 1:52 PM, Ron Pressler wrote:
>>>
>>> This is a very good question. Virtual thread stacks (which are actually
>>> continuation stacks from the VM's perspective) are not GC roots, and so are
>>> not scanned as part of the STW root-scanning. How and when they are scanned
>>> is one of the core differences between the default implementation and the
>>> new one, enabled with -XX:+UseContinuationChunks.
>>>
>>> Virtual threads shouldn't make any impact on time-to-safepoint, and,
>>> depending on the implementation, they may or may not make an impact
>>> on STW young-generation collection. How the different implementations
>>> impact ZGC/Shenandoah, the non-generational low-pause collectors, is yet
>>> to be explored and addressed. I would assume that their current impact
>>> is that they simply crash them :)
>>>
>>> - Ron
>>>
>>>
>>> On 19 December 2019 at 11:40:03, Holger Hoffstätte (holger at applied-asynchrony.com) wrote:
>>>>
>>>> Hi,
>>>>
>>>> Quick question - not sure if this is an actual issue or something that has
>>>> been addressed yet; pointers to docs welcome.
>>>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>>>> There have been some amazing advances in GC with Shenandoah and ZGC recently,
>>>> but their low pause times directly depend on the ability to quickly reach
>>>> safepoints and take stack snapshots for liveness analysis.
>>>> How will this work with potentially one or two orders of magnitude more
>>>> virtual thread stacks? If I understand correctly, TTSP should only depend
>>>> on the number of carrier threads (which fortunately should be much lower
>>>> than in legacy designs), but somehow the virtual stacks still need to be
>>>> scraped, right?
>>>>
>>>> thanks,
>>>> Holger
>
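[Building the https://github.com/openjdk/zgc repository mentioned above follows the normal OpenJDK build flow; a rough sketch under the usual assumptions: the boot JDK path and build configuration name are environment-specific, and depending on the state of the branch ZGC may still require -XX:+UnlockExperimentalVMOptions.]

  git clone https://github.com/openjdk/zgc
  cd zgc
  bash configure --with-boot-jdk=/path/to/recent/jdk
  make images

  # Run the freshly built JDK with ZGC and basic GC logging:
  ./build/*/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions \
      -XX:+UseZGC -Xlog:gc -version
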