From per.liden at oracle.com Tue Jan 7 12:58:48 2020
From: per.liden at oracle.com (Per Liden)
Date: Tue, 7 Jan 2020 13:58:48 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: 
Message-ID: 

While we're on this topic I thought I could mention that part of the
plans to make ZGC a true sub-millisecond max pause time GC includes
removing thread stacks completely from the GC root set. I.e. in such a
world ZGC will not scan any thread stacks (virtual or not) during STW;
instead they will be scanned concurrently.

But we're not quite there yet...

cheers,
Per

On 12/19/19 1:52 PM, Ron Pressler wrote:
> 
> This is a very good question. Virtual thread stacks (which are actually
> continuation stacks from the VM's perspective) are not GC roots, and so are
> not scanned as part of the STW root-scanning. How and when they are scanned
> is one of the core differences between the default implementation and the
> new one, enabled with -XX:+UseContinuationChunks.
> 
> Virtual threads shouldn't have any impact on time-to-safepoint, and,
> depending on the implementation, they may or may not have an impact
> on STW young-generation collection. How the different implementations
> impact ZGC/Shenandoah, the non-generational low-pause collectors, is yet
> to be explored and addressed. I would assume that their current impact
> is that they simply crash them :)
> 
> - Ron
> 
> 
> On 19 December 2019 at 11:40:03, Holger Hoffstätte
> (holger at applied-asynchrony.com) wrote:
> 
>> Hi,
>> 
>> Quick question - not sure if this is an actual issue or something that
>> has already been addressed; pointers to docs welcome.
>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>> There have been some amazing advances in GC with Shenandoah and ZGC
>> recently, but their low pause times directly depend on the ability to
>> quickly reach safepoints and take stack snapshots for liveness analysis.
>> How will this work with potentially one or two orders of magnitude more
>> virtual thread stacks? If I understand correctly TTSP should only depend
>> on the number of carrier threads (which fortunately should be much lower
>> than in legacy designs), but somehow the virtual stacks still need to be
>> scraped, right?
>> 
>> thanks,
>> Holger
> 
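
Ron's distinction is easiest to see in a toy model: platform-thread stacks
are GC roots that must be walked during the pause, while frozen
virtual-thread frames live in ordinary heap objects and are reached by
normal tracing (e.g. through the object that owns the continuation). The
C++ sketch below is purely illustrative; all the types and the scan_roots
function are hypothetical stand-ins, not HotSpot source.

    #include <vector>

    struct Oop { bool marked = false; };       // a heap object (toy)

    struct Frame {
        std::vector<Oop*> ref_slots;           // stack slots holding references
    };

    struct PlatformThread {
        std::vector<Frame> stack;              // native stack: a GC *root*
    };

    struct ContinuationChunk {                 // frozen virtual-thread frames,
        std::vector<Frame> frames;             // stored as an ordinary heap object
        ContinuationChunk* parent = nullptr;   // traced like any other field
    };

    void mark(Oop* o) {                        // transitive marking (elided)
        if (o != nullptr) o->marked = true;
    }

    // STW root scanning: only platform-thread stacks are walked here.
    // Virtual-thread stacks never appear; their ContinuationChunks are
    // marked later, when ordinary heap tracing reaches them, so they
    // add nothing to the pause.
    void scan_roots(std::vector<PlatformThread*>& threads) {
        for (PlatformThread* t : threads)
            for (Frame& f : t->stack)
                for (Oop* ref : f.ref_slots)
                    mark(ref);
    }

Under this model, adding a million virtual threads grows the concurrent
heap-tracing work but adds nothing to root scanning, which is exactly the
property Holger's question probes.
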
From stuart.monteith at linaro.org Thu Jan 9 17:05:46 2020
From: stuart.monteith at linaro.org (Stuart Monteith)
Date: Thu, 9 Jan 2020 17:05:46 +0000
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: 
Message-ID: 

Hi Per,
   I'm curious as to how we'll implement this on AArch64, the concern
being how this would be made efficient with a weak memory model. Are
you looking at using return barriers?

BR,
   Stuart

On Tue, 7 Jan 2020 at 12:59, Per Liden wrote:
>
> While we're on this topic I thought I could mention that part of the
> plans to make ZGC a true sub-millisecond max pause time GC includes
> removing thread stacks completely from the GC root set. I.e. in such a
> world ZGC will not scan any thread stacks (virtual or not) during STW;
> instead they will be scanned concurrently.
>
> But we're not quite there yet...
>
> cheers,
> Per

From erik.osterlund at oracle.com Thu Jan 9 23:02:09 2020
From: erik.osterlund at oracle.com (Erik Österlund)
Date: Fri, 10 Jan 2020 00:02:09 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: 
Message-ID: 

Hi Stuart,

The plan is to utilize something I call a stack watermark barrier. It
involves rewriting the thread-local handshakes of today to use a
conditional branch. For polls on returns, the conditional branch will
compare the stack pointer (rsp) to a thread-local value. This improved
polling scheme can trap thread execution for 1) safepoints, 2)
handshakes, 3) returns to unprocessed frames. This type of barrier
allows the GC to concurrently slide the watermark (disarm frames)
without having to poke at it.

There won't be any foreseeable issues with weak memory ordering here.
You are gonna love it!

/Erik

> On 9 Jan 2020, at 18:06, Stuart Monteith wrote:
>
> Hi Per,
>    I'm curious as to how we'll implement this on AArch64, the concern
> being how this would be made efficient with a weak memory model. Are
> you looking at using return barriers?
>
> BR,
>    Stuart
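
To make the mechanism concrete, here is a minimal sketch of the poll Erik
describes, assuming a downward-growing stack and entirely hypothetical
names (the idea later shipped as JEP 376, "ZGC: Concurrent Thread-Stack
Processing"). In compiled code the poll is a single compare-and-branch of
rsp against a thread-local word; the C++ below only models the logic.

    #include <atomic>
    #include <cstdint>

    // Toy model of one thread's poll state; the real field lives in the
    // JavaThread. Stacks grow downward, so frames *above* (numerically
    // greater than) the watermark are the older, not-yet-processed ones.
    struct ToyThread {
        std::atomic<uintptr_t> poll_word{UINTPTR_MAX};

        static constexpr uintptr_t DISARMED    = UINTPTR_MAX; // sp never above
        static constexpr uintptr_t FULLY_ARMED = 0;           // sp always above
    };

    // Slow path: park for a safepoint/handshake, or process the frames
    // being returned into and slide the watermark (elided in this sketch).
    void poll_slow_path(ToyThread* t, uintptr_t sp) { (void)t; (void)sp; }

    // Conceptual poll emitted at every method return: one conditional
    // branch against a thread-local value, cheap in the common case.
    inline void return_poll(ToyThread* t, uintptr_t sp) {
        if (sp > t->poll_word.load(std::memory_order_relaxed)) {
            poll_slow_path(t, sp);
        }
    }

Arming is then just a store to poll_word: FULLY_ARMED makes every return
trap (covering safepoints and handshakes), while storing a frame address
traps only returns into frames above that watermark, which is the third
case Erik lists.
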
From rkennke at redhat.com Fri Jan 10 11:37:17 2020
From: rkennke at redhat.com (Roman Kennke)
Date: Fri, 10 Jan 2020 12:37:17 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: 
Message-ID: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com>

Hi Erik,

This is interesting. I had ideas about this, and actually made a
prototype (in Shenandoah) that worked somewhat, and it sounds exactly
like what you have in mind. Nice to see the approaches aligning :-)

I put it on the shelf back then though, because the time to scan
threads was very much dominated by other things then, e.g.
time-to-safepoint. Not sure how things look now? I am not aware that we
had improvements on the TTSP front...

Cheers,
Roman

> The plan is to utilize something I call a stack watermark barrier. It
> involves rewriting the thread-local handshakes of today to use a
> conditional branch. For polls on returns, the conditional branch will
> compare the stack pointer (rsp) to a thread-local value. This improved
> polling scheme can trap thread execution for 1) safepoints, 2)
> handshakes, 3) returns to unprocessed frames. This type of barrier
> allows the GC to concurrently slide the watermark (disarm frames)
> without having to poke at it.
>
> There won't be any foreseeable issues with weak memory ordering here.
> You are gonna love it!
>
> /Erik
From robbin.ehn at oracle.com Fri Jan 10 13:32:09 2020
From: robbin.ehn at oracle.com (Robbin Ehn)
Date: Fri, 10 Jan 2020 14:32:09 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com>
References: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com>
Message-ID: <5286392a-b595-a8ef-784f-0dd09ff980bf@oracle.com>

Hi Roman,

On 1/10/20 12:37 PM, Roman Kennke wrote:
> time-to-safepoint. Not sure how things look now? I am not aware that we
> had improvements on the TTSP front...

We did some last year:
https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036277.html

"# Shenandoah, fast safepoints
...
Serial.test:·safepoints.ttsp.avg    thrpt    0.108 ms ; 7.4x lower!"

/Robbin
From erik.osterlund at oracle.com Fri Jan 10 13:39:18 2020
From: erik.osterlund at oracle.com (Erik Österlund)
Date: Fri, 10 Jan 2020 14:39:18 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com>
References: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com>
Message-ID: 

Hi Roman,

The most expensive things we do in the pause that scale with the size
of some kind of working set are:

1) Monitor deflation in safepoint cleanup (being handled by Dan,
expecting this to be a memory of the past soon)
2) Sampling nmethod hotness counters in safepoint cleanup (can be moved
to concurrent with this barrier)
3) Processing thread stacks by GC (can be moved to concurrent with this
barrier)

So with this barrier, work from Dan, and safepoint synchronization
improvements from Robbin, there won't be much of anything of
significance left to be done in the safepoints (with ZGC). At least
that's the idea.

/Erik

On 1/10/20 12:37 PM, Roman Kennke wrote:
> Hi Erik,
>
> This is interesting. I had ideas about this, and actually made a
> prototype (in Shenandoah) that worked somewhat, and it sounds exactly
> like what you have in mind. Nice to see the approaches aligning :-)
>
> I put it on the shelf back then though, because the time to scan
> threads was very much dominated by other things then, e.g.
> time-to-safepoint. Not sure how things look now? I am not aware that we
> had improvements on the TTSP front...
>
> Cheers,
> Roman
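
Item 3 is where the watermark barrier does its work: the GC processes a
stack outside the pause and publishes progress by sliding the thread's
watermark, while a running thread that races ahead fixes the frame it is
returning into itself, in the poll's slow path. A self-contained toy
sketch, with hypothetical names and all real-world synchronization
between GC and mutator elided:

    #include <atomic>
    #include <cstdint>

    struct ToyThread {
        std::atomic<uintptr_t> poll_word{0};   // 0 = fully armed: every return traps
    };

    // Fix the references in a single frame (elided).
    void process_frame(uintptr_t frame_sp) { (void)frame_sp; }

    // Toy stack walk: pretend every frame is 64 bytes.
    uintptr_t next_frame(uintptr_t frame_sp) { return frame_sp + 64; }

    // GC side, running concurrently with the thread: process frames from
    // the newest toward the stack base, publishing progress by sliding
    // the watermark so processed frames no longer trap on return.
    void gc_process_stack(ToyThread* t, uintptr_t newest_sp, uintptr_t base) {
        for (uintptr_t f = newest_sp; f < base; f = next_frame(f)) {
            process_frame(f);
            t->poll_word.store(f, std::memory_order_release);
        }
        t->poll_word.store(UINTPTR_MAX, std::memory_order_release); // disarm
    }

    // Mutator slow path, reached via the return poll: if the thread
    // returns into a frame the GC has not reached yet, it processes that
    // frame itself and moves on. It never observes an unprocessed frame,
    // and the STW pause no longer pays for stack depth at all.
    void poll_slow_path(ToyThread* t, uintptr_t sp) {
        process_frame(sp);
        t->poll_word.store(sp, std::memory_order_release);
    }
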
From rkennke at redhat.com Fri Jan 10 14:18:47 2020
From: rkennke at redhat.com (Roman Kennke)
Date: Fri, 10 Jan 2020 15:18:47 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: <5286392a-b595-a8ef-784f-0dd09ff980bf@oracle.com>
References: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com> <5286392a-b595-a8ef-784f-0dd09ff980bf@oracle.com>
Message-ID: 

Hi Robbin,

On 1/10/20 2:32 PM, Robbin Ehn wrote:
>> time-to-safepoint. Not sure how things look now? I am not aware that we
>> had improvements on the TTSP front...
>
> We did some last year:
> https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036277.html
>
> "# Shenandoah, fast safepoints
> ...
> Serial.test:·safepoints.ttsp.avg    thrpt    0.108 ms ; 7.4x lower!"

Wow, this is great! How have I missed that? :-) Too many things going
on all the time...

Thanks!!
Roman
From rkennke at redhat.com Fri Jan 10 14:20:08 2020
From: rkennke at redhat.com (Roman Kennke)
Date: Fri, 10 Jan 2020 15:20:08 +0100
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: <380f789b-bbc2-062f-b18d-978b9e56351a@redhat.com>
Message-ID: <5df3c6f6-c62f-b4a3-fb33-e7215a30ca12@redhat.com>

Hi Erik,

> The most expensive things we do in the pause that scale with the size
> of some kind of working set are:
>
> 1) Monitor deflation in safepoint cleanup (being handled by Dan,
> expecting this to be a memory of the past soon)
> 2) Sampling nmethod hotness counters in safepoint cleanup (can be moved
> to concurrent with this barrier)
> 3) Processing thread stacks by GC (can be moved to concurrent with this
> barrier)
>
> So with this barrier, work from Dan, and safepoint synchronization
> improvements from Robbin, there won't be much of anything of
> significance left to be done in the safepoints (with ZGC). At least
> that's the idea.

That's awesome! :-)

Cheers,
Roman