Stacks, safepoints, snapshotting and GC

Fri Jan 10 13:39:18 UTC 2020

Hi Roman,

The most expensive things we do in the pause that scale with the size of 
some kind of working set are:

1) Monitor deflation in safepoint cleanup (being handled by Dan, 
expecting this to be a thing of the past soon)
2) Sampling nmethod hotness counters in safepoint cleanup (can be moved 
to concurrent with this barrier)
3) Processing thread stacks by GC (can be moved to concurrent with this 
barrier)

So with this barrier, work from Dan, and safepoint synchronization 
improvements from Robbin, there won't be much
of anything of significance left to be done in the safepoints (with 
ZGC). At least that's the idea.

/Erik

On 1/10/20 12:37 PM, Roman Kennke wrote:
> Hi Erik,
>
> This is interesting. I had ideas about this, and actually made a
> prototype (in Shenandoah) that worked somewhat, and it sounds exactly
> like what you have in mind. Nice to see the approaches aligning :-)
>
> I put it on the shelf back then though, because the time to scan
> threads was very much dominated by other things then, e.g.
> time-to-safepoint. Not sure how things look now? I am not aware that we
> had improvements on the TTSP front..
>
> Cheers,
> Roman
>
>
>> The plan is to utilize something I call a stack watermark barrier. It involves rewriting the thread-local handshakes of today to use a conditional branch. For the polls we do today on returns, the conditional branch will compare the stack pointer (rsp) to a thread-local value. This improved polling scheme can trap thread execution for 1) safepoints, 2) handshakes, 3) returns to unprocessed frames. This type of barrier allows the GC to concurrently slide the watermark (disarm frames) without having to poke at it.
>>
>> There won’t be any foreseeable issues with weak memory ordering here. You are gonna love it!
>>
>> /Erik
>>
>>> On 9 Jan 2020, at 18:06, Stuart Monteith <stuart.monteith at linaro.org> wrote:
>>>
>>> Hi Per,
>>>    I'm curious as to how we'll implement this on Aarch64, the concern
>>> being how this would be made efficient with a weak memory model. Are
>>> you looking at using return barriers?
>>>
>>> BR,
>>>    Stuart
>>>
>>>> On Tue, 7 Jan 2020 at 12:59, Per Liden <per.liden at oracle.com> wrote:
>>>>
>>>> While we're on this topic I thought I could mention that part of the
>>>> plans to make ZGC a true sub-millisecond max pause time GC includes
>>>> removing thread stacks completely from the GC root set. I.e. in such a
>>>> world ZGC will not scan any thread stacks (virtual or not) during STW,
>>>> instead they will be scanned concurrently.
>>>>
>>>> But we're not quite there yet...
>>>>
>>>> cheers,
>>>> Per
>>>>
>>>>> On 12/19/19 1:52 PM, Ron Pressler wrote:
>>>>>
>>>>> This is a very good question. Virtual thread stacks (which are actually
>>>>> continuation stacks from the VM’s perspective) are not GC roots, and so are
>>>>> not scanned as part of the STW root-scanning. How and when they are scanned
>>>>> is one of the core differences between the default implementation and the
>>>>> new one, enabled with -XX:+UseContinuationChunks.
>>>>>
>>>>> Virtual threads shouldn’t make any impact on time-to-safepoint, and,
>>>>> depending on the implementation, they may or may not make an impact
>>>>> on STW young-generation collection. How the different implementations
>>>>> impact ZGC/Shenandoah, the non-generational low-pause collectors, is yet
>>>>> to be explored and addressed. I would assume that their current impact
>>>>> is that they simply crash them :)
>>>>>
>>>>> - Ron
>>>>>
>>>>>
>>>>>
>>>>>> On 19 December 2019 at 11:40:03, Holger Hoffstätte (holger at applied-asynchrony.com) wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Quick question - not sure if this is an actual issue or something that has
>>>>>> been addressed yet; pointers to docs welcome.
>>>>>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>>>>>> There have been some amazing advances in GC with Shenandoah and ZGC recently,
>>>>>> but their low pause times directly depend on the ability to quickly reach
>>>>>> safepoints and take stack snapshots for liveness analysis.
>>>>>> How will this work with potentially one or two orders of magnitude more
>>>>>> virtual thread stacks? If I understand correctly TTSP should only depend
>>>>>> on the number of carrier threads (which fortunately should be much lower
>>>>>> than in legacy designs), but somehow the virtual stacks still need to be
>>>>>> scraped... right?
>>>>>>
>>>>>> thanks,
>>>>>> Holger
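
For readers following the thread, the stack watermark poll that Erik describes can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not HotSpot code: the class, field and method names are all hypothetical, the stack is assumed to grow toward lower addresses, and a real implementation would need atomic updates where the GC and the mutator thread touch the same word. It shows a single compare-and-branch on return trapping for a pending safepoint or handshake and for a return into an unprocessed frame, and it shows the GC sliding the watermark so later returns take the fast path.

```java
import java.util.ArrayList;
import java.util.List;

// Toy, single-threaded model of a stack watermark poll. All names here are
// hypothetical; this is not HotSpot code.
class StackWatermarkSketch {
    static final long ALWAYS_TRAP = 0L;             // any positive sp is above this
    static final long NEVER_TRAP = Long.MAX_VALUE;  // no sp is above this

    // Assumption: the stack grows toward lower addresses, and frames with
    // sp <= watermark have already been processed by the GC.
    long watermark = NEVER_TRAP;
    long pollWord = NEVER_TRAP;   // the thread-local value the poll compares against
    boolean pending = false;      // a safepoint or handshake has been requested
    final List<String> log = new ArrayList<>();

    // Fast path at a method return: one compare and one conditional branch.
    void pollOnReturn(long callerSp) {
        if (callerSp > pollWord) {
            slowPath(callerSp);
        }
    }

    // Slow path: serve a pending request, then process the frame if needed.
    void slowPath(long callerSp) {
        if (pending) {
            log.add("safepoint/handshake at sp=" + callerSp);
            pending = false;
        }
        if (callerSp > watermark) {
            log.add("processed frame at sp=" + callerSp);
            watermark = callerSp;
        }
        pollWord = watermark;     // re-publish the watermark as the poll value
    }

    // Arming makes every return trap, whatever the stack pointer is.
    void armSafepointOrHandshake() { pending = true; pollWord = ALWAYS_TRAP; }

    // At the start of a cycle only the top frame counts as processed.
    void startGcCycle(long topFrameSp) { watermark = topFrameSp; publish(); }

    // The GC slides the watermark past frames it has processed itself.
    void gcProcessUpTo(long sp) { watermark = Math.max(watermark, sp); publish(); }

    private void publish() { if (!pending) pollWord = watermark; }

    public static void main(String[] args) {
        StackWatermarkSketch t = new StackWatermarkSketch();
        t.startGcCycle(100);          // frames at sp 200 and 300 are unprocessed
        t.pollOnReturn(200);          // traps: return into an unprocessed frame
        t.gcProcessUpTo(300);         // GC processes the rest of the stack
        t.pollOnReturn(300);          // fast path: frame already processed
        t.armSafepointOrHandshake();
        t.pollOnReturn(150);          // traps: pending safepoint or handshake
        t.log.forEach(System.out::println);
    }
}
```

In this model the same comparison serves all three trap reasons listed in the thread: arming sets the poll value so low that every return traps, and otherwise the poll value simply tracks the watermark.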