Stacks, safepoints, snapshotting and GC

Fri Jan 10 14:20:08 UTC 2020

Hi Erik,

> The most expensive things we do in the pause that scale with the size of
> some kind of working set are:
> 
> 1) Monitor deflation in safepoint cleanup (being handled by Dan,
> expecting this to be a memory of the past soon)
> 2) Sampling nmethod hotness counters in safepoint cleanup (can be moved
> to concurrent with this barrier)
> 3) Processing thread stacks by GC (can be moved to concurrent with this
> barrier)
> 
> So with this barrier, work from Dan, and safepoint synchronization
> improvements from Robbin, there won't be much
> of anything of significance left to be done in the safepoints (with
> ZGC). At least that's the idea.

That's awesome! :-)

Cheers,
Roman

> On 1/10/20 12:37 PM, Roman Kennke wrote:
>> Hi Erik,
>>
>> This is interesting. I had ideas about this, and actually made a
>> prototype (in Shenandoah) that worked somewhat, and it sounds exactly
>> like what you have in mind. Nice to see the approaches aligning :-)
>>
>> I put it on the shelf back then though, because the time to scan
>> threads was very much dominated by other things then, e.g.
>> time-to-safepoint. Not sure how things look now? I am not aware that we
>> had improvements on the TTSP front..
>>
>> Cheers,
>> Roman
>>
>>
>>> The plan is to utilize something I call a stack watermark barrier. It
>>> involves rewriting the thread-local handshakes of today to use a
>>> conditional branch. For the polls we do today on returns, the conditional
>>> branch will compare the stack pointer (rsp) to a thread-local value.
>>> This improved polling scheme can trap thread execution for 1)
>>> safepoints, 2) handshakes, 3) returns to unprocessed frames. This
>>> type of barrier allows the GC to concurrently slide the watermark
>>> (disarm frames) without having to poke at it.
>>>
>>> There won’t be any foreseeable issues with weak memory ordering here.
>>> You are gonna love it!
>>>
>>> /Erik
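
A rough sketch of the return poll described above, written as a plain Java simulation rather than taken from any HotSpot code. Everything named here is hypothetical: `pollWord` stands in for the thread-local value, `TRAP_ALL` and `DISARMED` are assumed encodings for "stop at every poll" and "never stop", and stack addresses are made-up numbers on a downward-growing stack. The point is only that one unsigned compare of the stack pointer against one thread-local word can cover safepoints, handshakes, and returns into unprocessed frames.

```java
import java.util.ArrayList;
import java.util.List;

public class StackWatermarkSketch {
    // Hypothetical poll-word encodings. Comparisons are unsigned, as for addresses.
    static final long TRAP_ALL = 0L;   // every sp is above 0, so every return traps
    static final long DISARMED = -1L;  // largest unsigned value, so no return traps

    // Stand-in for the thread-local value; volatile because a GC thread may write it.
    static volatile long pollWord = DISARMED;
    static final List<Long> trapped = new ArrayList<>();

    // Return poll: one compare and one conditional branch.
    // callerSp is the stack pointer after the frame has been popped.
    static void pollOnReturn(long callerSp) {
        if (Long.compareUnsigned(callerSp, pollWord) > 0) {
            slowPath(callerSp);
        }
    }

    // Slow path: a real VM would block for a safepoint or handshake here, or
    // process the frame being returned into. This sketch only records the trap
    // and raises the watermark past the caller frame.
    static void slowPath(long callerSp) {
        trapped.add(callerSp);
        if (pollWord != TRAP_ALL) {
            pollWord = callerSp;
        }
    }

    // Frames at 0x0d00 (innermost) up to 0x1000 (outermost).
    static List<Long> demo() {
        trapped.clear();
        pollWord = 0x0e00L;      // frames at or below 0x0e00 are already processed
        pollOnReturn(0x0e00L);   // processed caller: fast path
        pollOnReturn(0x0f00L);   // unprocessed caller: slow path
        pollWord = 0x1000L;      // GC slides the watermark concurrently
        pollOnReturn(0x1000L);   // already processed by the GC: fast path
        return new ArrayList<>(trapped);
    }

    public static void main(String[] args) {
        for (long sp : demo()) {
            System.out.println("trapped on return to sp=0x" + Long.toHexString(sp));
        }
    }
}
```

In this model only the return to 0x0f00 takes the slow path. The last return goes straight through because the watermark was moved by a plain store to the thread-local word, without the running thread being interrupted.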
>>>
>>>> On 9 Jan 2020, at 18:06, Stuart Monteith
>>>> <stuart.monteith at linaro.org> wrote:
>>>>
>>>> Hi Per,
>>>>    I'm curious as to how we'll implement this on Aarch64, the concern
>>>> being how this would be made efficient with a weak memory model. Are
>>>> you looking at using return barriers?
>>>>
>>>> BR,
>>>>    Stuart
>>>>
>>>>> On Tue, 7 Jan 2020 at 12:59, Per Liden <per.liden at oracle.com> wrote:
>>>>>
>>>>> While we're on this topic I thought I could mention that part of the
>>>>> plans to make ZGC a true sub-millisecond max pause time GC includes
>>>>> removing thread stacks completely from the GC root set. I.e. in such a
>>>>> world ZGC will not scan any thread stacks (virtual or not) during STW,
>>>>> instead they will be scanned concurrently.
>>>>>
>>>>> But we're not quite there yet...
>>>>>
>>>>> cheers,
>>>>> Per
>>>>>
>>>>>> On 12/19/19 1:52 PM, Ron Pressler wrote:
>>>>>>
>>>>>> This is a very good question. Virtual thread stacks (which are
>>>>>> actually
>>>>>> continuation stacks from the VM’s perspective) are not GC roots,
>>>>>> and so are
>>>>>> not scanned as part of the STW root-scanning. How and when they
>>>>>> are scanned
>>>>>> is one of the core differences between the default implementation
>>>>>> and the
>>>>>> new one, enabled with -XX:+UseContinuationChunks.
>>>>>>
>>>>>> Virtual threads shouldn’t make any impact on time-to-safepoint, and,
>>>>>> depending on the implementation, they may or may not make an impact
>>>>>> on STW young-generation collection. How the different implementations
>>>>>> impact ZGC/Shenandoah, the non-generational low-pause collectors
>>>>>> is yet
>>>>>> to be explored and addressed. I would assume that their current
>>>>>> impact
>>>>>> is that they simply crash them :)
>>>>>>
>>>>>> - Ron
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 19 December 2019 at 11:40:03, Holger Hoffstätte
>>>>>>> (holger at applied-asynchrony.com)
>>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Quick question - not sure if this is an actual issue or something
>>>>>>> that has
>>>>>>> been addressed yet; pointers to docs welcome.
>>>>>>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>>>>>>> There have been some amazing advances in GC with Shenandoah and
>>>>>>> ZGC recently,
>>>>>>> but their low pause times directly depend on the ability to
>>>>>>> quickly reach
>>>>>>> safepoints and take stack snapshots for liveness analysis.
>>>>>>> How will this work with potentially one or two orders of
>>>>>>> magnitude more
>>>>>>> virtual thread stacks? If I understand correctly TTSP should only
>>>>>>> depend
>>>>>>> on the number of carrier threads (which fortunately should be
>>>>>>> much lower
>>>>>>> than in legacy designs), but somehow the virtual stacks still need
>>>>>>> to be
>>>>>>> scraped..right?
>>>>>>>
>>>>>>> thanks,
>>>>>>> Holger
>