From per.liden at oracle.com Mon May 18 21:45:59 2020
From: per.liden at oracle.com (Per Liden)
Date: Mon, 18 May 2020 23:45:59 +0200
Subject: JVM stalls around uncommitting
In-Reply-To: <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com>
References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com>
Message-ID: <68eabfdf-b897-ad91-2b5c-597bbcfb7533@oracle.com>

Hi,

On 4/1/20 9:14 AM, Per Liden wrote:
> Hi,
>
> On 3/31/20 9:59 PM, Zoltan Baranyi wrote:
>> Hi ZGC Team,
>>
>> I run benchmarks against our application using ZGC on heaps on the
>> scale of a few hundred GB. In the beginning everything goes smoothly, but
>> eventually I experience very long JVM stalls, sometimes longer than one
>> minute. According to the JVM log, reaching safepoints occasionally takes
>> a very long time, matching the duration of the stalls I experience.
>>
>> After a few iterations, I started looking at uncommitting and learned
>> that the way ZGC performs uncommitting - flushing the pages, punching
>> holes, removing blocks from the backing file - can be expensive [1] when
>> uncommitting tens or more than a hundred GB of memory. The trace-level
>> heap logs confirmed that uncommitting blocks of this size takes many
>> seconds. After disabling uncommitting, my benchmark runs without the huge
>> stalls and the overall experience with ZGC is quite good.
>>
>> Since uncommitting is done asynchronously to the mutators, I expected it
>> not to interfere with them. My understanding is that flushing,
>> bookkeeping and uncommitting are done under a mutex [2], and contention
>> on that can be the source of the stalls I see, such as when there is a
>> demand to commit memory while uncommitting is taking place. Can you
>> confirm whether the above explanation makes sense to you? If so, is
>> there a cure for this that I couldn't find? Like a time bound or a cap
>> on the amount of memory that can be uncommitted in one go.
>
> Yes, uncommitting is relatively expensive. And it's also true that there
> is a potential for lock contention affecting mutators. That can be
> improved in various ways. Like you say, uncommitting in smaller chunks,
> or possibly by releasing the lock while doing the actual syscall.
>
> If you still want uncommit to happen, one thing to try is using large
> pages (-XX:+UseLargePages), since committing/uncommitting large pages is
> typically less expensive.
>
> This issue is on our radar, so we intend to improve this going forward.

Just a follow up to let you know that a fix for the uncommit issue is
now out for review here:

http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-May/029754.html

With these patches the interference you experienced from uncommitting
memory should be gone. As a bonus, the normal allocation/commit path is
now also a lot more streamlined and does all expensive work (e.g.
committing memory) outside of the allocation lock.

cheers,
Per

>
> cheers,
> Per
>
>>
>> This is an example log captured during a stall:
>>
>> [1778,704s][info ][safepoint] Safepoint "ZMarkStart", Time since last:
>> 34394880194 ns, Reaching safepoint: 247308 ns, At safepoint: 339634 ns,
>> Total: 586942 ns
>> [1833,707s][trace][gc,heap ] Uncommitting memory: 459560M-459562M (2M)
>> [...]
>> [... zillions of continuous uncommitting log lines ...]
>> [...]
>> [1846,076s][trace][gc,heap ] Uncommitting memory: 84M-86M (2M)
>> [1846,076s][info ][gc,heap ] Capacity: 528596M(86%)->386072M(63%),
>> Uncommitted: 142524M
>> [1846,076s][trace][gc,heap ] Uncommit Timeout: 1s
>> [1846,078s][info ][safepoint] Safepoint "Cleanup", Time since last:
>> 18001682918 ns, Reaching safepoint: 49371131055 ns, At safepoint: 252559
>> ns, Total: 49371383614 ns
>>
>> In the above case TTSP is 49s, while the uncommitting lines cover only
>> 13s. The TTSP would indicate that the safepoint request was signaled
>> at 1797s, but the log is empty between 1778s and 1833s. If my
>> understanding above is correct, could it be that waiting for the
>> mutex, flushing etc. takes that much time and is just not visible in
>> the log?
>>
>> If needed, I can dig out more details, since I can reliably reproduce
>> the stalls.
>>
>> My environment is OpenJDK 14 running on Linux 5.2.9 with these
>> arguments: "-Xmx600G -XX:+HeapDumpOnOutOfMemoryError
>> -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseNUMA
>> -XX:+AlwaysPreTouch -Xlog:gc,safepoint,gc+heap=trace:jvm.log".
>>
>> Best regards,
>> Zoltan
>>
>> [1]
>> https://github.com/openjdk/zgc/blob/d90d2b1097a9de06d8b6e3e6f2f6bd4075471fa0/src/hotspot/os/linux/gc/z/zPhysicalMemoryBacking_linux.cpp#L566-L573
>>
>> [2]
>> https://github.com/openjdk/zgc/blob/d90d2b1097a9de06d8b6e3e6f2f6bd4075471fa0/src/hotspot/share/gc/z/zPageAllocator.cpp#L685-L711
>>
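[A minimal sketch of the workarounds discussed in the thread above, until a build with the uncommit fix is available. It assumes a JDK 13 or later ZGC build, where uncommit can be switched off with -XX:-ZUncommit or its frequency reduced with -XX:ZUncommitDelay=<seconds>; "MyApp" is a placeholder for the actual application, and on JDK 14 ZGC still requires -XX:+UnlockExperimentalVMOptions.]

  # Option 1: disable uncommit entirely, so committed memory stays stable:
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx600G \
       -XX:-ZUncommit \
       -Xlog:gc,safepoint,gc+heap=trace:jvm.log MyApp

  # Option 2: keep uncommit but use large pages, as suggested above
  # (requires huge pages to be configured on the host):
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx600G \
       -XX:+UseLargePages \
       -Xlog:gc,safepoint,gc+heap=trace:jvm.log MyApp
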
From raell at web.de Mon May 25 13:13:41 2020
From: raell at web.de (raell at web.de)
Date: Mon, 25 May 2020 15:13:41 +0200
Subject: Stacks, safepoints, snapshotting and GC
Message-ID: 

Hi,

I just wanted to ask if there are concrete release plans for
concurrent thread scanning/root processing in ZGC?

Regards
Ralph

On 7 January 2020 at 12:58:48, Per Liden wrote:

> While we're on this topic I thought I could mention that part of the
> plans to make ZGC a true sub-millisecond max pause time GC includes
> removing thread stacks completely from the GC root set. I.e. in such a
> world ZGC will not scan any thread stacks (virtual or not) during STW,
> instead they will be scanned concurrently.

> But we're not quite there yet...

> cheers,
> Per

> On 12/19/19 1:52 PM, Ron Pressler wrote:
>>
>> This is a very good question. Virtual thread stacks (which are actually
>> continuation stacks from the VM's perspective) are not GC roots, and so are
>> not scanned as part of the STW root-scanning. How and when they are scanned
>> is one of the core differences between the default implementation and the
>> new one, enabled with -XX:+UseContinuationChunks.
>>
>> Virtual threads shouldn't make any impact on time-to-safepoint, and,
>> depending on the implementation, they may or may not make an impact
>> on STW young-generation collection. How the different implementations
>> impact ZGC/Shenandoah, the non-generational low-pause collectors, is yet
>> to be explored and addressed. I would assume that their current impact
>> is that they simply crash them :)
>>
>> - Ron
>>
>>
>> On 19 December 2019 at 11:40:03, Holger Hoffstätte (holger at applied-asynchrony.com) wrote:
>>>
>>> Hi,
>>>
>>> Quick question - not sure if this is an actual issue or something that has
>>> been addressed yet; pointers to docs welcome.
>>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>>> There have been some amazing advances in GC with Shenandoah and ZGC recently,
>>> but their low pause times directly depend on the ability to quickly reach
>>> safepoints and take stack snapshots for liveness analysis.
>>> How will this work with potentially one or two orders of magnitude more
>>> virtual thread stacks? If I understand correctly, TTSP should only depend
>>> on the number of carrier threads (which fortunately should be much lower
>>> than in legacy designs), but somehow the virtual stacks still need to be
>>> scraped, right?
>>>
>>> thanks,
>>> Holger
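[For anyone wanting to put numbers on the time-to-safepoint questions raised above: the safepoint log tag used earlier in this thread already reports TTSP per safepoint. A small sketch, with the log file name and "MyApp" as placeholders:]

  # Record safepoint statistics while the application runs:
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
       -Xlog:safepoint:safepoint.log MyApp

  # Each safepoint line contains "Reaching safepoint: <ns>", i.e. the TTSP:
  grep 'Reaching safepoint' safepoint.log
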
From per.liden at oracle.com Mon May 25 18:09:09 2020
From: per.liden at oracle.com (Per Liden)
Date: Mon, 25 May 2020 20:09:09 +0200
Subject: Stacks, safepoints, snapshotting and GC
In-Reply-To: 
References: 
Message-ID: <22ab8134-3ef4-2008-4b51-ffc823c1c3fa@oracle.com>

This work is tracked by JEP 376 (https://openjdk.java.net/jeps/376).
This JEP is not yet targeted to a specific release. However, you can
already try it out today by building a JDK from
https://github.com/openjdk/zgc, which has the latest patches for
concurrent thread stack scanning.

cheers,
Per

On 5/25/20 3:13 PM, raell at web.de wrote:
> Hi,
>
> I just wanted to ask if there are concrete release plans for
> concurrent thread scanning/root processing in ZGC?
>
> Regards
> Ralph
>
>
> On 7 January 2020 at 12:58:48, Per Liden wrote:
>
>> While we're on this topic I thought I could mention that part of the
>> plans to make ZGC a true sub-millisecond max pause time GC includes
>> removing thread stacks completely from the GC root set. I.e. in such a
>> world ZGC will not scan any thread stacks (virtual or not) during STW,
>> instead they will be scanned concurrently.
>
>> But we're not quite there yet...
>
>> cheers,
>> Per
>
>> On 12/19/19 1:52 PM, Ron Pressler wrote:
>>>
>>> This is a very good question. Virtual thread stacks (which are actually
>>> continuation stacks from the VM's perspective) are not GC roots, and so are
>>> not scanned as part of the STW root-scanning. How and when they are scanned
>>> is one of the core differences between the default implementation and the
>>> new one, enabled with -XX:+UseContinuationChunks.
>>>
>>> Virtual threads shouldn't make any impact on time-to-safepoint, and,
>>> depending on the implementation, they may or may not make an impact
>>> on STW young-generation collection. How the different implementations
>>> impact ZGC/Shenandoah, the non-generational low-pause collectors, is yet
>>> to be explored and addressed. I would assume that their current impact
>>> is that they simply crash them :)
>>>
>>> - Ron
>>>
>>>
>>> On 19 December 2019 at 11:40:03, Holger Hoffstätte (holger at applied-asynchrony.com) wrote:
>>>>
>>>> Hi,
>>>>
>>>> Quick question - not sure if this is an actual issue or something that has
>>>> been addressed yet; pointers to docs welcome.
>>>> How does (or will) Loom impact stack snapshotting and TTSP latency?
>>>> There have been some amazing advances in GC with Shenandoah and ZGC recently,
>>>> but their low pause times directly depend on the ability to quickly reach
>>>> safepoints and take stack snapshots for liveness analysis.
>>>> How will this work with potentially one or two orders of magnitude more
>>>> virtual thread stacks? If I understand correctly, TTSP should only depend
>>>> on the number of carrier threads (which fortunately should be much lower
>>>> than in legacy designs), but somehow the virtual stacks still need to be
>>>> scraped, right?
>>>>
>>>> thanks,
>>>> Holger
>
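[Building the https://github.com/openjdk/zgc repository mentioned above follows the normal OpenJDK build flow; a rough sketch under the usual assumptions: the boot JDK path and build configuration name are environment-specific, and depending on the state of the branch ZGC may still require -XX:+UnlockExperimentalVMOptions.]

  git clone https://github.com/openjdk/zgc
  cd zgc
  bash configure --with-boot-jdk=/path/to/recent/jdk
  make images

  # Run the freshly built JDK with ZGC and basic GC logging:
  ./build/*/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions \
      -XX:+UseZGC -Xlog:gc -version
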