JVM stalls around uncommitting

Thu Apr 2 23:27:41 UTC 2020

Hi Per,

Thank you for confirming the issue and for recommending large pages. I
re-ran my benchmarks with large pages and got a 25-30% performance
boost, which is a bit more than I expected. My benchmarks run on a
600 GB heap with a 1.5-2 GB/s allocation rate on a 40-core machine, so
ZGC is busy. Since a significant part of the workload is ZGC itself, I
assume that, besides the higher TLB hit rate, the gain comes from
managing the ZPages more effectively on large pages.
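
In case it is useful to anyone repeating this, here is a minimal sketch
for checking from inside the application which flags the running JVM
actually ended up with. It assumes a HotSpot JVM that exposes
com.sun.management.HotSpotDiagnosticMXBean; the class name VmFlagCheck
and the "unavailable" fallback are my own. ZUncommit is the flag behind
-XX:-ZUncommit and is not present on every build, hence the fallback.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the JDK: reads HotSpot flag values at runtime.
final class VmFlagCheck {

    /** Returns the flag's current value, or "unavailable" if this JVM does not know it. */
    static String flag(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unavailable"; // not a HotSpot-style JVM
            }
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // getVMOption throws IllegalArgumentException for unknown flags.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UseZGC", "UseLargePages", "ZUncommit"}) {
            System.out.println(name + " = " + flag(name));
        }
    }
}
```

Checking UseLargePages this way seems safer to me than trusting the
command line, since the JVM can fall back to regular pages if it cannot
reserve large ones.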

My experience has been good overall, and it is nice to see ZGC getting
more and more mature.

Cheers,
Zoltan

On Wed, Apr 1, 2020 at 9:15 AM Per Liden <per.liden at oracle.com> wrote:

> Hi,
>
> On 3/31/20 9:59 PM, Zoltan Baranyi wrote:
> > Hi ZGC Team,
> >
> > I run benchmarks against our application using ZGC on heaps in the
> > few-hundred-GB range. In the beginning everything goes smoothly, but
> > eventually I experience very long JVM stalls, sometimes longer than one
> > minute. According to the JVM log, reaching safepoints occasionally takes
> > a very long time, matching the duration of the stalls I experience.
> >
> > After a few iterations, I started looking at uncommitting and learned
> > that the way ZGC performs uncommitting - flushing the pages, punching
> > holes, removing blocks from the backing file - can be expensive [1] when
> > uncommitting tens or more than a hundred GB of memory. The trace-level
> > heap logs confirmed that uncommitting blocks of this size takes many
> > seconds. After I disabled uncommitting, my benchmark runs without the
> > huge stalls and the overall experience with ZGC is quite good.
> >
> > Since uncommitting is done asynchronously to the mutators, I expected it
> > not to interfere with them. My understanding is that flushing,
> > bookkeeping and uncommitting are done under a mutex [2], and contention
> > on that mutex can be the source of the stalls I see, for example when
> > there is demand to commit memory while uncommitting is taking place. Can
> > you confirm whether this explanation makes sense to you? If so, is there
> > a remedy that I couldn't find, such as a time bound or a cap on the
> > amount of memory that can be uncommitted in one go?
>
> Yes, uncommitting is relatively expensive. And it's also true that there
> is a potential for lock contention affecting mutators. That can be
> improved in various ways: as you say, by uncommitting in smaller chunks,
> or possibly by releasing the lock while doing the actual syscall.
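
To make sure I understand the two ideas, here is how I picture them
combined, as a rough Java model. This is only a sketch and not ZGC's
code (which is C++); ChunkedUncommitter, chunkBytes and releaseToOs are
names I made up, and the callback merely stands in for the expensive
syscall.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongConsumer;

// Hypothetical model: bookkeeping under the lock, slow release outside it.
final class ChunkedUncommitter {
    private final ReentrantLock lock = new ReentrantLock(); // stands in for the heap lock
    private final long chunkBytes;                          // assumed cap per iteration
    private final LongConsumer releaseToOs;                 // stands in for the syscall
    private long uncommittable;                             // bytes eligible for uncommit

    ChunkedUncommitter(long uncommittable, long chunkBytes, LongConsumer releaseToOs) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        this.uncommittable = uncommittable;
        this.chunkBytes = chunkBytes;
        this.releaseToOs = releaseToOs;
    }

    /** Uncommits everything eligible, one chunk at a time; returns the chunk count. */
    int uncommitAll() {
        int chunks = 0;
        while (true) {
            long claimed;
            lock.lock();
            try {
                // Only the cheap bookkeeping happens while holding the lock.
                claimed = Math.min(chunkBytes, uncommittable);
                uncommittable -= claimed;
            } finally {
                lock.unlock();
            }
            if (claimed == 0) {
                return chunks;
            }
            // The slow part runs unlocked, so a thread that needs to commit
            // memory can take the lock between (and during) releases.
            releaseToOs.accept(claimed);
            chunks++;
        }
    }

    /** Bytes still waiting to be uncommitted. */
    long remaining() {
        lock.lock();
        try {
            return uncommittable;
        } finally {
            lock.unlock();
        }
    }
}
```

With a 1 GB chunk, a 100 GB uncommit becomes 100 short lock holds
instead of one long one, so a committing thread should wait at most for
the bookkeeping of a single chunk.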
>
> If you still want uncommit to happen, one thing to try is using large
> pages (-XX:+UseLargePages), since committing/uncommitting large pages is
> typically less expensive.
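
For sizing the huge page pool before trying this, a back-of-the-envelope
calculation along these lines may help. It assumes 2 MB huge pages (the
usual default on Linux/x86_64) and treats my 600G heap as 600 GiB; the
5% headroom is an arbitrary figure of mine, and HugePagePool is just a
name for the sketch.

```java
// Hypothetical helper: how many huge pages to reserve for a given heap size.
final class HugePagePool {

    /** Pages needed to cover heapBytes, plus headroomPercent extra, both rounded up. */
    static long pagesFor(long heapBytes, long pageBytes, long headroomPercent) {
        long pages = (heapBytes + pageBytes - 1) / pageBytes; // round up to whole pages
        long extra = (pages * headroomPercent + 99) / 100;    // round up the headroom
        return pages + extra;
    }

    public static void main(String[] args) {
        long heap = 600L << 30; // 600 GiB heap (assumed reading of "600G")
        long page = 2L << 20;   // 2 MiB huge page size (assumption)
        System.out.println("huge pages to reserve: " + pagesFor(heap, page, 5));
    }
}
```

For these numbers the heap alone needs 307200 pages, and 322560 with the
5% headroom.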
>
> This issue is on our radar, so we intend to improve this going forward.
>
> cheers,
> Per
>
>