JVM stalls around uncommitting
Per Liden
per.liden at oracle.com
Fri Apr 3 07:35:55 UTC 2020
Hi Zoltan,
On 4/3/20 1:27 AM, Zoltán Baranyi wrote:
> Hi Per,
>
> Thank you for confirming the issue and for recommending large pages. I
> re-ran my benchmarks with large pages and it gave me a 25-30% performance
> boost, which is a bit more than I expected. My benchmarks run on a
> 600G heap with a 1.5-2GB/s allocation rate on a 40-core machine, so ZGC is
> busy. Since a significant part of the workload is ZGC itself, I assume -
> besides the higher TLB hit rate - this gain comes from managing the ZPages
> more effectively on large pages.
A 25-30% improvement is indeed more than I would have expected. ZGC's
internal handling of ZPages is the same regardless of the underlying
page size, but as you say, you'll get better TLB hit-rate and the
mmap/fallocate syscalls become a lot less expensive.
Another reason for the boost might be that ZGC's NUMA-awareness, until
recently, worked much better when using large pages. But this has now
been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649.
Btw, which JDK version are you using?
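For readers following the thread, the options discussed above can be sketched as a single launch line. This is only an illustration: the heap size and application name are placeholders, and since ZGC was still experimental at the time of this thread (JDK 13/14), the unlock flag is needed. The ZGC flag names themselves (-XX:+UseLargePages, -XX:-ZUncommit, -XX:ZUncommitDelay) are real HotSpot options.

```shell
# Sketch only: -Xmx600g and MyApp are illustrative placeholders.
# -XX:+UseLargePages     : cheaper commit/uncommit syscalls, better TLB hit rate
# -XX:-ZUncommit         : disable uncommitting entirely (as in the benchmark),
#                          or keep it enabled and tune -XX:ZUncommitDelay=<s>
# -Xlog:gc+heap=trace    : the trace-level heap logs referred to above
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
     -Xmx600g \
     -XX:+UseLargePages \
     -XX:-ZUncommit \
     -Xlog:gc+heap=trace \
     MyApp
```

Note that uncommitting is also implicitly disabled when the minimum heap size equals the maximum (-Xms equal to -Xmx), since there is then nothing to give back to the OS.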
>
> I have a good experience overall, nice to see ZGC getting more and more
> mature.
Good to hear. Thanks for the feedback!
/Per
>
> Cheers,
> Zoltan
>
> On Wed, Apr 1, 2020 at 9:15 AM Per Liden <per.liden at oracle.com> wrote:
>
>> Hi,
>>
>> On 3/31/20 9:59 PM, Zoltan Baranyi wrote:
>>> Hi ZGC Team,
>>>
>>> I ran benchmarks against our application using ZGC on heaps on the
>>> scale of a few hundred GB. In the beginning everything goes smoothly, but
>>> eventually I experience very long JVM stalls, sometimes longer than one
>>> minute. According to the JVM log, reaching safepoints occasionally takes
>>> a very long time, matching the duration of the stalls I experience.
>>>
>>> After a few iterations, I started looking at uncommitting and learned
>>> that the way ZGC performs uncommitting - flushing the pages, punching
>>> holes, removing blocks from the backing file - can be expensive [1] when
>>> uncommitting tens or more than a hundred GB of memory. The trace-level
>>> heap logs confirmed that uncommitting blocks of this size takes many
>>> seconds. After disabling uncommitting, my benchmark runs without the huge
>>> stalls and the overall experience with ZGC is quite good.
>>>
>>> Since uncommitting is done asynchronously to the mutators, I expected it
>>> not to interfere with them. My understanding is that flushing,
>>> bookkeeping and uncommitting are done under a mutex [2], and contention
>>> on that mutex can be the source of the stalls I see, such as when there
>>> is a demand to commit memory while uncommitting is taking place. Can you
>>> confirm whether this explanation makes sense to you? If so, is there a
>>> cure for it that I couldn't find? Like a time bound, or a cap on the
>>> amount of memory that can be uncommitted in one go.
>>
>> Yes, uncommitting is relatively expensive. And it's also true that there
>> is a potential for lock contention affecting mutators. That could be
>> improved in various ways, for example by uncommitting in smaller chunks,
>> as you say, or by releasing the lock while doing the actual syscall.
>>
>> If you still want uncommit to happen, one thing to try is using large
>> pages (-XX:+UseLargePages), since committing/uncommitting large pages is
>> typically less expensive.
>>
>> This issue is on our radar, so we intend to improve this going forward.
>>
>> cheers,
>> Per
>>
>>