Hi all,

First, I would like to offer my thanks for your work in creating such an impressive tool for the JVM. We recently integrated ZGC into one of our Java RPC services at Canva and it yielded marked improvements in latency. GC pauses are now effectively non-existent on the service, which has allowed us to push the envelope further on the SLAs we can offer our clients.

For context: the service is an RPC instance built on the Finagle framework, running in a containerized ECS environment with 4 GiB of memory and 1 vCPU. The instance runs on OpenJDK 13.0.8 (Zulu) with the following JVM options enabled:

* -XX:InitialRAMPercentage=75
* -XX:MaxRAMPercentage=75
* -XX:StartFlightRecording:maxsize=100M,settings=profile,filename=/tmp/record.jfr,dumponexit=true
* -XX:-UseBiasedLocking
* -XX:+UnlockExperimentalVMOptions
* -XX:+UseZGC

Additionally, the GC was tuned further to deal with an issue that I will discuss below. These options were added based on the advice provided to Sergey Tselovalnikov in "Experience with ZGC" <https://mail.openjdk.java.net/pipermail/zgc-dev/2020-March/000880.html>:

* -XX:ZAllocationSpikeTolerance=3
* -XX:-ZUncommit
* -XX:SoftMaxHeapSize=2G

With these options, the JVM is allocated the following out of the container's memory:

* Heap: 3 GiB
* Metaspace: 102 MB
* Total: ~3.1 GiB

At peak, each instance processes around 230 requests per second. Each request is accepted, put through some minor transformations using Jackson, and logged to a persistence layer. Each request can create a lot of objects, all very short-lived, since the transform goes through a few intermediate objects. This leads to a consistent average allocation rate of around 112.47 MB/s. However, similar to the problem Sergey described in his message, we have also faced issues with ZGC in this environment, even with the suggested tuning options applied.
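For reference, here are all of the above flags assembled into a single launch command. This is just a sketch of our setup: the artifact name ("service.jar") is a placeholder, not our actual service.

```shell
# All JVM flags listed above, combined. "service.jar" is a placeholder.
java \
  -XX:InitialRAMPercentage=75 \
  -XX:MaxRAMPercentage=75 \
  -XX:StartFlightRecording:maxsize=100M,settings=profile,filename=/tmp/record.jfr,dumponexit=true \
  -XX:-UseBiasedLocking \
  -XX:+UnlockExperimentalVMOptions \
  -XX:+UseZGC \
  -XX:ZAllocationSpikeTolerance=3 \
  -XX:-ZUncommit \
  -XX:SoftMaxHeapSize=2G \
  -jar service.jar
```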
The issue is that ZGC infrequently spends a large portion of time in "Subphase: Concurrent Classes Unload", and uses the majority of the container's CPU allocation during these phases. Oftentimes there will be sequential long phases that slow the instance down for 15-30 seconds, causing an increase in latency. I've provided two example lines from the GC logs and can provide more if needed:

[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Classes Unload 13082.074 / 13082.074 392.443 / 13082.074 272.530 / 28215.293 272.530 / 28215.293 ms
[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Mark 410.997 / 410.997 147.171 / 410.997 154.437 / 610.867 154.437 / 610.867 ms

These long phases lead to allocation stalls, as well as a lot of "ICBufferFull" logs during the subphase:

[2021-08-24T14:17:40.898+0000][info][gc ] Allocation Stall (server-1) 372.703ms

The average time taken for this phase is usually much lower, around 350 ms. I am hoping there is a lever I can pull to perhaps increase the overall average time of this phase in exchange for decreasing its maximum time. Our current mitigation is to provide a bigger CPU overhead to reduce these phases, and to investigate how we can reduce the allocation rate to alleviate the problem.

I am happy to provide further information and logs on the scenario if they would help. It has been a great experience to dig deeper into this issue and learn more about the JVM and ZGC, and I look forward to continuing this learning. Thanks for working on this awesome GC!

Cheers,
Jack
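P.S. In case it's useful to anyone digging through similar logs: here is a rough sketch of how the worst-case subphase time can be pulled out of "gc,stats" lines like the ones above. This is a hypothetical helper, not part of our service; the line layout (four "avg / max" pairs followed by "ms") is assumed from the samples in this message.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZgcStatsGrep {
    // Assumed layout: "Subphase: <name> a/m a/m a/m a/m ms",
    // where the final pair's "m" is the max over the whole run.
    private static final Pattern LINE = Pattern.compile(
            "Subphase:\\s+(.+?)\\s+((?:[0-9.]+\\s*/\\s*[0-9.]+\\s*){4})ms");

    /** Returns the overall max (last "/ max" value) for the named subphase, or -1 if absent. */
    static double overallMaxMs(List<String> logLines, String subphase) {
        double max = -1;
        for (String line : logLines) {
            Matcher m = LINE.matcher(line);
            if (m.find() && m.group(1).trim().equals(subphase)) {
                // Tokens alternate number, "/", number; the last token is the overall max.
                String[] tokens = m.group(2).trim().split("\\s+");
                max = Math.max(max, Double.parseDouble(tokens[tokens.length - 1]));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Classes Unload "
                        + "13082.074 / 13082.074 392.443 / 13082.074 272.530 / 28215.293 272.530 / 28215.293 ms",
                "[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Mark "
                        + "410.997 / 410.997 147.171 / 410.997 154.437 / 610.867 154.437 / 610.867 ms");
        System.out.println(overallMaxMs(sample, "Concurrent Classes Unload")); // prints 28215.293
    }
}
```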