Hi Jack,

Thank you for reporting this issue. The symptoms you are describing are new to me, but I can imagine what has caused them. If my suspicion is correct, then https://bugs.openjdk.java.net/browse/JDK-8267972 is causing the problems, and its fix will be in JDK 17. Would you be interested in trying the latest JDK 17 EA build to see if the symptoms have disappeared?

A more detailed explanation: before this fix, a Java thread could call through an inline cache (a speculative call site) to a dying nmethod (JIT-compiled method), trap into the VM upon seeing that it is dying, and then fix up the inline cache to point to the same dying nmethod again, consuming an ICStub (the thing causing the safepoint messages you see). The next call would do the same thing, rinse and repeat, until a concurrent GC thread got around to fixing it up properly. This could cause a situation where Java threads spin up an unreasonable number of ICStubs at a high pace, one per call, until the GC gets there.

I can imagine that in a constrained container environment in particular, this problem could get even worse, as the GC might take a long time to get there. In fact, the shortage of ICStubs would make the GC even slower, because it would need to request more ICStubs instead of making progress. So the Java threads and GC threads would continuously step on each other's toes, making the problem worse.

So yeah, hopefully my suspicion is correct, in which case the issue has hopefully already been fixed. Please let me know if you cannot check with the latest 17 EA release, or if my suspicion was wrong and there are more issues lurking; then I would need to take a closer look at this.

Hope this helps.

Thanks,
/Erik
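(A minimal sketch of how such a trial run might look; the install path and jar name are placeholders, not anything from this thread, and the -Xlog selectors are standard unified-logging tags:)

```shell
# Point at a JDK 17 EA build (path is a placeholder) and confirm the version.
JAVA_HOME=/opt/jdk-17-ea
"$JAVA_HOME/bin/java" -version

# Relaunch the service with GC and safepoint logging to a file, so the
# ICBufferFull safepoint entries can be compared against the old JDK 13 run.
# (ZGC is a product feature in 17, so UnlockExperimentalVMOptions is not needed.)
"$JAVA_HOME/bin/java" \
  -XX:+UseZGC \
  -Xlog:gc*,safepoint:file=/tmp/gc.log \
  -jar service.jar   # placeholder for the actual service jar
```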
-----Original Message----- From: zgc-dev <zgc-dev-retn@openjdk.java.net> On Behalf Of Jack Caperon Sent: Wednesday, 25 August 2021 03:54 To: zgc-dev@openjdk.java.net Subject: Long Running Subphase: Concurrent Classes Unload
Hi all,
I would like to offer my thanks for your work in creating such an impressive tool for the JVM. We recently integrated ZGC into our Java RPC service at Canva, and it yielded marked improvements in latency. GC pauses are now effectively non-existent on the service, which has allowed us to push the envelope further on the SLAs we can offer our clients.
To give some context, the service is an RPC instance running inside the Finagle RPC framework. It runs in a containerized ECS environment with 4 GiB of memory and 1 vCPU. The instance runs on OpenJDK 13.0.8 (Zulu) with the following JVM options enabled:
"-XX:InitialRAMPercentage=75", "-XX:MaxRAMPercentage=75", "-XX:StartFlightRecording:maxsize=100M,settings=profile,filename=/tmp/record.jfr,dumponexit=true", "-XX:-UseBiasedLocking", "-XX:+UnlockExperimentalVMOptions", "-XX:+UseZGC",
Additionally, the GC was further tuned to deal with an issue that I will discuss later in the message. These options were added based on the advice provided to Sergey Tselovalnikov in "Experience with ZGC <https://mail.openjdk.java.net/pipermail/zgc-dev/2020-March/000880.html>".
"-XX:ZAllocationSpikeTolerance=3", "-XX:-ZUncommit", "-XX:SoftMaxHeapSize=2G",
Based on these options, the JVM's allocation from the container's memory is:
* Heap: 3 GB
* Metaspace: 102 MB
* Total: 3.1 GB
At peak, each instance processes around 230 requests per second. Each request is accepted and logged to a persistence layer after some minor transformations using Jackson. In general, each request creates a lot of very short-lived objects, since the process performs the transform using a few intermediate objects. This leads to a consistent allocation rate averaging around 112.47 MB/s. However, similar to the problem Sergey mentioned in his message, we have also faced issues with ZGC in this environment, even after applying the suggested tuning options.
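(As an aside, a per-thread allocation rate like the one quoted above can be sanity-checked in-process using HotSpot's com.sun.management.ThreadMXBean extension. The sketch below is hypothetical: the class name and the synthetic workload are made up for illustration, not our actual service code.)

```java
import java.lang.management.ManagementFactory;

public class AllocRate {
    // Measures bytes allocated by the current thread while running `work`,
    // via HotSpot's com.sun.management.ThreadMXBean.getThreadAllocatedBytes.
    static long allocatedBytes(Runnable work) {
        com.sun.management.ThreadMXBean mx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = mx.getThreadAllocatedBytes(id);
        work.run();
        return mx.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Simulate a burst of short-lived per-request garbage (sizes are made up).
        long bytes = allocatedBytes(() -> {
            for (int i = 0; i < 10_000; i++) {
                byte[] tmp = new byte[1024]; // intermediate object, dies immediately
                tmp[0] = 1;
            }
        });
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("allocated %d bytes in %.3f s (%.1f MB/s)%n",
                bytes, seconds, bytes / seconds / 1e6);
    }
}
```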
The issue appears to be that ZGC infrequently spends a large portion of time in "Subphase: Concurrent Classes Unload" and uses the majority of the container's CPU allocation during these phases. Often there will be several long phases in a row that slow the instance down for 15-30 seconds, causing an increase in latency. I've provided two example times from the GC logs and can provide more if needed.
[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Classes Unload 13082.074 / 13082.074 392.443 / 13082.074 272.530 / 28215.293 272.530 / 28215.293 ms
[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Mark 410.997 / 410.997 147.171 / 410.997 154.437 / 610.867 154.437 / 610.867 ms
These long phases lead to allocation stalls as well as a lot of "ICBufferFull" safepoint logs during the subphase:
[2021-08-24T14:17:40.898+0000][info][gc ] Allocation Stall (server-1) 372.703ms
The average time taken for this phase is much lower, around 350 ms. I am hoping there is a lever I can pull to increase the average time spent in this phase in exchange for decreasing its maximum. Our current workaround is to provision more CPU headroom to shorten these phases, while we investigate how to reduce the allocation rate to alleviate the problem. I am happy to provide further information and logs on the scenario if they will help.
It has been a great experience to dig deeper into this issue and learn more about the JVM and ZGC and I look forward to continuing this learning.
Thanks for working on this awesome GC!
Cheers, Jack