Hi Jack,

Thank you for reporting this issue. The symptoms you are describing are new to me, but I can imagine what has caused them. If my suspicion is correct, then https://bugs.openjdk.java.net/browse/JDK-8267972 is causing the problems, and its fix will be in JDK 17. Would you be interested in trying the latest JDK 17 EA build to see if the symptoms have disappeared?

A more detailed explanation: before this fix, a Java thread could call through an inline cache (a speculative call site) to a dying nmethod (JIT-compiled method), trap into the VM upon seeing that it is dying, and then fix up the inline cache to point to the same dying nmethod again, consuming an ICStub (the thing causing the safepoint messages you see). The next call would do the same thing, rinse and repeat, until a concurrent GC thread got around to fixing it up properly. This could cause a situation where Java threads spin up an unreasonable number of ICStubs at a high pace, one per call, until the GC gets there.

I can imagine that in a constrained container environment in particular, this problem could get even worse, as the GC might take a long time to get there. In fact, the shortage of ICStubs would make the GC even slower, because it would need to request more ICStubs instead of making progress. So the Java threads and GC threads would continuously step on each other's toes, making the problem worse.

So yeah, hopefully my suspicion is correct, in which case the issue has hopefully already been fixed. Please let me know if you cannot check with the latest 17 EA release, or if my suspicion was wrong and there are more issues lurking; then I would need to take a closer look at this.

Hope this helps.

Thanks,
/Erik
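(A minimal sketch of how such a trial run might look; the install path and jar name are placeholders, not anything from this thread, and the -Xlog selectors are standard unified-logging tags:)

```shell
# Point at a JDK 17 EA build (path is a placeholder) and confirm the version.
JAVA_HOME=/opt/jdk-17-ea
"$JAVA_HOME/bin/java" -version

# Relaunch the service with GC and safepoint logging to a file, so the
# ICBufferFull safepoint entries can be compared against the old JDK 13 run.
# (ZGC is a product feature in 17, so UnlockExperimentalVMOptions is not needed.)
"$JAVA_HOME/bin/java" \
  -XX:+UseZGC \
  -Xlog:gc*,safepoint:file=/tmp/gc.log \
  -jar service.jar   # placeholder for the actual service jar
```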
-----Original Message----- From: zgc-dev <zgc-dev-retn@openjdk.java.net> On Behalf Of Jack Caperon Sent: Wednesday, 25 August 2021 03:54 To: zgc-dev@openjdk.java.net Subject: Long Running Subphase: Concurrent Classes Unload
Hi all,
I would like to offer my thanks for your work in creating such an impressive tool for the JVM. We recently integrated ZGC into our Java RPC service at Canva, and it yielded marked improvements in latency. GC pauses are now effectively non-existent on the service, which has allowed us to push the envelope further on the SLAs we can offer our clients.
To give some context, the service is an RPC instance running inside the Finagle RPC framework. It runs in a containerized ECS environment with 4 GiB of memory and 1 vCPU. The instance runs on OpenJDK 13.0.8 (Zulu) with the following JVM options enabled:
"-XX:InitialRAMPercentage=75", "-XX:MaxRAMPercentage=75", "-XX:StartFlightRecording:maxsize=100M,settings=profile,filename=/tmp/record.jfr,dumponexit=true", "-XX:-UseBiasedLocking", "-XX:+UnlockExperimentalVMOptions", "-XX:+UseZGC",
Additionally, the GC was further tuned to deal with an issue that I will discuss later in the message. These options were added based on the advice provided to Sergey Tselovalnikov in "Experience with ZGC <https://mail.openjdk.java.net/pipermail/zgc-dev/2020-March/000880.html>".
"-XX:ZAllocationSpikeTolerance=3", "-XX:-ZUncommit", "-XX:SoftMaxHeapSize=2G",
Based on these options, the JVM's allocation from the container's memory is:
* Heap: 3 GB
* Metaspace: 102 MB
* Total: 3.1 GB
At peak, each instance processes around 230 requests per second. Each request is accepted and logged to a persistence layer after some minor transformations using Jackson. In general, each request creates a lot of very short-lived objects, since the process performs the transform using a few intermediate objects. This leads to a consistent allocation rate averaging around 112.47 MB/s. However, similar to the problem Sergey mentioned in his message, we have also faced issues with ZGC in this environment, even after applying the suggested tuning options.
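(As an aside, a per-thread allocation rate like the one quoted above can be sanity-checked in-process using HotSpot's com.sun.management.ThreadMXBean extension. The sketch below is hypothetical: the class name and the synthetic workload are made up for illustration, not our actual service code.)

```java
import java.lang.management.ManagementFactory;

public class AllocRate {
    // Measures bytes allocated by the current thread while running `work`,
    // via HotSpot's com.sun.management.ThreadMXBean.getThreadAllocatedBytes.
    static long allocatedBytes(Runnable work) {
        com.sun.management.ThreadMXBean mx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = mx.getThreadAllocatedBytes(id);
        work.run();
        return mx.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Simulate a burst of short-lived per-request garbage (sizes are made up).
        long bytes = allocatedBytes(() -> {
            for (int i = 0; i < 10_000; i++) {
                byte[] tmp = new byte[1024]; // intermediate object, dies immediately
                tmp[0] = 1;
            }
        });
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("allocated %d bytes in %.3f s (%.1f MB/s)%n",
                bytes, seconds, bytes / seconds / 1e6);
    }
}
```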
The issue appears to be that ZGC infrequently spends a large portion of time in "Subphase: Concurrent Classes Unload" and uses the majority of the container's CPU allocation during these phases. Often there will be several long phases in a row that slow the instance down for 15-30 seconds, causing an increase in latency. I've provided two example times from the GC logs and can provide more if needed.
[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Classes Unload 13082.074 / 13082.074 392.443 / 13082.074 272.530 / 28215.293 272.530 / 28215.293 ms
[2021-08-24T14:17:45.464+0000][info][gc,stats ] Subphase: Concurrent Mark 410.997 / 410.997 147.171 / 410.997 154.437 / 610.867 154.437 / 610.867 ms
These long phases lead to allocation stalls as well as a lot of "ICBufferFull" safepoint logs during the subphase:
[2021-08-24T14:17:40.898+0000][info][gc ] Allocation Stall (server-1) 372.703ms
The average time taken for this phase is much lower, around 350 ms. I am hoping there is a lever I can pull to increase the average time spent in this phase in exchange for decreasing its maximum. Our current workaround is to provision more CPU headroom to shorten these phases, while we investigate how to reduce the allocation rate to alleviate the problem. I am happy to provide further information and logs on the scenario if they will help.
It has been a great experience to dig deeper into this issue and learn more about the JVM and ZGC and I look forward to continuing this learning.
Thanks for working on this awesome GC!
Cheers, Jack