Long Running Subphase: Concurrent Classes Unload

Erik Osterlund erik.osterlund at oracle.com
Wed Aug 25 08:35:15 UTC 2021


Hi Jack,

Thank you for reporting this issue. The symptoms you are describing are new
to me. However, I can imagine what has caused this. If my suspicion is correct,
then this is https://bugs.openjdk.java.net/browse/JDK-8267972 causing the
problems, and its fix will be in JDK 17. Would you be interested in trying the
latest JDK 17 EA build to see whether the symptoms disappear?

A more detailed explanation: before this fix, a Java thread could call through an inline
cache (speculative callsite) into a dying nmethod (JIT-compiled method), trap into the VM
on seeing that it is dying, and then fix the inline cache up to point to the same dying
nmethod again, consuming an ICStub (the resource whose exhaustion triggers the ICBufferFull
safepoint messages you see). The next call would do the same thing, rinse and repeat, until
a concurrent GC thread got around to fixing the callsite up properly.
This could leave Java threads burning through an unreasonable number of ICStubs at a high
pace, one per call, until the GC got there. I can imagine that in a constrained container
environment this problem could get even worse, as the GC might take a long time to get
there. In fact, the shortage of ICStubs would also make the GC much slower, because it
would need to request more ICStubs instead of making progress. So the Java threads and GC
threads would continuously step on each other's toes, making the problem worse.
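
To make the shape of that loop concrete, here is a small self-contained sketch of the
pattern as described above; the pool size, the timing and all of the names are made up
for illustration, and this is of course not the actual HotSpot code:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ICStubChurnSketch {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean callsitePointsToDyingNmethod = new AtomicBoolean(true);
        AtomicInteger freeStubs = new AtomicInteger(64); // bounded pool, standing in for the ICStub buffer
        int stubsConsumed = 0;
        int bufferFullEvents = 0;

        // Stand-in for the concurrent GC thread that eventually fixes the callsite up properly.
        Thread gcThread = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            callsitePointsToDyingNmethod.set(false);
        });
        gcThread.start();

        // Stand-in for the Java thread: every call through the stale callsite traps into
        // the VM, re-points the inline cache at the same dying nmethod, and consumes a stub.
        while (callsitePointsToDyingNmethod.get()) {
            if (freeStubs.decrementAndGet() < 0) {
                bufferFullEvents++;  // this is where an ICBufferFull safepoint would be requested
                freeStubs.set(64);   // pretend the safepoint refilled the pool
            }
            stubsConsumed++;
        }
        gcThread.join();
        System.out.println("stubs consumed before the GC fixed the callsite: " + stubsConsumed);
        System.out.println("ICBufferFull-style refills needed: " + bufferFullEvents);
    }
}

The point is simply that every call through the stale callsite consumes another stub until
the GC thread finally flips the callsite, which lines up with the ICBufferFull churn and
safepoint activity you are seeing.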

So yeah, hopefully my suspicion is correct, in which case the issue has already
been fixed. Please let me know if you cannot check with the latest JDK 17 EA release,
or if my suspicion was wrong and there are more issues lurking; in that case I would
need to have a closer look at this.

Hope this helps.

Thanks,
/Erik

> -----Original Message-----
> From: zgc-dev <zgc-dev-retn at openjdk.java.net> On Behalf Of Jack Caperon
> Sent: Wednesday, 25 August 2021 03:54
> To: zgc-dev at openjdk.java.net
> Subject: Long Running Subphase: Concurrent Classes Unload
> 
> Hi all,
> 
> I would like to offer my thanks for your work in creating such an impressive
> tool for the JVM. We recently integrated ZGC into our Java RPC service at Canva,
> and it yielded marked improvements in latency. GC pauses are now effectively
> non-existent on the service, and this has allowed us to push the envelope
> further on the SLAs we can offer our clients.
> 
> To give some context, the service is an RPC instance running inside the
> Finagle RPC framework. It runs in a containerized ECS environment with 4
> GiB of memory and 1 vCPU. The instance runs with the following OpenJDK
> 13.0.8 Zulu JVM options enabled:
> 
> "-XX:InitialRAMPercentage=75",
> "-XX:MaxRAMPercentage=75",
> "-
> XX:StartFlightRecording:maxsize=100M,settings=profile,filename=/tmp/reco
> rd.jfr,dumponexit=true",
> "-XX:-UseBiasedLocking",
> "-XX:+UnlockExperimentalVMOptions",
> "-XX:+UseZGC",
> 
> Additionally, the GC was tuned further to deal with an issue that I will discuss
> later in the message. These options were added based on the advice provided to
> Sergey Tselovalnikov in "Experience with ZGC
> <https://mail.openjdk.java.net/pipermail/zgc-dev/2020-March/000880.html>".
> 
> "-XX:ZAllocationSpikeTolerance=3",
> "-XX:-ZUncommit",
> "-XX:SoftMaxHeapSize=2G",
> 
> Based on these options, the JVM is allocated the following from the container's memory:
> * Heap: 3 GB
> * Metaspace: 102 MB
> * Total: 3.1 GB
> 
> Each instance processes around 230 requests per second at peak. Each request
> is accepted and logged to a persistence layer after some minor transformations
> using Jackson. In general, each request creates a lot of very short-lived
> objects, since the process performs the transform using a few intermediate
> objects. This leads to a consistent average allocation rate of around
> 112.47 MB/s. However, similar to the problem Sergey mentioned in his message,
> we have also faced issues with ZGC in this environment even after applying the
> suggested tuning options.
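> 
> To make the workload a bit more concrete, each request is roughly shaped like
> the sketch below; the class and field names here are placeholders rather than
> our actual code:
> 
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.databind.node.ObjectNode;
> 
> // Placeholder illustration of the per-request transform: parse the request,
> // reshape it via a few short-lived intermediate objects, then serialize it
> // for the persistence layer.
> public class RequestTransformSketch {
>     private static final ObjectMapper MAPPER = new ObjectMapper();
> 
>     String transform(String requestJson) throws Exception {
>         JsonNode parsed = MAPPER.readTree(requestJson);  // short-lived node tree
>         ObjectNode record = MAPPER.createObjectNode();   // another short-lived object
>         record.set("payload", parsed);
>         record.put("receivedAtMillis", System.currentTimeMillis());
>         return MAPPER.writeValueAsString(record);        // handed to the persistence layer
>     }
> }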
> 
> The issue appears to be that ZGC infrequently spends a large portion of time
> in "Subphase: Concurrent Classes Unload" and uses the majority of the
> container's CPU allocation during these phases. Often there will be several
> long phases in a row that slow the instance down for 15-30 seconds, causing
> an increase in latency. I've provided two example times from the GC logs and
> can provide more if needed.
> 
> [2021-08-24T14:17:45.464+0000][info][gc,stats    ]    Subphase: Concurrent Classes Unload    13082.074 / 13082.074   392.443 / 13082.074   272.530 / 28215.293   272.530 / 28215.293   ms
> [2021-08-24T14:17:45.464+0000][info][gc,stats    ]    Subphase: Concurrent Mark                 410.997 / 410.997     147.171 / 410.997     154.437 / 610.867     154.437 / 610.867     ms
> 
> These long phases lead to allocation stalls as well as a lot of "ICBufferFull"
> safepoint messages during the subphase:
> 
> [2021-08-24T14:17:40.898+0000][info][gc          ] Allocation Stall (server-1) 372.703ms
> 
> The average time taken for this phase is typically much lower, around 350 ms.
> I am hoping there is a lever I can pull, perhaps trading a higher average time
> for this phase in exchange for a lower maximum. Our current approach is to
> provision more CPU headroom to reduce these long phases and to investigate how
> we can reduce the allocation rate to alleviate the problem. I am happy to
> provide further information and logs on the scenario if they would help.
> 
> It has been a great experience to dig deeper into this issue and learn more
> about the JVM and ZGC, and I look forward to continuing that learning.
> 
> Thanks for working on this awesome GC!
> 
> Cheers,
> Jack
> 


