Experience with ZGC

Thu Mar 19 23:34:45 UTC 2020

Hi, Erik!

Thank you, these knobs can come in really handy!
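
To check my understanding of the two flags, here is a rough model of how
they interact. It is a simplification for illustration only, not ZGC's
actual heuristic code (the real heuristics are more involved). The class
and method names, the 500 MB/s allocation rate, and the 0.8 s GC duration
are all hypothetical; only the 1.5 GB max heap and ~400 MB live set come
from our setup.

```java
// Simplified illustration only; NOT the real ZGC heuristics.
final class ZgcKnobModel {

    /** Seconds until the (soft) heap limit would be reached at the predicted allocation rate. */
    static double secondsUntilFull(long usedBytes, long maxHeapBytes, long softMaxHeapBytes,
                                   double avgAllocBytesPerSec, double spikeTolerance) {
        // The GC tries to stay below the soft limit, so plan against the smaller of the two.
        long limit = Math.min(maxHeapBytes, softMaxHeapBytes);
        long free = Math.max(0L, limit - usedBytes);
        // The spike tolerance scales the observed rate to leave room for bursts.
        double predictedRate = avgAllocBytesPerSec * spikeTolerance;
        return free / predictedRate;
    }

    /** Start a cycle once the predicted time to fill the heap no longer covers one GC cycle. */
    static boolean shouldStartGc(long usedBytes, long maxHeapBytes, long softMaxHeapBytes,
                                 double avgAllocBytesPerSec, double spikeTolerance,
                                 double expectedGcSeconds) {
        return secondsUntilFull(usedBytes, maxHeapBytes, softMaxHeapBytes,
                                avgAllocBytesPerSec, spikeTolerance) <= expectedGcSeconds;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long max = 1536 * mb;      // 1.5 GB max heap, as in our setup
        long used = 400 * mb;      // ~400 MB stable heap, as in our setup
        double rate = 500.0 * mb;  // hypothetical average allocation rate
        double gcSeconds = 0.8;    // hypothetical duration of one GC cycle

        System.out.println("default tolerance 2:   "
                + shouldStartGc(used, max, max, rate, 2.0, gcSeconds));
        System.out.println("tolerance 5:           "
                + shouldStartGc(used, max, max, rate, 5.0, gcSeconds));
        System.out.println("soft max heap 1024 MB: "
                + shouldStartGc(used, max, 1024 * mb, rate, 2.0, gcSeconds));
    }
}
```

In this model, either raising the tolerance or lowering the soft limit
makes the collector start earlier for the same heap usage, which is the
effect I would hope for ahead of a burst.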

> if there is any way I could try one of the spiky workloads, I would be
> delighted to have a closer look at it.

I can't share them, unfortunately, but the idea is that some of the
services have scheduled tasks that run every N (typically 5-15) minutes.
These tasks can be CPU-intensive and allocation-heavy, for instance
rebuilding indexes. With our previous GC, such tasks do not noticeably
interfere with serving requests. With ZGC, such services fail to keep up
with the incoming requests, causing most of them to fail.
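
In lieu of the real services, here is a minimal sketch of the shape of
such a workload: a steady trickle of small request-like allocations plus a
periodic burst that builds a large structure and then drops it. All names
and sizes are made up for illustration, and the short default period is
only there so the demo finishes quickly; the real tasks run every 5-15
minutes and do real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a service with a periodic allocation-heavy task.
final class SpikyWorkloadSketch {

    /** Builds a throwaway "index" of byte arrays and returns the total bytes allocated. */
    static long allocationBurst(int chunks, int chunkBytes) {
        List<byte[]> index = new ArrayList<>(chunks);
        long total = 0;
        for (int i = 0; i < chunks; i++) {
            byte[] chunk = new byte[chunkBytes];
            index.add(chunk); // keep every chunk live until the burst ends
            total += chunk.length;
        }
        return total;
    }

    /** Runs steady small allocations with periodic bursts; returns the number of bursts run. */
    static int run(long totalMillis, long periodMillis, int chunks, int chunkBytes)
            throws InterruptedException {
        AtomicInteger bursts = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            allocationBurst(chunks, chunkBytes);
            bursts.incrementAndGet();
        }, 0, periodMillis, TimeUnit.MILLISECONDS);

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalMillis);
        while (System.nanoTime() < deadline) {
            byte[] request = new byte[16 * 1024]; // stand-in for per-request garbage
            request[0] = 1;
            Thread.sleep(1);
        }
        scheduler.shutdownNow();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
        return bursts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo defaults: 1 s total, a 16 MiB burst every 250 ms.
        int bursts = run(1000, 250, 16, 1 << 20);
        System.out.println("bursts executed: " + bursts);
    }
}
```

Running something like this with -XX:+UseZGC and GC logging enabled
(-Xlog:gc*) should make it visible whether the bursts lead to allocation
stalls, though I have not verified that this toy version reproduces what
we see in production.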

On Tue, 17 Mar 2020 at 07:41, Erik Österlund <erik.osterlund at oracle.com>
wrote:

> Hi Sergey,
>
> Thank you for sharing your experience using ZGC. I am glad to hear that
> you like it in general.
>
> As for the more spiky workloads, it is possible to tame the GC by tuning
> two knobs:
> 1) -XX:ZAllocationSpikeTolerance
> This flag sets a factor of how much we can expect the allocation rate to
> fluctuate. The default is 2.
> Higher values will trigger GC earlier, anticipating that allocation
> rates will spike more.
>
> 2) -XX:SoftMaxHeapSize
> The GC will try to keep the heap below this size. So by setting it lower
> than the MaxHeapSize, you
> can accommodate more spiky allocation rate and heap residency better.
>
> I hope this helps you. Having said that, I would love for the defaults
> to be able to catch such issues
> better automatically, so if there is any way I could try one of the
> spiky workloads, I would be delighted
> to have a closer look at it.
>
> Thanks,
> /Erik
>
> On 2020-03-14 04:20, Sergey Tselovalnikov wrote:
> > Hi,
> >
> > I met Chad (https://twitter.com/chadarimura) a few weeks ago at UnVoxxed
> > Hawaii unconference and mentioned that we use ZGC at Canva, and he
> > encouraged me to share the details. So I wanted to share our experience
> > here. I hope sharing our success with ZGC can encourage other people to
> > try it out.
> >
> > At Canva, we use ZGC for our API Gateway (AFE for short). ZGC
> > helped us reduce GC pauses from around 30-50ms with occasional spikes
> > to hundreds of ms down to only 1-2ms pauses [0]. GC pauses used to cause
> > issues with the TCP backlog filling up which would result in further
> > queuing inside the app, and would require allocating more
> > threads/connections to clear up the queue. These two graphs show the
> > difference we observed [1].
> >
> > To give some background, AFE runs on a few dozen c5.large AWS
> > instances. The application runs on OpenJDK 13 with 1.5 GB max heap
> > size, and a stable heap size around 400 MB. It uses Jetty with
> > non-blocking APIs as a web framework, and Finagle as an RPC framework.
> > When fully warmed up, less than 10% of CPU time is spent in GC threads.
> > Enabling ZGC didn't require any special tuning; however, we increased
> > the max heap size, which was previously lower, following the
> > recommendations [2].
> >
> > There were a few issues that we faced:
> >
> > * Occasional crashes prior to 13.0.2
> > Prior to JDK 13.0.2, we observed a number of crashes that would happen
> > after running the app for around 14 hours. The symptoms were very similar
> > to the ones in JDK-8230565. Looking at the crash logs, we found that the
> > crashes would happen when one of the application methods is being
> > recompiled from level 3 to level 4, so we had to mitigate this issue.
> > However, after updating to 13.0.2, we haven't seen them anymore.
> >
> > * Occasional allocation stalls
> > We're still seeing occasional "Allocation Stall" events, which are a bit
> > harder to debug. They don't happen very often, and we're still collecting
> > data, but it seems that at least in some cases they are preceded by a
> > number of "ICBufferFull" safepoints.
> >
> > * The results depend on the load profile
> > ZGC worked really well for us for AFE, whose workload consists of a
> > large number of requests flowing through at a relatively steady rate.
> > However, we didn't have much success applying it to some of the
> > services with somewhat spikier workloads. For such workloads, ZGC
> > resulted in a large number of allocation stalls, after which the app
> > wasn't able to keep up with the incoming requests anymore.
> >
> > Thanks for working on this awesome GC!
> >
> > [0] https://user-images.githubusercontent.com/1780970/76673536-116fd000-659e-11ea-8832-4aefa06f02b2.png
> > [1] https://user-images.githubusercontent.com/1780970/76673708-93acc400-659f-11ea-903e-0a9d50ef154d.png
> > [2] https://wiki.openjdk.java.net/display/zgc/Main#Main-SettingHeapSize
> >
>
>

-- 
Cheers,
Sergey Tselovalnikov