Experience with ZGC

Thu Mar 19 17:28:13 UTC 2020

Hey Sergey, thanks for the interesting read. I'd be interested to hear more about how you are tracking the Allocation Stalls. Are you post-processing the GC logs to look for stalls, or do you have some other mechanism?
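
To make the log post-processing option concrete, here is a minimal sketch of what it could look like. It assumes the stall lines in the GC log have the shape "Allocation Stall (<thread name>) <duration>ms" after the usual unified-logging decorations; that shape, the class name AllocationStallScanner, and its methods are assumptions for illustration, not a description of how Canva actually tracks stalls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or of any existing tooling.
final class AllocationStallScanner {

    // Assumed line shape: "... Allocation Stall (<thread name>) <duration>ms"
    private static final Pattern STALL =
            Pattern.compile("Allocation Stall \\((.+)\\) (\\d+(?:\\.\\d+)?)ms");

    // One stall event: the stalled thread's name and how long it waited.
    static final class Stall {
        final String thread;
        final double millis;

        Stall(String thread, double millis) {
            this.thread = thread;
            this.millis = millis;
        }
    }

    // Returns every stall found in the given log lines, in log order.
    static List<Stall> parse(List<String> lines) {
        List<Stall> stalls = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STALL.matcher(line);
            if (m.find()) {
                stalls.add(new Stall(m.group(1), Double.parseDouble(m.group(2))));
            }
        }
        return stalls;
    }

    // Sums the stall durations, e.g. to alert when a time window exceeds a budget.
    static double totalMillis(List<Stall> stalls) {
        double total = 0.0;
        for (Stall s : stalls) {
            total += s.millis;
        }
        return total;
    }
}
```

The lines could come from something like Files.readAllLines(Path.of("gc.log")), where gc.log is a placeholder for wherever the GC log is written. Grouping the parsed stalls by thread name or by time window would be a natural next step for spotting patterns.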

On 3/13/20, 20:22, "zgc-dev on behalf of Sergey Tselovalnikov" <zgc-dev-bounces at openjdk.java.net on behalf of sergeicelov at gmail.com> wrote:

    Hi,

    I met Chad (https://twitter.com/chadarimura) a few weeks ago at UnVoxxed
    Hawaii unconference and mentioned that we use ZGC at Canva, and he
    encouraged me to share the details. So I wanted to share our experience
    here. I hope sharing our success with ZGC can encourage other people to
    try it out.

    At Canva, we use ZGC for our API Gateway (hereafter AFE for short). ZGC
    helped us reduce GC pauses from around 30-50ms, with occasional spikes
    to hundreds of ms, down to only 1-2ms pauses [0]. GC pauses used to cause
    issues with the TCP backlog filling up, which would result in further
    queuing inside the app and would require allocating more
    threads/connections to clear the queue. These two graphs show the
    difference we observed [1].

    To give some background, AFE runs on a few dozen c5.large AWS
    instances. The application runs on OpenJDK 13 with a 1.5 GB max heap
    size and a stable heap size of around 400 MB. It uses Jetty with
    non-blocking APIs as a web framework, and Finagle as an RPC framework.
    When fully warmed up, less than 10% of CPU time is spent in GC threads.
    Enabling ZGC didn't require any special tuning; however, following the
    recommendations [2], we increased the max heap size, which was
    previously lower.

    There were a few issues that we faced:

    * Occasional crashes prior to 13.0.2
    Prior to JDK 13.0.2, we observed a number of crashes that would happen
    after running the app for around 14 hours. The symptoms were very similar
    to the ones in JDK-8230565. Looking at the crash logs, we found that the
    crashes would happen when one of the application methods was being
    recompiled from level 3 to level 4, so we had to mitigate this issue.
    However, since updating to 13.0.2, we haven't seen them anymore.

    * Occasional allocation stalls
    We're still seeing occasional "Allocation Stall" events, which are a bit
    harder to debug. They don't happen very often, and we're still collecting
    data, but it seems that at least in some cases they're preceded by a
    number of "ICBufferFull" safepoints.

    * The results depend on the load profile
    ZGC worked really well for AFE, whose workload consists of a
    large number of requests flowing through at a relatively steady rate.
    However, we didn't have much success applying it to some of the
    services with somewhat spikier workloads. For such workloads, ZGC resulted
    in a large number of allocation stalls, after which the app wasn't able to
    keep up with the incoming requests anymore.

    Thanks for working on this awesome GC!

    [0]
    https://user-images.githubusercontent.com/1780970/76673536-116fd000-659e-11ea-8832-4aefa06f02b2.png
    [1]
    https://user-images.githubusercontent.com/1780970/76673708-93acc400-659f-11ea-903e-0a9d50ef154d.png
    [2] https://wiki.openjdk.java.net/display/zgc/Main#Main-SettingHeapSize

    --
    Cheers,
    Sergey Tselovalnikov