Why is my workload not behaving better with gen-zgc?
Ragnar Rova
ragnar.rova at gmail.com
Thu Feb 16 17:59:53 UTC 2023
Thanks a lot for the detailed answer. I am really starting to think
that we need to spend time looking at the application code to reduce
allocations. I was hoping for a quicker win from switching GC
algorithms, but it is pretty clear we are allocating far more than the
task requires, and better-written application code would fix that. It
might have been bearable for G1GC, but some of the allocation patterns
are problematic and can be rewritten to be much more GC-friendly. I
did try your suggestions and still saw critical allocation stalls,
although the P50, P75, etc. showed an improvement in my benchmark with
-XX:ConcGCThreads=10 -XX:ZAllocationSpikeTolerance=10. But I am now
focusing on allocation profiling and resolving this on the application
side.
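For reference, one way to do that allocation profiling is with JDK Flight Recorder, which ships with the JDK. A sketch (MyApp and the file name are placeholders, not from this thread; the jdk.ObjectAllocationSample event assumes JDK 16 or later):

```shell
# Record a 5-minute JFR profile while running the benchmark
# (MyApp and recording.jfr are placeholder names).
java -XX:StartFlightRecording=duration=300s,filename=recording.jfr \
     -XX:+UseZGC MyApp

# Afterwards, print the sampled allocation events to find the
# hottest allocation sites (the jfr tool is bundled with the JDK).
jfr print --events jdk.ObjectAllocationSample recording.jfr
```

Tools like JDK Mission Control can also open the same recording and aggregate allocations by stack trace.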
On Wed, Feb 15, 2023 at 11:07 AM Erik Osterlund
<erik.osterlund at oracle.com> wrote:
>
> Hi Ragnar,
>
> Thanks for taking generational ZGC for a spin, and providing feedback.
>
> First to the simple answer: you are getting allocation stalls throughout the entire program run. That’s why latencies are hurting. What that means is that you are allocating memory faster than the concurrent GC can free it up, so the application threads have to wait until the GC is done.
>
> As for why the allocation stalls are occurring, I think there are multiple factors.
>
> 1. The allocation rate of your program looks very “spiky”. Sometimes it isn’t allocating much at all, but at other times it allocates at more than 5 GB per second. This is problematic application behaviour for any concurrent GC, as it becomes very hard to predict when a concurrent GC cycle should start so that it can finish in a timely manner, before you run out of heap memory.
>
> 2. The maximum number of GC threads that ZGC allows itself to use (-XX:ConcGCThreads=…) defaults to 25% of the machine, which in your case is 3 threads. When your allocation rate is that high, it seems like we need more threads to keep up. For example, G1 in your example uses 10 threads.
>
> Typically, in real applications, we want to keep the number of GC threads down, because they risk preempting application threads, which causes latency jitter for the program. The goal of concurrent GC isn’t just to move work that was traditionally done in GC pauses into a concurrent phase for its own sake; we do it so the GC can stay out of the way and let the program run more undisturbed. But if we used as many GC threads as there are cores on the machine, and the concurrent GC in practice had to throw the application off all the cores to finish on time, then we wouldn’t have won much by making that GC concurrent.
>
> If you would like to see better numbers in your stress test, and to better understand what is happening, one thing you can play around with is setting -XX:ConcGCThreads=10 so that ZGC gets as many threads as G1 does in your example. You can also set -XX:ZAllocationSpikeTolerance=… to something higher than the default of 2 (maybe 10?) to accommodate the large fluctuations in allocation rate, and you can configure -XX:SoftMaxHeapSize=… to some value below -Xmx to make the GC trigger earlier. This might make the numbers look better. But in general, this type of application behaviour is probably not what we are currently catering for, really.
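Putting those suggestions together, an invocation might look like the following. This is only a sketch: MyApp, the heap sizes, and the log file name are placeholders, and -XX:+ZGenerational assumes one of the early-access generational ZGC builds such as 21-genzgc+1-8.

```shell
# Hypothetical command line combining the tuning knobs suggested above.
# MyApp, 8g/6g, and gc.log are placeholders, not values from this thread.
java -XX:+UseZGC -XX:+ZGenerational \
     -XX:ConcGCThreads=10 \
     -XX:ZAllocationSpikeTolerance=10 \
     -Xmx8g -XX:SoftMaxHeapSize=6g \
     -Xlog:gc:file=gc.log \
     MyApp
```

The -Xlog:gc output should also show whether allocation stalls still occur after the change.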
>
> Thanks,
> /Erik
>
> > On 15 Feb 2023, at 07:17, Ragnar Rova <ragnar.rova at gmail.com> wrote:
> >
> > Hello!
> >
> > Unsure if this is the right place to ask, but the gen-zgc page did ask
> > for feedback, so here goes:
> >
> > I have a workload that is performing worse latency-wise when measuring
> > actual service response times compared to G1, both P99 all the way
> > down to P50 as measured in a short 5-minute load test. Benchmark was
> > run on aarch64 on macOS using 21-genzgc+1-8.
> >
> > I have benchmarked this workload before on OpenJDK 19 using stock ZGC
> > and also here G1 performs better. My speculation was that ZGC
> > performed worse due to the generational nature of the allocations my
> > workload has. But, when recently testing 21-genzgc+1-8, G1 still
> > outperforms in terms of both throughput (that was expected) and
> > latency profile (less so).
> >
> > What kind of measurements would be needed to discuss this?
> >
> > Attaching gc logs for a 5-minute benchmark run if that is helpful for
> > genzgc and G1 from the same JDK build.
> > <gen-zgc.log.gz><g1gc.log.gz>
>