Why is my workload not behaving better with gen-zgc?

Peter Booth peter_booth at me.com
Tue Feb 21 08:26:43 UTC 2023


Ragnar,

That makes sense. I did a consulting job helping a financial services company migrate about two dozen apps on 1300 hosts from the Oracle JVM to a different JDK with a pauseless collector. The rule of thumb we ended up with was to double the physical heap size, trading GC pauses for additional CPU usage. Median latencies increased, but the 99%, 99.9%, 99.99%, 99.999%, and max latencies dropped dramatically.

I’m still surprised how few shops realize this is possible, and by how many smart programmers waste time on unnecessary garbage collector tuning when this is a solved problem.

Peter


> On Feb 19, 2023, at 1:53 PM, Ragnar Rova <ragnar.rova at gmail.com> wrote:
> 
> Thanks Peter. Indeed, 32G does change things considerably. When I
> combine a 32G heap with -XX:ConcGCThreads=10 I get a pretty decent
> result.
> 
> My service response times in a 5-minute load test look something like this
> with gen-zgc, a 32G heap, and 10 GC threads:
> 
>  Thread Stats   Avg      Stdev     Max   +/- Stdev
>    Latency    32.69ms   15.99ms 494.53ms   90.87%
>    Req/Sec   147.89     30.82   220.00     68.13%
>  Latency Distribution
>     50%   28.76ms
>     75%   34.62ms
>     90%   45.30ms
>     99%   84.54ms
> 
> Compared to G1, with a 32GB heap and -XX:MaxGCPauseMillis=100:
> 
>  Thread Stats   Avg      Stdev     Max   +/- Stdev
>    Latency    34.34ms   25.04ms 537.59ms   89.74%
>    Req/Sec   152.46     47.19   230.00     62.63%
>  Latency Distribution
>     50%   26.42ms
>     75%   29.02ms
>     90%   60.24ms
>     99%  125.78ms
> 
> Attaching the gc logs as well for both collectors
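> 
> In case it is useful, the two runs above boil down to these flag sets
> (modulo the exact switch for enabling generational ZGC in this EA build):
> 
>    gen-zgc:  -Xmx32g -XX:ConcGCThreads=10
>    G1:       -Xmx32g -XX:MaxGCPauseMillis=100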
> 
>> On Thu, Feb 16, 2023 at 7:19 PM Peter Booth <peter_booth at me.com> wrote:
>> 
>> Ragnar,
>> 
>> I was looking at the g1gc.log, just to get a feel for your app’s behavior.
>> After 90 seconds the app’s old gen is stable at around 10GB. It’s remarkable
>> how almost all of the allocation is short-lived objects.
>> One constant with low-pause collectors is that they need headroom to work with.
>> Are you able to repeat the same test with a 32GB heap?
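>> Something like
>> 
>>    -Xmx32g -Xms32g
>> 
>> should do (the matching -Xms is optional; it just takes heap resizing
>> out of the picture).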
>> 
>> Peter
>> 
>> 
>> 
>>> On Feb 16, 2023, at 12:59 PM, Ragnar Rova <ragnar.rova at gmail.com> wrote:
>>> 
>>> Thanks a lot for the detailed answer. I am really starting to think
>>> that we need to spend time on the application code itself to reduce
>>> allocations. I was hoping for a quicker win from switching GC
>>> algorithms, but it is pretty clear that we allocate far more than the
>>> task would need with better-written app code. That might have been
>>> bearable for G1, but some of the allocation patterns are problematic
>>> and could be rewritten to be much more GC-friendly. I did try your
>>> suggestions and still saw critical allocation stalls, although P50,
>>> P75 etc. did improve in my benchmark with -XX:ConcGCThreads=10
>>> -XX:ZAllocationSpikeTolerance=10. So I am now leaning towards
>>> allocation profiling and fixing this on the application side.
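>>> 
>>> For the allocation profiling I will probably just start with JDK
>>> Flight Recorder, something along these lines (the event names are the
>>> standard JFR allocation events):
>>> 
>>>    -XX:StartFlightRecording=duration=300s,settings=profile,filename=alloc.jfr
>>> 
>>> followed by
>>> 
>>>    jfr print --events jdk.ObjectAllocationInNewTLAB,jdk.ObjectAllocationOutsideTLAB alloc.jfr
>>> 
>>> or opening the recording in JDK Mission Control to see the hottest
>>> allocation sites.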
>>> 
>>> On Wed, Feb 15, 2023 at 11:07 AM Erik Osterlund
>>> <erik.osterlund at oracle.com> wrote:
>>>> 
>>>> Hi Ragnar,
>>>> 
>>>> Thanks for taking generational ZGC for a spin, and providing feedback.
>>>> 
>>>> First, the simple answer: you are getting allocation stalls throughout the entire program run. That’s why latencies are hurting. What that means is that you are allocating memory faster than the concurrent GC can free it, so the application threads have to wait until the GC is done.
>>>> 
>>>> As for why the allocation stalls are occurring, I think there are multiple factors.
>>>> 
>>>> 1. The allocation rate of your program looks very “spiky”. Sometimes it isn’t allocating all that much, but at other times it allocates at >5G per second. This is problematic application behaviour for any concurrent GC, as it becomes very hard to predict when a concurrent GC should start so that it can finish in a timely manner before you run out of heap memory.
>>>> 
>>>> 2. The maximum number of GC threads that ZGC allows itself to use (-XX:ConcGCThreads=…) defaults to 25% of the machine, which in your case is 3 threads. When your allocation rate is that high, it seems like we need more threads to keep up. G1 in your run, by comparison, uses 10 threads.
>>>> 
>>>> Typically, in real applications, we want to keep the number of GC threads down, because they risk preempting application threads, which causes latency jitter for the program. The goal of concurrent GC isn’t to move work that was traditionally done in GC pauses into a concurrent phase for its own sake; we do it so that the GC can stay out of the way and let the program run more undisturbed. If we use as many threads as there are cores on the machine, and the concurrent GC in practice has to push the application off all the cores to finish on time, then we haven’t won much by making that work concurrent.
>>>> 
>>>> If you would like to see better numbers in your stress test, and to better understand what is happening, one thing you can play around with is, for example, setting -XX:ConcGCThreads=10 so that ZGC gets as many threads as G1 does in your run. You can also set -XX:ZAllocationSpikeTolerance=… to something higher than 2 (maybe 10?) to accommodate the large fluctuations in allocation rate, and you can configure -XX:SoftMaxHeapSize=… to some value lower than -Xmx to make the GC trigger earlier. This might make the numbers look better. But in general, this type of application behaviour is probably not what we are currently catering for.
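>>>> 
>>>> Concretely, that experiment is just a matter of adding something like
>>>> the following to the existing command line (pick whatever
>>>> SoftMaxHeapSize value fits under your -Xmx):
>>>> 
>>>>    -XX:ConcGCThreads=10
>>>>    -XX:ZAllocationSpikeTolerance=10
>>>>    -XX:SoftMaxHeapSize=<some value below -Xmx>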
>>>> 
>>>> Thanks,
>>>> /Erik
>>>> 
>>>>> On 15 Feb 2023, at 07:17, Ragnar Rova <ragnar.rova at gmail.com> wrote:
>>>>> 
>>>>> Hello!
>>>>> 
>>>>> Unsure if this is the right place to ask, but the gen-zgc page did ask
>>>>> for feedback, so here goes:
>>>>> 
>>>>> I have a workload whose latency, measured as actual service response
>>>>> times, is worse with gen-zgc than with G1, from P99 all the way down
>>>>> to P50, in a short 5-minute load test. The benchmark was run on
>>>>> aarch64 on macOS using 21-genzgc+1-8.
>>>>> 
>>>>> I have benchmarked this workload before on OpenJDK 19 using stock ZGC,
>>>>> and there, too, G1 performed better. My speculation was that ZGC
>>>>> suffered because of the generational nature of my workload's
>>>>> allocations. But when recently testing 21-genzgc+1-8, G1 still
>>>>> outperforms in terms of both throughput (which was expected) and
>>>>> latency profile (less so).
>>>>> 
>>>>> What kind of measurements would be needed to discuss this?
>>>>> 
>>>>> Attaching GC logs from a 5-minute benchmark run of genzgc and G1 on
>>>>> the same JDK build, in case that is helpful.
>>>>> <gen-zgc.log.gz><g1gc.log.gz>
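>>>>> 
>>>>> (Both logs were captured with ordinary unified GC logging, i.e.
>>>>> -Xlog:gc* or similar, so they should show the usual per-collection
>>>>> detail.)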
>>>> 
>> 
> <zgc-32g-10gcthreads.log.gz>
> <g1gc-32g-maxgcpausemillis100.log.gz>

