ZGC unstable behaviour on java 17

Mon Nov 8 13:45:55 UTC 2021

Hi Anatoly,

Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email.

It’s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar.

Anyway, is there any way for me to get access to try to reproduce your issue?

Thanks,
/Erik

> On 4 Nov 2021, at 17:00, Anatoly Deyneka <adeyneka at gmail.com> wrote:
> 
> We are running a GC intensive application with allocation rate up to
> 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G).
> openjdk 17 2021-09-14
> OpenJDK Runtime Environment (build 17+35-2724)
> OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
> 
> During migration from java 15 to java 17 we see unstable behavior when zgc
> goes to infinite loop and doesn't return back to the normal state.
> We see the following behavior:
> - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with
> 16% CPU utilization
> - something happens and CPU utilization jumps to 40%, used memory jumps to
> 19GB, zgc cycles time is increased in 10times
> - sometimes app works about 30min before the crash sometimes it's crashed
> instantly
> - if it's still alive memory is not released and the app works with used
> memory 18-20GB
> - we see allocation stalls
> - we see the following GC stats
> [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong
> References 25781.928ms
> [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink
> 14280.772 / 25769.511  1126.563 / 25769.511   217.882 / 68385.750   217.882
> / 68385.750   ms
> -  we see JVM starts massively unload classes (jvm_classes_unloaded_total
> counter keep increasing) .
> - we’ve managed to profile JVM in such ‘unstable’ state and see ZTask
> consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool)
> job. See attached image.
> 
> We played with options:
> -XX:ZCollectionInterval=5
> -XX:SoftMaxHeapSize=12G
> -XX:ZAllocationSpikeTolerance=4
> -XX:MetaspaceReclaimPolicy=aggressive
> 
> But it doesn't help. it just postpones this event.
> Only if we redirect traffic to another node it goes to normal.
> 
> Regards,
> Anatoly Deyneka
> 
> [image: image.png]
> [image: image.png]
> [image: image.png]