[External] : Re: ZGC unstable behaviour on java 17

Tue Nov 9 14:44:23 UTC 2021

Hi Anatoly,

Thank you for the additional details. This is very helpful. I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That’s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual.

The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don’t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache.

This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle.

Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a “Metadata GC Clear Soft References” GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn’t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen.

Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring.

Thanks,
/Erik

From: Anatoly Deyneka <adeyneka at gmail.com>
Sent: Tuesday, 9 November 2021 14:13
To: Erik Osterlund <erik.osterlund at oracle.com>
Cc: zgc-dev at openjdk.java.net
Subject: [External] : Re: ZGC unstable behaviour on java 17

Hi Erik,

Probably we haven't reached this condition on java 15.
We see a big difference in metaspace allocation starting from java 16.
The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy

Here is all our knowledge about the problem:
https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md<https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$>

Unfortunately we can reproduce this issue only on our production platform.
If we can provide more details or logs please let me know.

Regards,
Anatoly Deyneka

On Mon, 8 Nov 2021 at 16:46, Erik Osterlund <erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com>> wrote:
Hi Anatoly,

Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email.

It’s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar.

Anyway, is there any way for me to get access to try to reproduce your issue?

Thanks,
/Erik

> On 4 Nov 2021, at 17:00, Anatoly Deyneka <adeyneka at gmail.com<mailto:adeyneka at gmail.com>> wrote:
>
> We are running a GC intensive application with allocation rate up to
> 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G).
> openjdk 17 2021-09-14
> OpenJDK Runtime Environment (build 17+35-2724)
> OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
>
> During migration from java 15 to java 17 we see unstable behavior when zgc
> goes to infinite loop and doesn't return back to the normal state.
> We see the following behavior:
> - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with
> 16% CPU utilization
> - something happens and CPU utilization jumps to 40%, used memory jumps to
> 19GB, zgc cycles time is increased in 10times
> - sometimes app works about 30min before the crash sometimes it's crashed
> instantly
> - if it's still alive memory is not released and the app works with used
> memory 18-20GB
> - we see allocation stalls
> - we see the following GC stats
> [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong
> References 25781.928ms
> [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink
> 14280.772 / 25769.511  1126.563 / 25769.511   217.882 / 68385.750   217.882
> / 68385.750   ms
> -  we see JVM starts massively unload classes (jvm_classes_unloaded_total
> counter keep increasing) .
> - we’ve managed to profile JVM in such ‘unstable’ state and see ZTask
> consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool)
> job. See attached image.
>
> We played with options:
> -XX:ZCollectionInterval=5
> -XX:SoftMaxHeapSize=12G
> -XX:ZAllocationSpikeTolerance=4
> -XX:MetaspaceReclaimPolicy=aggressive
>
> But it doesn't help. it just postpones this event.
> Only if we redirect traffic to another node it goes to normal.
>
> Regards,
> Anatoly Deyneka
>
> [image: image.png]
> [image: image.png]
> [image: image.png]