[External] : Re: ZGC unstable behaviour on java 17

Tue Nov 9 15:25:54 UTC 2021

Additionally to Eriks proposal, some more remarks from the side (I'll just
lurk in the background otherwise) :

- Maybe try without reflection inflation? E.g.
`-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the
generation of reflection class loaders and rule out that they are the
problem.

- Does it happen with other GCs?

- Elastic Metaspace did away with huge pages in Metaspace for several
reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?

- Can you switch metaspace reclamation off using
`-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes,
there may be something wrong with the reclamation algorithm.

- If the VM becomes unresponsive, could you also give us the output of
`jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders
show-classes` though with 1M classes output may be large.

Thanks, Thomas

On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund <erik.osterlund at oracle.com>
wrote:

> Hi Anatoly,
>
> Thank you for the additional details. This is very helpful. I talked to
> Per a bit about this, and we feel like understanding what increased the
> metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to
> lead us to the problem. That’s more than 3 orders of magnitude higher
> allocation rate of metaspace memory, which might explain why class
> unloading is taking much longer than usual.
>
> The culprit that is being repeatedly allocated, is rather likely to also
> be repeatedly unloaded, or you would hit an out of memory issue. And the
> log snippet you showed, shows that there is a bunch of LambdaForms being
> unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference
> based cache, so that you don’t have to create a new LambdaForm, when there
> is one in the cache. This soft reference based policy might not work very
> well if all SoftReferences are cleared by the GC, because then it would
> seem like a lot more LambdaForms would have to be created from scratch, as
> opposed to being reused from the cache.
>
> This might explain why it takes a while before this starts happening. The
> interesting thing is figuring out what could have led to all the soft
> references being cleared. This could be because of metaspace exhaustion, or
> due to allocation stalls. In the charts we can see that the allocation rate
> was at the highest point, right before the incident. So without seeing the
> log, it seems like you could have gotten an allocation stall, which
> triggered a full collection clearing all soft references, which suddenly
> leads to class unloading taking much longer. If class unloading takes much
> longer, then you are quite likely to reach the same situation repeatedly,
> in a bad cycle.
>
> Is there any way you could share an -Xlog:gc* log with us? If not, finding
> out if there was an allocation stall or a “Metadata GC Clear Soft
> References” GC just before everything went bad, might help with figuring
> out what went wrong. Something that did happen between JDK 15 and JDK 17 is
> dynamic thread selection for the GC. I wonder if that has somehow led to a
> stall, which you wouldn’t see in JDK 15 with the more static thread
> selection. If so, then maybe setting a SoftMaxHeapSize lower than the
> MaxHeapSize might help avoiding stall in the first place, due to
> fluctuations in the workload, or alternatively turning off
> UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for
> allocation stalls not happen.
>
> Anyway, figuring out what kind of GC triggered right before the incident
> would help a lot I think, to figure out if forced soft ref clearing is
> indeed what instigated this, or if this is all just a red herring.
>
> Thanks,
> /Erik
>
> From: Anatoly Deyneka <adeyneka at gmail.com>
> Sent: Tuesday, 9 November 2021 14:13
> To: Erik Osterlund <erik.osterlund at oracle.com>
> Cc: zgc-dev at openjdk.java.net
> Subject: [External] : Re: ZGC unstable behaviour on java 17
>
> Hi Erik,
>
> Probably we haven't reached this condition on java 15.
> We see a big difference in metaspace allocation starting from java 16.
> The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy
>
> Here is all our knowledge about the problem:
>
> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md
> <
> https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$
> >
>
> Unfortunately we can reproduce this issue only on our production platform.
> If we can provide more details or logs please let me know.
>
> Regards,
> Anatoly Deyneka
>
> On Mon, 8 Nov 2021 at 16:46, Erik Osterlund <erik.osterlund at oracle.com
> <mailto:erik.osterlund at oracle.com>> wrote:
> Hi Anatoly,
>
> Thanks for reporting this. This looks rather odd indeed. Unfortunately,
> the attached images have been stripped from your original email.
>
> It’s particularly interesting that you see this behaviour when migrating
> from JDK 15 to 17. There have been a few bug fixes, but nothing I can
> easily imagine causing this. I have been surprised before though. One
> notable thing in the class unloading path that has changed though, is the
> elastic metaspace. I see that you use the
> -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15.
> This makes me wonder if this reproduces also without that option, i.e. the
> way that you run the workload is more similar.
>
> Anyway, is there any way for me to get access to try to reproduce your
> issue?
>
> Thanks,
> /Erik
>
> > On 4 Nov 2021, at 17:00, Anatoly Deyneka <adeyneka at gmail.com<mailto:
> adeyneka at gmail.com>> wrote:
> >
> > We are running a GC intensive application with allocation rate up to
> > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G).
> > openjdk 17 2021-09-14
> > OpenJDK Runtime Environment (build 17+35-2724)
> > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
> >
> > During migration from java 15 to java 17 we see unstable behavior when
> zgc
> > goes to infinite loop and doesn't return back to the normal state.
> > We see the following behavior:
> > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with
> > 16% CPU utilization
> > - something happens and CPU utilization jumps to 40%, used memory jumps
> to
> > 19GB, zgc cycles time is increased in 10times
> > - sometimes app works about 30min before the crash sometimes it's crashed
> > instantly
> > - if it's still alive memory is not released and the app works with used
> > memory 18-20GB
> > - we see allocation stalls
> > - we see the following GC stats
> > [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong
> > References 25781.928ms
> > [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink
> > 14280.772 / 25769.511  1126.563 / 25769.511   217.882 / 68385.750
>  217.882
> > / 68385.750   ms
> > -  we see JVM starts massively unload classes (jvm_classes_unloaded_total
> > counter keep increasing) .
> > - we’ve managed to profile JVM in such ‘unstable’ state and see ZTask
> > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool)
> > job. See attached image.
> >
> > We played with options:
> > -XX:ZCollectionInterval=5
> > -XX:SoftMaxHeapSize=12G
> > -XX:ZAllocationSpikeTolerance=4
> > -XX:MetaspaceReclaimPolicy=aggressive
> >
> > But it doesn't help. it just postpones this event.
> > Only if we redirect traffic to another node it goes to normal.
> >
> > Regards,
> > Anatoly Deyneka
> >
> > [image: image.png]
> > [image: image.png]
> > [image: image.png]
>