[External] : Re: ZGC unstable behaviour on java 17

Wed Nov 10 13:03:51 UTC 2021

Thank you Erik,

Do you have a understanding of how it can be fixed?
SoftRefLRUPolicyMSPerMB - is a workaround and not a fix.
We have working shenandoah with default settings. It unloads classes
without performance regression.
BTW ZGC also can return to normal state if we temporary reduce allocation
rate by redirecting new users to another node(old users still stay on the
same node).
We did that many times when we saw such behaviour. ZGC returned to normal
state quite fast.

I guess ZGC does the slow unloading class job in the same threads that
affects overall performance of GC.
It explains why zgc cycles time is dramatically increased.

According to *Risks and Assumptions#Uncommit speed* (
https://openjdk.java.net/jeps/387):
"Metaspace reclamation can happen during a garbage-collection pause, so
this could be a problem.
We haven’t observed this problem so far, but if uncommit times become an
issue then we could offload the uncommitting work to a separate thread so
that it could be done independently of GC pauses."

We observed and fixed class allocation problem in some scenarios of using
jackson+blackbird.
So we can try to reproduce this issue.

Regards,
Anatoly Deyneka

On Wed, 10 Nov 2021 at 00:23, Erik Osterlund <erik.osterlund at oracle.com>
wrote:

> Hi Anatoly,
>
> Thank you for the additional details. Very interesting indeed. Everything
> seems to run well until you get an allocation stall due to class unloading
> taking a long time. The very same GC where the first stall happens, nuked
> quite a lot of soft references, while previous GCs did not. It really looks
> like a clear pattern wrt soft reference count and the class unloading
> headache.
>
> This seems to support the original intuition that the lambda form cache
> might have caused excessive class unloading after lots of soft references
> get cleared.
>
> The GC itself didn’t explicitly clear all soft references, but they might
> have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a
> *very* high value, the soft ref clearing won’t get triggered. That might
> make things better for you.
>
> Having said that, it seems like this cache really shouldn’t be using soft
> references if the result is this horrible when they get cleared. Unless of
> course that this is a red herring.
>
> Thanks,
> /Erik
>
> On 9 Nov 2021, at 17:39, Anatoly Deyneka <adeyneka at gmail.com> wrote:
>
> 
> Please find the log here:
>
> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0
> <https://urldefense.com/v3/__https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0__;!!ACWV5N9M2RV99hQ!a6mYXhBdfLivRQIJkyn10Z2tlh6iDn_TfHQpHBldzNBIJexWK-nR7FeTgxiyA0IE_N4$>
>
> - Maybe try without reflection inflation? E.g.
> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the
> generation of reflection class loaders and rule out that they are the
> problem.
> we would like to go with inflation. We use a lot of reflection.
> - Does it happen with other GCs?
> we tried shenandoah today and it worked fine(just 1 stall for the whole
> day)
> - Elastic Metaspace did away with huge pages in Metaspace for several
> reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?
> no, we didn't use this flag
> - Can you switch metaspace reclamation off using
> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes,
> there may be something wrong with the reclamation algorithm.
> we can try this
> - If the VM becomes unresponsive, could you also give us the output of
> `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders
> show-classes` though with 1M classes output may be large.
> good point. we will analize it.
>
> Regards,
> Anatoly Deyneka
>
> On Tue, 9 Nov 2021 at 18:26, Thomas Stüfe <thomas.stuefe at gmail.com> wrote:
>
>> Additionally to Eriks proposal, some more remarks from the side (I'll
>> just lurk in the background otherwise) :
>>
>> - Maybe try without reflection inflation? E.g.
>> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the
>> generation of reflection class loaders and rule out that they are the
>> problem.
>>
>> - Does it happen with other GCs?
>>
>> - Elastic Metaspace did away with huge pages in Metaspace for several
>> reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?
>>
>> - Can you switch metaspace reclamation off using
>> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes,
>> there may be something wrong with the reclamation algorithm.
>>
>> - If the VM becomes unresponsive, could you also give us the output of
>> `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders
>> show-classes` though with 1M classes output may be large.
>>
>> Thanks, Thomas
>>
>> On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund <erik.osterlund at oracle.com>
>> wrote:
>>
>>> Hi Anatoly,
>>>
>>> Thank you for the additional details. This is very helpful. I talked to
>>> Per a bit about this, and we feel like understanding what increased the
>>> metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to
>>> lead us to the problem. That’s more than 3 orders of magnitude higher
>>> allocation rate of metaspace memory, which might explain why class
>>> unloading is taking much longer than usual.
>>>
>>> The culprit that is being repeatedly allocated, is rather likely to also
>>> be repeatedly unloaded, or you would hit an out of memory issue. And the
>>> log snippet you showed, shows that there is a bunch of LambdaForms being
>>> unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference
>>> based cache, so that you don’t have to create a new LambdaForm, when there
>>> is one in the cache. This soft reference based policy might not work very
>>> well if all SoftReferences are cleared by the GC, because then it would
>>> seem like a lot more LambdaForms would have to be created from scratch, as
>>> opposed to being reused from the cache.
>>>
>>> This might explain why it takes a while before this starts happening.
>>> The interesting thing is figuring out what could have led to all the soft
>>> references being cleared. This could be because of metaspace exhaustion, or
>>> due to allocation stalls. In the charts we can see that the allocation rate
>>> was at the highest point, right before the incident. So without seeing the
>>> log, it seems like you could have gotten an allocation stall, which
>>> triggered a full collection clearing all soft references, which suddenly
>>> leads to class unloading taking much longer. If class unloading takes much
>>> longer, then you are quite likely to reach the same situation repeatedly,
>>> in a bad cycle.
>>>
>>> Is there any way you could share an -Xlog:gc* log with us? If not,
>>> finding out if there was an allocation stall or a “Metadata GC Clear Soft
>>> References” GC just before everything went bad, might help with figuring
>>> out what went wrong. Something that did happen between JDK 15 and JDK 17 is
>>> dynamic thread selection for the GC. I wonder if that has somehow led to a
>>> stall, which you wouldn’t see in JDK 15 with the more static thread
>>> selection. If so, then maybe setting a SoftMaxHeapSize lower than the
>>> MaxHeapSize might help avoiding stall in the first place, due to
>>> fluctuations in the workload, or alternatively turning off
>>> UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for
>>> allocation stalls not happen.
>>>
>>> Anyway, figuring out what kind of GC triggered right before the incident
>>> would help a lot I think, to figure out if forced soft ref clearing is
>>> indeed what instigated this, or if this is all just a red herring.
>>>
>>> Thanks,
>>> /Erik
>>>
>>> From: Anatoly Deyneka <adeyneka at gmail.com>
>>> Sent: Tuesday, 9 November 2021 14:13
>>> To: Erik Osterlund <erik.osterlund at oracle.com>
>>> Cc: zgc-dev at openjdk.java.net
>>> Subject: [External] : Re: ZGC unstable behaviour on java 17
>>>
>>> Hi Erik,
>>>
>>> Probably we haven't reached this condition on java 15.
>>> We see a big difference in metaspace allocation starting from java 16.
>>> The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy
>>>
>>> Here is all our knowledge about the problem:
>>>
>>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md
>>> <https://urldefense.com/v3/__https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!a6mYXhBdfLivRQIJkyn10Z2tlh6iDn_TfHQpHBldzNBIJexWK-nR7FeTgxiy18HZddw$>
>>> <
>>> https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$
>>> >
>>>
>>> Unfortunately we can reproduce this issue only on our production
>>> platform.
>>> If we can provide more details or logs please let me know.
>>>
>>> Regards,
>>> Anatoly Deyneka
>>>
>>> On Mon, 8 Nov 2021 at 16:46, Erik Osterlund <erik.osterlund at oracle.com
>>> <mailto:erik.osterlund at oracle.com>> wrote:
>>> Hi Anatoly,
>>>
>>> Thanks for reporting this. This looks rather odd indeed. Unfortunately,
>>> the attached images have been stripped from your original email.
>>>
>>> It’s particularly interesting that you see this behaviour when migrating
>>> from JDK 15 to 17. There have been a few bug fixes, but nothing I can
>>> easily imagine causing this. I have been surprised before though. One
>>> notable thing in the class unloading path that has changed though, is the
>>> elastic metaspace. I see that you use the
>>> -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15.
>>> This makes me wonder if this reproduces also without that option, i.e. the
>>> way that you run the workload is more similar.
>>>
>>> Anyway, is there any way for me to get access to try to reproduce your
>>> issue?
>>>
>>> Thanks,
>>> /Erik
>>>
>>> > On 4 Nov 2021, at 17:00, Anatoly Deyneka <adeyneka at gmail.com<mailto:
>>> adeyneka at gmail.com>> wrote:
>>> >
>>> > We are running a GC intensive application with allocation rate up to
>>> > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G).
>>> > openjdk 17 2021-09-14
>>> > OpenJDK Runtime Environment (build 17+35-2724)
>>> > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
>>> >
>>> > During migration from java 15 to java 17 we see unstable behavior when
>>> zgc
>>> > goes to infinite loop and doesn't return back to the normal state.
>>> > We see the following behavior:
>>> > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory
>>> with
>>> > 16% CPU utilization
>>> > - something happens and CPU utilization jumps to 40%, used memory
>>> jumps to
>>> > 19GB, zgc cycles time is increased in 10times
>>> > - sometimes app works about 30min before the crash sometimes it's
>>> crashed
>>> > instantly
>>> > - if it's still alive memory is not released and the app works with
>>> used
>>> > memory 18-20GB
>>> > - we see allocation stalls
>>> > - we see the following GC stats
>>> > [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong
>>> > References 25781.928ms
>>> > [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink
>>> > 14280.772 / 25769.511  1126.563 / 25769.511   217.882 / 68385.750
>>>  217.882
>>> > / 68385.750   ms
>>> > -  we see JVM starts massively unload classes
>>> (jvm_classes_unloaded_total
>>> > counter keep increasing) .
>>> > - we’ve managed to profile JVM in such ‘unstable’ state and see ZTask
>>> > consuming a lot of cpu doing
>>> CompiledMethod::unload_nmethod_caches(bool)
>>> > job. See attached image.
>>> >
>>> > We played with options:
>>> > -XX:ZCollectionInterval=5
>>> > -XX:SoftMaxHeapSize=12G
>>> > -XX:ZAllocationSpikeTolerance=4
>>> > -XX:MetaspaceReclaimPolicy=aggressive
>>> >
>>> > But it doesn't help. it just postpones this event.
>>> > Only if we redirect traffic to another node it goes to normal.
>>> >
>>> > Regards,
>>> > Anatoly Deyneka
>>> >
>>> > [image: image.png]
>>> > [image: image.png]
>>> > [image: image.png]
>>>
>>