[External] : Re: ZGC unstable behaviour on java 17
Anatoly Deyneka
adeyneka at gmail.com
Tue Nov 9 16:39:29 UTC 2021
Please find the log here:
https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0
> - Maybe try without reflection inflation? E.g.
> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the
> generation of reflection class loaders and rule out that they are the
> problem.

We would like to keep inflation; we use a lot of reflection.

> - Does it happen with other GCs?

We tried Shenandoah today and it worked fine (just one stall for the whole day).

> - Elastic Metaspace did away with huge pages in Metaspace for several
> reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?

No, we didn't use this flag.

> - Can you switch metaspace reclamation off using
> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes,
> there may be something wrong with the reclamation algorithm.

We can try this.

> - If the VM becomes unresponsive, could you also give us the output of
> `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders
> show-classes` though with 1M classes output may be large.

Good point. We will analyze it.
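For reference, a minimal sketch of the experiment and diagnostics discussed above; the launcher line, `app.jar` and `<pid>` are illustrative placeholders, not our actual setup:

    # run with reflection inflation effectively disabled and metaspace
    # reclamation switched off, to isolate the two suspects discussed above
    java -XX:+UseZGC \
         -Dsun.reflect.inflationThreshold=2147483647 \
         -XX:MetaspaceReclaimPolicy=none \
         -Xlog:gc*:file=gc.log \
         -jar app.jar

    # when the VM becomes unresponsive, capture metaspace statistics
    jcmd <pid> VM.metaspace
    jcmd <pid> VM.metaspace show-loaders show-classes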
Regards,
Anatoly Deyneka
On Tue, 9 Nov 2021 at 18:26, Thomas Stüfe <thomas.stuefe at gmail.com> wrote:
> In addition to Erik's proposal, some more remarks from the side (I'll just
> lurk in the background otherwise):
>
> - Maybe try without reflection inflation? E.g.
> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the
> generation of reflection class loaders and rule out that they are the
> problem.
>
> - Does it happen with other GCs?
>
> - Elastic Metaspace did away with huge pages in Metaspace for several
> reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?
>
> - Can you switch metaspace reclamation off using
> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes,
> there may be something wrong with the reclamation algorithm.
>
> - If the VM becomes unresponsive, could you also give us the output of
> `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders
> show-classes` though with 1M classes output may be large.
>
> Thanks, Thomas
>
> On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund <erik.osterlund at oracle.com>
> wrote:
>
>> Hi Anatoly,
>>
>> Thank you for the additional details. This is very helpful. I talked to
>> Per a bit about this, and we feel that understanding what increased the
>> metaspace allocation rate from 1.5 KB/s to 40 MB/s is rather likely to
>> lead us to the problem. That is more than four orders of magnitude
>> (roughly 27,000x) higher metaspace allocation rate, which might explain
>> why class unloading is taking much longer than usual.
>>
>> The culprit that is being repeatedly allocated is rather likely to also
>> be repeatedly unloaded, or you would hit an out-of-memory issue. And the
>> log snippet you showed shows that a bunch of LambdaForms are being
>> unloaded. Interestingly enough, the LambdaFormEditor has a
>> SoftReference-based cache, so that you don't have to create a new
>> LambdaForm when there is one in the cache. This soft-reference-based
>> policy might not work very well if all SoftReferences are cleared by the
>> GC, because then it would seem that a lot more LambdaForms would have to
>> be created from scratch, as opposed to being reused from the cache.
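Not the actual LambdaFormEditor code, but a minimal, hypothetical Java sketch of a SoftReference-based cache, to illustrate why a collection that clears all soft references forces every cached value to be rebuilt from scratch:

    import java.lang.ref.SoftReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Illustration only - not the JDK's LambdaFormEditor cache.
    final class SoftCache<K, V> {
        private final Map<K, SoftReference<V>> cache = new ConcurrentHashMap<>();

        // Returns the cached value, or rebuilds it if the GC has cleared
        // the SoftReference (e.g. after a "Clear Soft References" collection).
        V get(K key, Function<K, V> create) {
            SoftReference<V> ref = cache.get(key);
            V value = (ref != null) ? ref.get() : null;
            if (value == null) {
                value = create.apply(key);  // expensive re-creation path
                cache.put(key, new SoftReference<>(value));
            }
            return value;
        }
    }

Once the GC clears every SoftReference at once, every lookup goes down the expensive re-creation path, which would match the burst of newly created (and later unloaded) LambdaForms described above.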
>>
>> This might explain why it takes a while before this starts happening. The
>> interesting thing is figuring out what could have led to all the soft
>> references being cleared. This could be because of metaspace exhaustion, or
>> due to allocation stalls. In the charts we can see that the allocation rate
>> was at the highest point, right before the incident. So without seeing the
>> log, it seems like you could have gotten an allocation stall, which
>> triggered a full collection clearing all soft references, which suddenly
>> leads to class unloading taking much longer. If class unloading takes much
>> longer, then you are quite likely to reach the same situation repeatedly,
>> in a bad cycle.
>>
>> Is there any way you could share an -Xlog:gc* log with us? If not,
>> finding out whether there was an allocation stall or a “Metadata GC Clear
>> Soft References” GC just before everything went bad might help with
>> figuring out what went wrong. Something that did happen between JDK 15 and
>> JDK 17 is dynamic thread selection for the GC. I wonder if that has
>> somehow led to a stall, which you wouldn’t see in JDK 15 with the more
>> static thread selection. If so, then maybe setting a SoftMaxHeapSize lower
>> than the MaxHeapSize might help avoid stalls in the first place, due to
>> fluctuations in the workload, or alternatively turning off
>> UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for
>> allocation stalls not to happen.
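Purely as an illustration of the two mitigations suggested above (the heap and thread-count values below are placeholders that would need to be sized for the actual workload):

    # leave headroom below the max heap so ZGC starts cycles earlier
    java -XX:+UseZGC -Xms20g -Xmx20g -XX:SoftMaxHeapSize=16g ...

    # or pin the number of concurrent GC threads instead of letting the VM
    # choose it dynamically
    java -XX:+UseZGC -XX:-UseDynamicNumberOfGCThreads -XX:ConcGCThreads=8 ...

    # in either case, keep full GC logging on to spot allocation stalls and
    # "Metadata GC Clear Soft References" collections
    java ... -Xlog:gc*:file=gc.log ...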
>>
>> Anyway, figuring out what kind of GC triggered right before the incident
>> would help a lot I think, to figure out if forced soft ref clearing is
>> indeed what instigated this, or if this is all just a red herring.
>>
>> Thanks,
>> /Erik
>>
>> From: Anatoly Deyneka <adeyneka at gmail.com>
>> Sent: Tuesday, 9 November 2021 14:13
>> To: Erik Osterlund <erik.osterlund at oracle.com>
>> Cc: zgc-dev at openjdk.java.net
>> Subject: [External] : Re: ZGC unstable behaviour on java 17
>>
>> Hi Erik,
>>
>> Probably we haven't reached this condition on Java 15.
>> We see a big difference in metaspace allocation starting from Java 16.
>> The behaviour does not change with any option of -XX:MetaspaceReclaimPolicy.
>>
>> Here is everything we know about the problem:
>>
>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md
>>
>> Unfortunately we can reproduce this issue only on our production platform.
>> If we can provide more details or logs please let me know.
>>
>> Regards,
>> Anatoly Deyneka
>>
>> On Mon, 8 Nov 2021 at 16:46, Erik Osterlund <erik.osterlund at oracle.com> wrote:
>> Hi Anatoly,
>>
>> Thanks for reporting this. This looks rather odd indeed. Unfortunately,
>> the attached images have been stripped from your original email.
>>
>> It’s particularly interesting that you see this behaviour when migrating
>> from JDK 15 to 17. There have been a few bug fixes, but nothing I can
>> easily imagine causing this. I have been surprised before though. One
>> notable thing that has changed in the class unloading path is the
>> elastic metaspace. I see that you use the
>> -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK 15.
>> This makes me wonder if this also reproduces without that option, i.e.
>> when the workload is run in a way more similar to how it ran on JDK 15.
>>
>> Anyway, is there any way for me to get access to try to reproduce your
>> issue?
>>
>> Thanks,
>> /Erik
>>
>> > On 4 Nov 2021, at 17:00, Anatoly Deyneka <adeyneka at gmail.com> wrote:
>> >
>> > We are running a GC-intensive application with an allocation rate of up
>> > to 1.5 GB/s (4k requests/sec) on a 32-vCPU instance (-Xms20G -Xmx20G).
>> > openjdk 17 2021-09-14
>> > OpenJDK Runtime Environment (build 17+35-2724)
>> > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
>> >
>> > During migration from Java 15 to Java 17 we see unstable behavior where
>> > ZGC goes into an infinite loop and doesn't return to the normal state.
>> > We see the following behavior:
>> > - it runs perfectly (~30% better than on Java 15): 5-7GB used memory
>> >   with 16% CPU utilization
>> > - something happens and CPU utilization jumps to 40%, used memory jumps
>> >   to 19GB, and ZGC cycle times increase tenfold
>> > - sometimes the app works for about 30 min before the crash, sometimes
>> >   it crashes instantly
>> > - if it's still alive, memory is not released and the app runs with
>> >   18-20GB of used memory
>> > - we see allocation stalls
>> > - we see the following GC stats:
>> >   [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong
>> >   References 25781.928ms
>> >   [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink
>> >   14280.772 / 25769.511  1126.563 / 25769.511  217.882 / 68385.750
>> >   217.882 / 68385.750 ms
>> > - we see the JVM start to massively unload classes (the
>> >   jvm_classes_unloaded_total counter keeps increasing)
>> > - we’ve managed to profile the JVM in such an ‘unstable’ state and see
>> >   a ZTask consuming a lot of CPU doing
>> >   CompiledMethod::unload_nmethod_caches(bool) work. See attached image.
>> >
>> > We played with options:
>> > -XX:ZCollectionInterval=5
>> > -XX:SoftMaxHeapSize=12G
>> > -XX:ZAllocationSpikeTolerance=4
>> > -XX:MetaspaceReclaimPolicy=aggressive
>> >
>> > But it doesn't help; it just postpones the event.
>> > Only if we redirect traffic to another node does it go back to normal.
>> >
>> > Regards,
>> > Anatoly Deyneka
>> >
>> > [image: image.png]
>> > [image: image.png]
>> > [image: image.png]
>>
>