[External] : Re: ZGC unstable behaviour on java 17

Anatoly Deyneka adeyneka at gmail.com
Fri Nov 19 10:45:04 UTC 2021


Thank you,

I'm surprised by how quickly it has been fixed.
Looking forward to checking it out.

Anatoly


On Thu, 18 Nov 2021 at 13:30, Stefan Karlsson <stefan.karlsson at oracle.com>
wrote:

> Hi Anatoly,
>
> We've identified a 10-year-old bug in the inline cache cleaning code.
> See this pull request:
> https://github.com/openjdk/jdk/pull/6450
>
> It's likely that this is the cause of the problems you are seeing in
> your workload. Note that this bug is particularly bad for concurrent
> class unloading GCs (ZGC, Shenandoah), while the problem seems less
> severe for stop-the-world class unloading GCs (G1, Parallel, Serial).
>
> The plan right now is to get this into JDK 18, and then backport it to
> JDK 17.
>
> Thanks again for reporting this!
>
> StefanK
>
> On 2021-11-12 16:46, Stefan Karlsson wrote:
> > Hi Anatoly,
> >
> > On 2021-11-11 11:19, Anatoly Deyneka wrote:
> >> It's great you can reproduce it.
> >> Is there anything else we can help you with?
> >
> > I think we have everything we need for now. The hints about the
> > problem with refill_ic_stubs were great, and that's what we are
> > currently investigating on our side.
> >
> > Thanks,
> > StefanK
> >
> >>
> >>     On that note, it isn’t clear to me if the 2 orders of magnitude
> >>     higher metaspace allocation rate was observed before or after the
> >>     JSON issues were addressed, as you described.
> >>
> >> Before the fix of ObjectMapper.copy() we had a 250 MB/s metaspace
> >> allocation rate and OutOfMemoryError (metaspace).
> >> BTW, with Shenandoah we see about 6 MB/s, with rare spikes up to 100 MB/s.
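> >>
> >> For reference, roughly the pattern involved - a sketch with assumed
> >> names, not our actual production code:
> >>
> >> import com.fasterxml.jackson.databind.ObjectMapper;
> >>
> >> public class MapperUsage {
> >>     // Problematic: copy() per call discards cached (de)serializers
> >>     // and, with modules like Blackbird, the generated accessor
> >>     // classes, so every call can allocate fresh classes in metaspace.
> >>     static SomeBean[] slowPath(ObjectMapper base, byte[] body) throws Exception {
> >>         return base.copy().readValue(body, SomeBean[].class);
> >>     }
> >>
> >>     // Fixed: reuse one configured mapper so its caches stay warm.
> >>     static SomeBean[] fastPath(ObjectMapper base, byte[] body) throws Exception {
> >>         return base.readValue(body, SomeBean[].class);
> >>     }
> >> }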
> >>
> >> Regards,
> >> Anatoly Deyneka
> >>
> >>
> >>
> >> On Wed, 10 Nov 2021 at 20:07, Stefan Karlsson
> >> <stefan.karlsson at oracle.com> wrote:
> >>
> >>     Hi,
> >>
> >>     I've managed to reproduce the long Concurrent Classes Unlink phase
> >>     by running an edited Blackbird microbenchmark.
> >>
> >>     First add an ObjectMapper#copy call to beanArrayDeser:
> >>
> >> https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
> >>
> >>     ---
> >>     diff --git a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
> >>     index 3675286..af05d54 100644
> >>     --- a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
> >>     +++ b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
> >>     @@ -80,7 +80,7 @@ public abstract class BaseBenchmark {
> >>
> >>          @Benchmark
> >>          public SomeBean[] beanArrayDeser() throws Exception {
> >>     -        return mapper.readValue(beansBytes, SomeBean[].class);
> >>     +        return mapper.copy().readValue(beansBytes, SomeBean[].class);
> >>          }
> >>
> >>          @Benchmark
> >>     ---
> >>
> >>     Then change the ICBuffer size to 100, and run the following:
> >>     java -Xlog:gc* -Xmx20g -XX:+UseZGC -XX:MetaspaceReclaimPolicy=none
> >>     -XX:-ZProactive -jar target/benchmarks.jar
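> >>
> >>     (The ICBuffer size change is presumably the constant in
> >>     InlineCacheBuffer::initialize() shown in your patch below, shrunk
> >>     from 10*K to 100 to force frequent refill safepoints - as a sketch:
> >>     -  _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer");
> >>     +  _buffer = new StubQueue(new ICStubInterface, 100, InlineCacheBuffer_lock, "InlineCacheBuffer");
> >>     )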
> >>
> >>     First run:
> >>     [180,168s][info][gc,stats    ]    Subphase: Concurrent Classes Unlink    495,386 / 495,386    7002,041 / 24368,855   7002,041 / 24368,855   7002,041 / 24368,855   ms
> >>
> >>     Second run:
> >>     [140,170s][info][gc,stats    ]    Subphase: Concurrent Classes Unlink    67606,542 / 134428,068   67606,542 / 134428,068   67606,542 / 134428,068   67606,542 / 134428,068   ms
> >>
> >>     StefanK
> >>
> >>     On 2021-11-10 17:18, Erik Osterlund wrote:
> >>>     Hi Anatoly,
> >>>
> >>>     I think that we can now see why a large cluster of classes gets
> >>> unloaded and, as you say, possibly control that, as a workaround,
> >>> with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question
> >>> is what makes the unloading take so long.
> >>>
> >>>     One theory is that it is related to metaspace uncommitting being
> >>> more aggressive. It seemed though like you had tried the more
> >>> balanced mode and it still was an issue, if I understood you correctly.
> >>>
> >>>     The second theory would be that it is something in the code
> >>> cache unloading that takes a long time. And it seems like when you
> >>> introspected it, the flame graph suggests that it is the code cache
> >>> unloading that takes long. So this seems like a more likely
> >>> explanation. And I can see from the GC logs that you have a *lot* of
> >>> nmethods being loaded. And many thousands of them get unloaded,
> >>> which leads to a huge inline cache cleaning task, unlinking all
> >>> these thousands of nmethods. This inline cache cleaning fills up a
> >>> buffer (ICBuffer) that is currently 10 KB large (rather small and
> >>> has not changed since ancient times, perhaps it is time to change
> >>> it). When this buffer fills up, you need to trigger an IC buffer
> >>> refilling safepoint, as you have observed in the logs. The fact that
> >>> loads of these safepoints trigger incrementally suggests that, for
> >>> this large number of nmethods being unloaded, the buffer size is too
> >>> small. When it is far too small, the code cache walk during the
> >>> unlinking phase becomes quadratic in the number of nmethods. And you
> >>> have *loads* of nmethods. That could explain why the code cache
> >>> unloading takes so long.
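> >>>
> >>>     (Roughly: with N nmethods to unlink and an IC buffer that holds
> >>> only b stubs at a time, you need about N/b refill safepoints, and
> >>> each round re-walks on the order of N nmethods, so the total work
> >>> grows like N^2/b. Growing the buffer directly shrinks that bound.)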
> >>>
> >>>     In the solution space for scaling class unloading to these
> >>> large levels, the long-term fix is probably to remove inline caches
> >>> completely, which I have worked on a bit, but it is not going to
> >>> happen overnight. A simpler fix that might actually happen overnight
> >>> is to increase the size of the IC buffer, like this:
> >>>
> >>>     diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp
> >>>     index 1da48f0bd13..382fa8650c5 100644
> >>>     --- a/src/hotspot/share/code/icBuffer.cpp
> >>>     +++ b/src/hotspot/share/code/icBuffer.cpp
> >>>     @@ -140,7 +140,7 @@ void ICStub::print() {
> >>>
> >>>     void InlineCacheBuffer::initialize() {
> >>>       if (_buffer != NULL) return; // already initialized
> >>>     -  _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer");
> >>>     +  _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer");
> >>>       assert (_buffer != NULL, "cannot allocate InlineCacheBuffer");
> >>>     }
> >>>
> >>>     That would probably bring down the time spent doing nmethod
> >>> unlinking by a *lot*. If you know how to build the JDK, that could
> >>> be an easy patch to try out, to significantly improve the
> >>> scalability of code cache unlinking.
> >>>
> >>>     However, one relevant question is also if this is really a
> >>> scalability problem in the GC unloading path, or if the real culprit
> >>> is elsewhere. The fact that you see two orders of magnitude higher
> >>> allocation rates in metaspace (!!!) when upgrading from 15 to 17,
> >>> might explain why the unloading code suddenly gets scalability
> >>> stressed so hard in the first place. And perhaps whatever caused
> >>> that 2 orders of magnitude increase is the real culprit here.
> >>> Having said that, we might want to improve on the class unloading
> >>> scalability anyway, but I do wonder why this is being stressed at
> >>> all, and why there are so many dying classes and nmethods. On that
> >>> note, it isn’t clear to me if the 2 orders of magnitude higher
> >>> metaspace allocation rate was observed before or after the JSON
> >>> issues were addressed, as you described.
> >>>
> >>>     /Erik
> >>>
> >>>     From: Anatoly Deyneka <adeyneka at gmail.com>
> >>>     Sent: Wednesday, 10 November 2021 14:04
> >>>     To: Erik Osterlund <erik.osterlund at oracle.com>
> >>>     Cc: Thomas Stüfe <thomas.stuefe at gmail.com>; zgc-dev at openjdk.java.net
> >>>     Subject: Re: [External] : Re: ZGC unstable behaviour on java 17
> >>>
> >>>     Thank you Erik,
> >>>
> >>>     Do you have an understanding of how it can be fixed?
> >>>     SoftRefLRUPolicyMSPerMB is a workaround, not a fix.
> >>>     Shenandoah works for us with default settings; it unloads classes
> >>> without a performance regression.
> >>>     BTW, ZGC can also return to a normal state if we temporarily
> >>> reduce the allocation rate by redirecting new users to another node
> >>> (old users still stay on the same node).
> >>>     We did that many times when we saw such behaviour, and ZGC
> >>> returned to a normal state quite fast.
> >>>
> >>>     I guess ZGC does the slow class-unloading job in the same GC
> >>> threads, which affects the overall performance of the GC.
> >>>     That would explain why ZGC cycle times increase so dramatically.
> >>>
> >>>     According to Risks and Assumptions#Uncommit speed
> >>> (https://openjdk.java.net/jeps/387):
> >>>     "Metaspace reclamation can happen during a garbage-collection
> >>> pause, so this could be a problem.
> >>>     We haven’t observed this problem so far, but if uncommit times
> >>> become an issue then we could offload the uncommitting work to a
> >>> separate thread so that it could be done independently of GC pauses."
> >>>
> >>>     We observed and fixed a class allocation problem in some
> >>> scenarios involving jackson+blackbird.
> >>>     So we can try to reproduce this issue.
> >>>
> >>>     Regards,
> >>>     Anatoly Deyneka
> >>>
> >>>     On Wed, 10 Nov 2021 at 00:23, Erik Osterlund
> >>> <erik.osterlund at oracle.com> wrote:
> >>>     Hi Anatoly,
> >>>
> >>>     Thank you for the additional details. Very interesting indeed.
> >>> Everything seems to run well until you get an allocation stall due
> >>> to class unloading taking a long time. The very same GC where the
> >>> first stall happens, nuked quite a lot of soft references, while
> >>> previous GCs did not. It really looks like a clear pattern wrt soft
> >>> reference count and the class unloading headache.
> >>>
> >>>     This seems to support the original intuition that the lambda
> >>> form cache might have caused excessive class unloading after lots of
> >>> soft references get cleared.
> >>>
> >>>     The GC itself didn’t explicitly clear all soft references, but
> >>> they might have timed out at the same time. By setting
> >>> SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref
> >>> clearing won’t get triggered. That might make things better for you.
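> >>>
> >>>     As a sketch, with an illustrative rather than tuned value (the
> >>> default is 1000 ms per free MB):
> >>>
> >>>     java -XX:+UseZGC -XX:SoftRefLRUPolicyMSPerMB=1000000 -jar app.jar
> >>>
> >>>     where app.jar stands in for your actual launch command.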
> >>>
> >>>     Having said that, it seems like this cache really shouldn’t be
> >>> using soft references if the result is this horrible when they get
> >>> cleared. Unless, of course, this is a red herring.
> >>>
> >>>     Thanks,
> >>>     /Erik
> >>>
> >>>
> >>>     On 9 Nov 2021, at 17:39, Anatoly Deyneka
> >>> <adeyneka at gmail.com> wrote:
> >>>     
> >>>     Please find the log here:
> >>>
> >>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0
> >>>
> >>>
> >>>     - Maybe try without reflection inflation? E.g.
> >>> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit
> >>> the generation of reflection class loaders and rule out that they
> >>> are the problem.
> >>>     We would like to keep inflation; we use a lot of reflection.
> >>>     - Does it happen with other GCs?
> >>>     We tried Shenandoah today and it worked fine (just 1 stall for
> >>> the whole day).
> >>>     - Elastic Metaspace did away with huge pages in Metaspace for
> >>> several reasons. Did the old scenario use
> >>> -XX:+UseLargePagesInMetaspace ?
> >>>     No, we didn't use this flag.
> >>>     - Can you switch metaspace reclamation off using
> >>> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if
> >>> yes, there may be something wrong with the reclamation algorithm.
> >>>     We can try this.
> >>>     - If the VM becomes unresponsive, could you also give us the
> >>> output of `jcmd VM.metaspace` ? Also, if possible, `jcmd
> >>> VM.metaspace show-loaders show-classes` though with 1M classes
> >>> output may be large.
> >>>     Good point, we will analyze it.
> >>>
> >>>     Regards,
> >>>     Anatoly Deyneka
> >>>
> >>>     On Tue, 9 Nov 2021 at 18:26, Thomas Stüfe
> >>> <thomas.stuefe at gmail.com> wrote:
> >>>     In addition to Erik's proposal, some more remarks from the
> >>> sidelines (I'll just lurk in the background otherwise):
> >>>
> >>>     - Maybe try without reflection inflation? E.g.
> >>> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit
> >>> the generation of reflection class loaders and rule out that they
> >>> are the problem.
> >>>
> >>>     - Does it happen with other GCs?
> >>>
> >>>     - Elastic Metaspace did away with huge pages in Metaspace for
> >>> several reasons. Did the old scenario use
> >>> -XX:+UseLargePagesInMetaspace ?
> >>>
> >>>     - Can you switch metaspace reclamation off using
> >>> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if
> >>> yes, there may be something wrong with the reclamation algorithm.
> >>>
> >>>     - If the VM becomes unresponsive, could you also give us the
> >>> output of `jcmd VM.metaspace` ? Also, if possible, `jcmd
> >>> VM.metaspace show-loaders show-classes` though with 1M classes
> >>> output may be large.
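> >>>
> >>>     Usage sketch, assuming the JVM's pid is 12345:
> >>>
> >>>     jcmd 12345 VM.metaspace
> >>>     jcmd 12345 VM.metaspace show-loaders show-classes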
> >>>
> >>>     Thanks, Thomas
> >>>
> >>>     On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund
> >>> <erik.osterlund at oracle.com> wrote:
> >>>     Hi Anatoly,
> >>>
> >>>     Thank you for the additional details. This is very helpful. I
> >>> talked to Per a bit about this, and we feel like understanding what
> >>> increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s is
> >>> rather likely to lead us to the problem. That’s more than 3 orders
> >>> of magnitude higher allocation rate of metaspace memory, which might
> >>> explain why class unloading is taking much longer than usual.
> >>>
> >>>     The culprit that is being repeatedly allocated is rather likely
> >>> to also be repeatedly unloaded, or you would hit an out-of-memory
> >>> issue. And the log snippet you showed shows that there is a bunch
> >>> of LambdaForms being unloaded. Interestingly enough, the
> >>> LambdaFormEditor has a SoftReference-based cache, so that you don’t
> >>> have to create a new LambdaForm when there is one in the cache.
> >>> This soft-reference-based policy might not work very well if all
> >>> SoftReferences are cleared by the GC, because then a lot more
> >>> LambdaForms would have to be created from scratch, as opposed to
> >>> being reused from the cache.
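> >>>
> >>>     A minimal sketch of that kind of cache pattern, with made-up
> >>> names rather than the actual java.lang.invoke code:
> >>>
> >>>     import java.lang.ref.SoftReference;
> >>>     import java.util.Map;
> >>>     import java.util.concurrent.ConcurrentHashMap;
> >>>     import java.util.function.Function;
> >>>
> >>>     class SoftCache<K, V> {
> >>>         private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
> >>>
> >>>         // If the GC cleared the SoftReference, rebuild from scratch.
> >>>         // Cheap while clearing is rare; very expensive when one full
> >>>         // collection clears *all* soft references at once.
> >>>         V get(K key, Function<K, V> create) {
> >>>             SoftReference<V> ref = map.get(key);
> >>>             V value = (ref == null) ? null : ref.get();
> >>>             if (value == null) {
> >>>                 value = create.apply(key);
> >>>                 map.put(key, new SoftReference<>(value));
> >>>             }
> >>>             return value;
> >>>         }
> >>>     }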
> >>>
> >>>     This might explain why it takes a while before this starts
> >>> happening. The interesting thing is figuring out what could have led
> >>> to all the soft references being cleared. This could be because of
> >>> metaspace exhaustion, or due to allocation stalls. In the charts we
> >>> can see that the allocation rate was at the highest point, right
> >>> before the incident. So without seeing the log, it seems like you
> >>> could have gotten an allocation stall, which triggered a full
> >>> collection clearing all soft references, which suddenly leads to
> >>> class unloading taking much longer. If class unloading takes much
> >>> longer, then you are quite likely to reach the same situation
> >>> repeatedly, in a bad cycle.
> >>>
> >>>     Is there any way you could share an -Xlog:gc* log with us? If
> >>> not, finding out if there was an allocation stall or a “Metadata GC
> >>> Clear Soft References” GC just before everything went bad, might
> >>> help with figuring out what went wrong. Something that did happen
> >>> between JDK 15 and JDK 17 is dynamic thread selection for the GC. I
> >>> wonder if that has somehow led to a stall, which you wouldn’t see in
> >>> JDK 15 with the more static thread selection. If so, then maybe
> >>> setting a SoftMaxHeapSize lower than the MaxHeapSize might help
> >>> avoid stalls in the first place, due to fluctuations in the
> >>> workload, or alternatively turning off UseDynamicNumberOfGCThreads
> >>> and setting ConcGCThreads high enough for allocation stalls not to happen.
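> >>>
> >>>     For example (illustrative values only, not tuned recommendations):
> >>>
> >>>     java -XX:+UseZGC -Xmx20g -XX:SoftMaxHeapSize=16g -jar app.jar
> >>>     java -XX:+UseZGC -XX:-UseDynamicNumberOfGCThreads -XX:ConcGCThreads=8 -jar app.jar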
> >>>
> >>>     Anyway, figuring out what kind of GC triggered right before the
> >>> incident would help a lot I think, to figure out if forced soft ref
> >>> clearing is indeed what instigated this, or if this is all just a
> >>> red herring.
> >>>
> >>>     Thanks,
> >>>     /Erik
> >>>
> >>>     From: Anatoly Deyneka <adeyneka at gmail.com>
> >>>     Sent: Tuesday, 9 November 2021 14:13
> >>>     To: Erik Osterlund <erik.osterlund at oracle.com>
> >>>     Cc: zgc-dev at openjdk.java.net
> >>>     Subject: [External] : Re: ZGC unstable behaviour on java 17
> >>>
> >>>     Hi Erik,
> >>>
> >>>     We probably hadn't reached this condition on Java 15.
> >>>     We see a big difference in metaspace allocation starting from
> >>> Java 16.
> >>>     The behaviour does not change with any
> >>> -XX:MetaspaceReclaimPolicy option.
> >>>
> >>>     Here is all our knowledge about the problem:
> >>>
> >>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md
> >>>
> >>>
> >>>     Unfortunately we can reproduce this issue only on our production
> >>> platform.
> >>>     If we can provide more details or logs, please let me know.
> >>>
> >>>     Regards,
> >>>     Anatoly Deyneka
> >>>
> >>>     On Mon, 8 Nov 2021 at 16:46, Erik Osterlund
> >>> <erik.osterlund at oracle.com> wrote:
> >>>     Hi Anatoly,
> >>>
> >>>     Thanks for reporting this. This looks rather odd indeed.
> >>> Unfortunately, the attached images have been stripped from your
> >>> original email.
> >>>
> >>>     It’s particularly interesting that you see this behaviour when
> >>> migrating from JDK 15 to 17. There have been a few bug fixes, but
> >>> nothing I can easily imagine causing this. I have been surprised
> >>> before, though. One notable thing that has changed in the class
> >>> unloading path is the elastic metaspace. I see that you use the
> >>> -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in
> >>> JDK 15. This makes me wonder whether this also reproduces without that
> >>> option, i.e. when the way you run the workload is more similar to JDK 15.
> >>>
> >>>     Anyway, is there any way for me to get access to try to
> >>> reproduce your issue?
> >>>
> >>>     Thanks,
> >>>     /Erik
> >>>
> >>>>     On 4 Nov 2021, at 17:00, Anatoly Deyneka
> >>>> <adeyneka at gmail.com> wrote:
> >>>>
> >>>>     We are running a GC-intensive application with an allocation
> >>>> rate of up to 1.5 GB/s (4k requests/sec) on a 32-vCPU instance
> >>>> (-Xms20G -Xmx20G).
> >>>>     openjdk 17 2021-09-14
> >>>>     OpenJDK Runtime Environment (build 17+35-2724)
> >>>>     OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
> >>>>
> >>>>     During migration from Java 15 to Java 17 we see unstable
> >>>> behaviour: ZGC goes into an infinite loop and doesn't return to the
> >>>> normal state.
> >>>>     We see the following behaviour:
> >>>>     - it runs perfectly (~30% better than on Java 15): 5-7 GB used
> >>>> memory with 16% CPU utilization
> >>>>     - something happens: CPU utilization jumps to 40%, used memory
> >>>> jumps to 19 GB, and ZGC cycle times increase tenfold
> >>>>     - sometimes the app works for about 30 min before the crash,
> >>>> sometimes it crashes instantly
> >>>>     - if it's still alive, memory is not released and the app runs
> >>>> with 18-20 GB used
> >>>>     - we see allocation stalls
> >>>>     - we see the following GC stats
> >>>>     [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong References 25781.928ms
> >>>>     [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink 14280.772 / 25769.511   1126.563 / 25769.511   217.882 / 68385.750   217.882 / 68385.750   ms
> >>>>     - we see the JVM start to massively unload classes (the
> >>>> jvm_classes_unloaded_total counter keeps increasing)
> >>>>     - we’ve managed to profile the JVM in such an ‘unstable’ state
> >>>> and see a ZTask consuming a lot of CPU doing the
> >>>> CompiledMethod::unload_nmethod_caches(bool) job. See attached image.
> >>>>
> >>>>     We played with options:
> >>>>     -XX:ZCollectionInterval=5
> >>>>     -XX:SoftMaxHeapSize=12G
> >>>>     -XX:ZAllocationSpikeTolerance=4
> >>>>     -XX:MetaspaceReclaimPolicy=aggressive
> >>>>
> >>>>     But it doesn't help; it just postpones the event.
> >>>>     Only if we redirect traffic to another node does it return to normal.
> >>>>
> >>>>     Regards,
> >>>>     Anatoly Deyneka
> >>>>
> >>>>     [image: image.png]
> >>>>     [image: image.png]
> >>>>     [image: image.png]
> >>
> >
>
>


More information about the zgc-dev mailing list