[External] : Re: ZGC unstable behaviour on java 17

Stefan Karlsson stefan.karlsson at oracle.com
Thu Jan 27 08:49:54 UTC 2022


Hi Anatoly,

JDK 17.0.2 has now been released, and it contains the fix I mentioned 
below. I hope this resolves the issues you were seeing. If you get 
around to retesting with the latest binaries, feel free to report back 
and tell us whether the problem went away.

Thanks,
StefanK

On 2021-11-19 11:45, Anatoly Deyneka wrote:
> Thank you,
>
> I'm surprised how fast it has been fixed.
> Looking forward to checking it out.
>
> Anatoly
>
>
> On Thu, 18 Nov 2021 at 13:30, Stefan Karlsson 
> <stefan.karlsson at oracle.com> wrote:
>
>     Hi Anatoly,
>
>     We've identified a 10-year-old bug in the inline cache cleaning code.
>     See this pull request:
>     https://github.com/openjdk/jdk/pull/6450
>
>     It's likely that this is the cause of the problems you are seeing in
>     your workload. Note that this bug is particularly bad for concurrent
>     class unloading GCs (ZGC, Shenandoah), while the problem seems less
>     severe for stop-the-world class unloading GCs (G1, Parallel, Serial).
>
>     The plan right now is to get this into JDK 18, and then backport it
>     to JDK 17.
>
>     Thanks again for reporting this!
>
>     StefanK
>
>     On 2021-11-12 16:46, Stefan Karlsson wrote:
>     > Hi Anatoly,
>     >
>     > On 2021-11-11 11:19, Anatoly Deyneka wrote:
>     >> It's great you can reproduce it.
>     >> Is there anything else we can help you with?
>     >
>     > I think we have everything we need for now. The hints about the
>     > problem with refill_ic_stubs were great, and that's what we are
>     > currently investigating on our side.
>     >
>     > Thanks,
>     > StefanK
>     >
>     >>
>     >>     On that note, it isn’t clear to me if the 2 orders of magnitude
>     >>     higher metaspace allocation rate was observed before or after
>     >>     the JSON issues were addressed, as you described.
>     >>
>     >> Before the fix of ObjectMapper.copy() we had a 250MB/s allocation
>     >> rate and OutOfMemory(metaspace).
>     >> BTW with shenandoah we have about 6MB/s with rare spikes up to
>     >> 100MB/s.
>     >>
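>     >> For context, the kind of change behind that fix was to stop calling
>     >> ObjectMapper.copy() per request and to reuse a single configured
>     >> mapper instead. A minimal sketch of the pattern (the class and field
>     >> names here are illustrative, not our actual code):
>     >>
>     >>     import com.fasterxml.jackson.databind.ObjectMapper;
>     >>
>     >>     public class BeanCodec {
>     >>         public static class SomeBean {
>     >>             public int id;
>     >>             public String name;
>     >>         }
>     >>
>     >>         // Reuse one ObjectMapper; it is thread-safe once configured,
>     >>         // so its generated (de)serializers are built once and cached.
>     >>         private static final ObjectMapper MAPPER = new ObjectMapper();
>     >>
>     >>         // The old pattern, MAPPER.copy().readValue(...), created fresh
>     >>         // serializer caches (and, with Blackbird, fresh generated
>     >>         // classes) on every call, which is what churned metaspace.
>     >>         public static SomeBean[] read(byte[] bytes) throws Exception {
>     >>             return MAPPER.readValue(bytes, SomeBean[].class);
>     >>         }
>     >>     }
>     >>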
>     >> Regards,
>     >> Anatoly Deyneka
>     >>
>     >>
>     >>
>     >> On Wed, 10 Nov 2021 at 20:07, Stefan Karlsson
>     >> <stefan.karlsson at oracle.com> wrote:
>     >>
>     >>     Hi,
>     >>
>     >>     I've managed to reproduce the long Concurrent Classes Unlink
>     >>     phase by running an edited Blackbird microbenchmark.
>     >>
>     >>     First add an ObjectMapper#copy call to beanArrayDeser:
>     >>
>     >>     https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>     >>
>     >>     ---
>     >>     diff --git a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>     >>     index 3675286..af05d54 100644
>     >>     --- a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>     >>     +++ b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>     >>     @@ -80,7 +80,7 @@ public abstract class BaseBenchmark {
>     >>
>     >>          @Benchmark
>     >>          public SomeBean[] beanArrayDeser() throws Exception {
>     >>     -        return mapper.readValue(beansBytes, SomeBean[].class);
>     >>     +        return mapper.copy().readValue(beansBytes, SomeBean[].class);
>     >>          }
>     >>
>     >>          @Benchmark
>     >>     ---
>     >>
>     >>     Then change the ICBuffer size to 100, and run the following:
>     >>     java -Xlog:gc* -Xmx20g -XX:+UseZGC -XX:MetaspaceReclaimPolicy=none -XX:-ZProactive -jar target/benchmarks.jar
>     >>
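>     >> (Changing the ICBuffer size to 100 presumably means editing the same
>     >> StubQueue allocation shown in the icBuffer.cpp patch quoted below,
>     >> but with a deliberately tiny size instead of the default 10*K, e.g.:
>     >>
>     >>     _buffer = new StubQueue(new ICStubInterface, 100, InlineCacheBuffer_lock, "InlineCacheBuffer");
>     >>
>     >> so that the pathological refill behaviour triggers quickly.)
>     >>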
>     >>     First run:
>     >>     [180,168s][info][gc,stats    ]    Subphase: Concurrent Classes Unlink    495,386 / 495,386  7002,041 / 24368,855  7002,041 / 24368,855  7002,041 / 24368,855  ms
>     >>
>     >>     Second run:
>     >>     [140,170s][info][gc,stats    ]    Subphase: Concurrent Classes Unlink    67606,542 / 134428,068  67606,542 / 134428,068  67606,542 / 134428,068  67606,542 / 134428,068  ms
>     >>
>     >>     StefanK
>     >>
>     >>     On 2021-11-10 17:18, Erik Osterlund wrote:
>     >>>     Hi Anatoly,
>     >>>
>     >>>     I think that we can now see why a large cluster of classes gets
>     >>> unloaded, and, as you say, possibly control that, as a workaround,
>     >>> with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question
>     >>> is what it is that makes the unloading take so long.
>     >>>
>     >>>     One theory is that it is related to metaspace uncommitting being
>     >>> more aggressive. It seemed, though, like you had tried the more
>     >>> balanced mode and it was still an issue, if I understood you correctly.
>     >>>
>     >>>     The second theory would be that it is something in the code
>     >>> cache unloading that takes a long time. It seems like, when you
>     >>> introspected it, the flame graph suggested that it is the code cache
>     >>> unloading that takes long, so this seems like the more likely
>     >>> explanation. I can see from the GC logs that you have a *lot* of
>     >>> nmethods being loaded, and many thousands of them get unloaded,
>     >>> which leads to a huge inline cache cleaning task, unlinking all of
>     >>> these thousands of nmethods. This inline cache cleaning fills up a
>     >>> buffer (ICBuffer) that is currently 10 KB large (rather small, and
>     >>> it has not changed since ancient times; perhaps it is time to change
>     >>> it). When this buffer fills up, you need to trigger an IC buffer
>     >>> refilling safepoint, as you have observed in the logs. The fact
>     >>> that loads of these safepoints trigger incrementally seems to
>     >>> suggest that, for this large number of nmethods being unloaded,
>     >>> the buffer size is too small. When it is way too small, the code
>     >>> cache walk during the unlinking phase becomes quadratic in the
>     >>> number of nmethods. And you have *loads* of nmethods. So that could
>     >>> explain why the code cache unloading takes so long.
>     >>>
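>     >>>     (A quick way to see how often those refill safepoints fire is to
>     >>> log safepoints alongside the GC log, for example with something like
>     >>>
>     >>>     java -Xlog:gc*,safepoint:file=gc.log:time,uptime ... -jar benchmarks.jar
>     >>>
>     >>> where the "..." stands for the rest of your flags. If I remember
>     >>> correctly the refills show up as ICBufferFull VM operations, but
>     >>> verify against your own output.)
>     >>>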
>     >>>     In the solution space of scaling class unloading to these
>     >>> large levels, the long-term fix is probably to remove inline caches
>     >>> completely, which I have worked on a bit, but that is not going to
>     >>> happen overnight. A simpler fix that might actually happen overnight
>     >>> is to increase the size of the IC buffer, like this:
>     >>>
>     >>>     diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp
>     >>>     index 1da48f0bd13..382fa8650c5 100644
>     >>>     --- a/src/hotspot/share/code/icBuffer.cpp
>     >>>     +++ b/src/hotspot/share/code/icBuffer.cpp
>     >>>     @@ -140,7 +140,7 @@ void ICStub::print() {
>     >>>
>     >>>      void InlineCacheBuffer::initialize() {
>     >>>        if (_buffer != NULL) return; // already initialized
>     >>>     -  _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer");
>     >>>     +  _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer");
>     >>>        assert (_buffer != NULL, "cannot allocate InlineCacheBuffer");
>     >>>      }
>     >>>
>     >>>     That would probably bring down the time spent doing nmethod
>     >>> unlinking by a *lot*. If you know how to build the JDK, that could
>     >>> be an easy patch to try out, to significantly improve the
>     >>> scalability of code cache unlinking.
>     >>>
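>     >>>     For reference, building a locally patched JDK is usually along
>     >>> these lines (the boot JDK path is just an example):
>     >>>
>     >>>     bash configure --with-boot-jdk=/path/to/jdk-17
>     >>>     make images
>     >>>     ./build/*/images/jdk/bin/java -version
>     >>>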
>     >>>     However, one relevant question is also whether this is really a
>     >>> scalability problem in the GC unloading path, or whether the real
>     >>> culprit is elsewhere. The fact that you see two orders of magnitude
>     >>> higher allocation rates in metaspace (!!!) when upgrading from 15 to
>     >>> 17 might explain why the unloading code suddenly gets scalability
>     >>> stressed so hard in the first place. And perhaps whatever caused
>     >>> that 2 orders of magnitude increase is the real culprit here.
>     >>> Having said that, we might want to improve on the class unloading
>     >>> scalability anyway, but I do wonder why this is being stressed at
>     >>> all, and why there are so many dying classes and nmethods. On that
>     >>> note, it isn’t clear to me if the 2 orders of magnitude higher
>     >>> metaspace allocation rate was observed before or after the JSON
>     >>> issues were addressed, as you described.
>     >>>
>     >>>     /Erik
>     >>>
>     >>>     From: Anatoly Deyneka <adeyneka at gmail.com>
>     >>>     Sent: Wednesday, 10 November 2021 14:04
>     >>>     To: Erik Osterlund <erik.osterlund at oracle.com>
>     >>>     Cc: Thomas Stüfe <thomas.stuefe at gmail.com>; zgc-dev at openjdk.java.net
>     >>>     Subject: Re: [External] : Re: ZGC unstable behaviour on java 17
>     >>>
>     >>>     Thank you Erik,
>     >>>
>     >>>     Do you have an understanding of how it can be fixed?
>     >>>     SoftRefLRUPolicyMSPerMB is a workaround, not a fix.
>     >>>     We have Shenandoah working with default settings. It unloads
>     >>> classes without a performance regression.
>     >>>     BTW, ZGC can also return to a normal state if we temporarily
>     >>> reduce the allocation rate by redirecting new users to another node
>     >>> (old users still stay on the same node).
>     >>>     We did that many times when we saw such behaviour. ZGC returned
>     >>> to a normal state quite fast.
>     >>>
>     >>>     I guess ZGC does the slow class unloading job in the same
>     >>> threads, which affects the overall performance of the GC.
>     >>>     That explains why ZGC cycle times are dramatically increased.
>     >>>
>     >>>     According to Risks and Assumptions#Uncommit speed
>     >>> (https://openjdk.java.net/jeps/387):
>     >>>     "Metaspace reclamation can happen during a garbage-collection
>     >>> pause, so this could be a problem.
>     >>>     We haven’t observed this problem so far, but if uncommit times
>     >>> become an issue then we could offload the uncommitting work to a
>     >>> separate thread so that it could be done independently of GC pauses."
>     >>>
>     >>>     We observed and fixed a class allocation problem in some
>     >>> scenarios using jackson+blackbird.
>     >>>     So we can try to reproduce this issue.
>     >>>
>     >>>     Regards,
>     >>>     Anatoly Deyneka
>     >>>
>     >>>     On Wed, 10 Nov 2021 at 00:23, Erik Osterlund
>     >>> <erik.osterlund at oracle.com> wrote:
>     >>>     Hi Anatoly,
>     >>>
>     >>>     Thank you for the additional details. Very interesting indeed.
>     >>> Everything seems to run well until you get an allocation stall due
>     >>> to class unloading taking a long time. The very same GC where the
>     >>> first stall happens nuked quite a lot of soft references, while
>     >>> previous GCs did not. It really looks like a clear pattern wrt soft
>     >>> reference count and the class unloading headache.
>     >>>
>     >>>     This seems to support the original intuition that the lambda
>     >>> form cache might have caused excessive class unloading after lots of
>     >>> soft references get cleared.
>     >>>
>     >>>     The GC itself didn’t explicitly clear all soft references, but
>     >>> they might have timed out at the same time. By setting
>     >>> SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref
>     >>> clearing won’t get triggered. That might make things better for you.
>     >>>
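>     >>>     Something along the lines of
>     >>>
>     >>>     -XX:SoftRefLRUPolicyMSPerMB=1000000
>     >>>
>     >>> should do it; the value is purely illustrative (the default is, I
>     >>> believe, 1000 ms per MB of free heap), the point is just to make the
>     >>> LRU timeout so large that the policy effectively never clears them.
>     >>>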
>     >>>     Having said that, it seems like this cache really shouldn’t be
>     >>> using soft references if the result is this horrible when they get
>     >>> cleared. Unless, of course, this is a red herring.
>     >>>
>     >>>     Thanks,
>     >>>     /Erik
>     >>>
>     >>>
>     >>>     On 9 Nov 2021, at 17:39, Anatoly Deyneka
>     >>> <adeyneka at gmail.com> wrote:
>     >>>     
>     >>>     Please find the log here:
>     >>>
>     >>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0
>     >>>
>     >>>
>     >>>     - Maybe try without reflection inflation? E.g.
>     >>> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit
>     >>> the generation of reflection class loaders and rule out that they
>     >>> are the problem.
>     >>>     We would like to keep inflation. We use a lot of reflection.
>     >>>     - Does it happen with other GCs?
>     >>>     We tried Shenandoah today and it worked fine (just 1 stall for
>     >>> the whole day).
>     >>>     - Elastic Metaspace did away with huge pages in Metaspace for
>     >>> several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?
>     >>>     No, we didn't use this flag.
>     >>>     - Can you switch metaspace reclamation off using
>     >>> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if
>     >>> yes, there may be something wrong with the reclamation algorithm.
>     >>>     We can try this.
>     >>>     - If the VM becomes unresponsive, could you also give us the
>     >>> output of `jcmd VM.metaspace`? Also, if possible, `jcmd VM.metaspace
>     >>> show-loaders show-classes`, though with 1M classes the output may be
>     >>> large.
>     >>>     Good point. We will analyze it.
>     >>>
>     >>>     Regards,
>     >>>     Anatoly Deyneka
>     >>>
>     >>>     On Tue, 9 Nov 2021 at 18:26, Thomas Stüfe
>     >>> <thomas.stuefe at gmail.com> wrote:
>     >>>     In addition to Erik's proposal, some more remarks from the side
>     >>> (I'll just lurk in the background otherwise):
>     >>>
>     >>>     - Maybe try without reflection inflation? E.g.
>     >>> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit
>     >>> the generation of reflection class loaders and rule out that they
>     >>> are the problem.
>     >>>
>     >>>     - Does it happen with other GCs?
>     >>>
>     >>>     - Elastic Metaspace did away with huge pages in Metaspace for
>     >>> several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?
>     >>>
>     >>>     - Can you switch metaspace reclamation off using
>     >>> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if
>     >>> yes, there may be something wrong with the reclamation algorithm.
>     >>>
>     >>>     - If the VM becomes unresponsive, could you also give us the
>     >>> output of `jcmd VM.metaspace`? Also, if possible, `jcmd VM.metaspace
>     >>> show-loaders show-classes`, though with 1M classes the output may be
>     >>> large (example invocations below).
>     >>>
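>     >>>     Roughly, with <pid> standing in for the JVM's process id (the
>     >>> redirect to a file is optional, but probably wise given the size):
>     >>>
>     >>>     jcmd <pid> VM.metaspace
>     >>>     jcmd <pid> VM.metaspace show-loaders show-classes > metaspace.txt
>     >>>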
>     >>>     Thanks, Thomas
>     >>>
>     >>>     On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund
>     >>> <erik.osterlund at oracle.com> wrote:
>     >>>     Hi Anatoly,
>     >>>
>     >>>     Thank you for the additional details. This is very helpful. I
>     >>> talked to Per a bit about this, and we feel like understanding what
>     >>> increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s is
>     >>> rather likely to lead us to the problem. That’s more than 3 orders
>     >>> of magnitude higher allocation rate of metaspace memory, which might
>     >>> explain why class unloading is taking much longer than usual.
>     >>>
>     >>>     The culprit that is being repeatedly allocated is rather likely
>     >>> to also be repeatedly unloaded, or you would hit an out of memory
>     >>> issue. And the log snippet you showed shows that there is a bunch
>     >>> of LambdaForms being unloaded. Interestingly enough, the
>     >>> LambdaFormEditor has a SoftReference based cache, so that you don’t
>     >>> have to create a new LambdaForm when there is one in the cache.
>     >>> This soft reference based policy might not work very well if all
>     >>> SoftReferences are cleared by the GC, because then it would seem
>     >>> like a lot more LambdaForms would have to be created from scratch,
>     >>> as opposed to being reused from the cache.
>     >>>
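>     >>>     To illustrate why cleared soft references hurt so much here, a
>     >>> soft-reference cache behaves roughly like the sketch below (a
>     >>> simplified stand-in, not the actual LambdaFormEditor code): once the
>     >>> GC clears the referents, every lookup misses and the expensive
>     >>> values have to be rebuilt from scratch.
>     >>>
>     >>>     import java.lang.ref.SoftReference;
>     >>>     import java.util.Map;
>     >>>     import java.util.concurrent.ConcurrentHashMap;
>     >>>     import java.util.function.Function;
>     >>>
>     >>>     // Entries live only as long as the GC keeps the softly
>     >>>     // reachable values around.
>     >>>     class SoftCache<K, V> {
>     >>>         private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
>     >>>
>     >>>         V get(K key, Function<K, V> compute) {
>     >>>             SoftReference<V> ref = map.get(key);
>     >>>             V value = (ref != null) ? ref.get() : null;  // null if the GC cleared it
>     >>>             if (value == null) {
>     >>>                 value = compute.apply(key);              // expensive rebuild on every miss
>     >>>                 map.put(key, new SoftReference<>(value));
>     >>>             }
>     >>>             return value;
>     >>>         }
>     >>>     }
>     >>>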
>     >>>     This might explain why it takes a while before this starts
>     >>> happening. The interesting thing is figuring out what could have led
>     >>> to all the soft references being cleared. This could be because of
>     >>> metaspace exhaustion, or due to allocation stalls. In the charts we
>     >>> can see that the allocation rate was at its highest point right
>     >>> before the incident. So without seeing the log, it seems like you
>     >>> could have gotten an allocation stall, which triggered a full
>     >>> collection clearing all soft references, which suddenly leads to
>     >>> class unloading taking much longer. If class unloading takes much
>     >>> longer, then you are quite likely to reach the same situation
>     >>> repeatedly, in a bad cycle.
>     >>>
>     >>>     Is there any way you could share an -Xlog:gc* log with us? If
>     >>> not, finding out if there was an allocation stall or a “Metadata GC
>     >>> Clear Soft References” GC just before everything went bad might
>     >>> help with figuring out what went wrong. Something that did happen
>     >>> between JDK 15 and JDK 17 is dynamic thread selection for the GC. I
>     >>> wonder if that has somehow led to a stall, which you wouldn’t see in
>     >>> JDK 15 with the more static thread selection. If so, then maybe
>     >>> setting a SoftMaxHeapSize lower than the MaxHeapSize might help
>     >>> avoid stalls in the first place, due to fluctuations in the
>     >>> workload, or alternatively turning off UseDynamicNumberOfGCThreads
>     >>> and setting ConcGCThreads high enough for allocation stalls not to
>     >>> happen.
>     >>>
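>     >>>     Concretely, that would mean something along these lines (the
>     >>> numbers are just illustrative for a 20 GB heap, not a recommendation):
>     >>>
>     >>>     -Xmx20g -XX:SoftMaxHeapSize=16g
>     >>>
>     >>> or
>     >>>
>     >>>     -Xmx20g -XX:-UseDynamicNumberOfGCThreads -XX:ConcGCThreads=8
>     >>>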
>     >>>     Anyway, figuring out what kind of GC triggered right
>     before the
>     >>> incident would help a lot I think, to figure out if forced
>     soft ref
>     >>> clearing is indeed what instigated this, or if this is all just a
>     >>> red herring.
>     >>>
>     >>>     Thanks,
>     >>>     /Erik
>     >>>
>     >>>     From: Anatoly Deyneka <adeyneka at gmail.com>
>     >>>     Sent: Tuesday, 9 November 2021 14:13
>     >>>     To: Erik Osterlund <erik.osterlund at oracle.com>
>     >>>     Cc: zgc-dev at openjdk.java.net
>     >>>     Subject: [External] : Re: ZGC unstable behaviour on java 17
>     >>>
>     >>>     Hi Erik,
>     >>>
>     >>>     Probably we haven't reached this condition on java 15.
>     >>>     We see a big difference in metaspace allocation starting from
>     >>> java 16.
>     >>>     The behaviour does not change for any option of
>     >>> -XX:MetaspaceReclaimPolicy.
>     >>>
>     >>>     Here is all our knowledge about the problem:
>     >>>
>     >>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md
>     >>>
>     >>>
>     >>>     Unfortunately we can reproduce this issue only on our production
>     >>> platform.
>     >>>     If we can provide more details or logs please let me know.
>     >>>
>     >>>     Regards,
>     >>>     Anatoly Deyneka
>     >>>
>     >>>     On Mon, 8 Nov 2021 at 16:46, Erik Osterlund
>     >>> <erik.osterlund at oracle.com> wrote:
>     >>>     Hi Anatoly,
>     >>>
>     >>>     Thanks for reporting this. This looks rather odd indeed.
>     >>> Unfortunately, the attached images have been stripped from your
>     >>> original email.
>     >>>
>     >>>     It’s particularly interesting that you see this behaviour when
>     >>> migrating from JDK 15 to 17. There have been a few bug fixes, but
>     >>> nothing I can easily imagine causing this. I have been surprised
>     >>> before, though. One notable thing in the class unloading path that
>     >>> has changed, though, is the elastic metaspace. I see that you use
>     >>> the -XX:MetaspaceReclaimPolicy=aggressive option, which did not
>     >>> exist in JDK 15. This makes me wonder if this reproduces also
>     >>> without that option, i.e. when the way that you run the workload is
>     >>> more similar.
>     >>>
>     >>>     Anyway, is there any way for me to get access to try to
>     >>> reproduce your issue?
>     >>>
>     >>>     Thanks,
>     >>>     /Erik
>     >>>
>     >>>>     On 4 Nov 2021, at 17:00, Anatoly Deyneka
>     >>>> <adeyneka at gmail.com> wrote:
>     >>>>
>     >>>>     We are running a GC-intensive application with an allocation
>     >>>> rate of up to 1.5GB/s (4k requests/sec) on a 32-vCPU instance
>     >>>> (-Xms20G -Xmx20G).
>     >>>>     openjdk 17 2021-09-14
>     >>>>     OpenJDK Runtime Environment (build 17+35-2724)
>     >>>>     OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
>     >>>>
>     >>>>     During migration from java 15 to java 17 we see unstable
>     >>>> behavior where zgc goes into an infinite loop and doesn't return
>     >>>> to the normal state.
>     >>>>     We see the following behavior:
>     >>>>     - it runs perfectly (~30% better than on java 15) - 5-7GB used
>     >>>> memory with 16% CPU utilization
>     >>>>     - something happens and CPU utilization jumps to 40%, used
>     >>>> memory jumps to 19GB, and zgc cycle times increase 10x
>     >>>>     - sometimes the app works for about 30min before the crash,
>     >>>> sometimes it crashes instantly
>     >>>>     - if it's still alive, memory is not released and the app works
>     >>>> with 18-20GB used memory
>     >>>>     - we see allocation stalls
>     >>>>     - we see the following GC stats:
>     >>>>     [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong References 25781.928ms
>     >>>>     [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink    14280.772 / 25769.511  1126.563 / 25769.511   217.882 / 68385.750   217.882 / 68385.750   ms
>     >>>>     - we see the JVM start massively unloading classes (the
>     >>>> jvm_classes_unloaded_total counter keeps increasing)
>     >>>>     - we’ve managed to profile the JVM in such an ‘unstable’ state
>     >>>> and see ZTask consuming a lot of cpu doing the
>     >>>> CompiledMethod::unload_nmethod_caches(bool) job. See attached image.
>     >>>>
>     >>>>     We played with options:
>     >>>>     -XX:ZCollectionInterval=5
>     >>>>     -XX:SoftMaxHeapSize=12G
>     >>>>     -XX:ZAllocationSpikeTolerance=4
>     >>>>     -XX:MetaspaceReclaimPolicy=aggressive
>     >>>>
>     >>>>     But it doesn't help; it just postpones this event.
>     >>>>     Only if we redirect traffic to another node does it return to
>     >>>> normal.
>     >>>>
>     >>>>     Regards,
>     >>>>     Anatoly Deyneka
>     >>>>
>     >>>>     [image: image.png]
>     >>>>     [image: image.png]
>     >>>>     [image: image.png]
>     >>
>     >
>


More information about the zgc-dev mailing list