[External] : Re: ZGC unstable behaviour on java 17

Wed Nov 10 16:18:23 UTC 2021

Hi Anatoly,

I think that we can now see why a large cluster of classes get unloaded, and as you say possibly control that, as a workaround, with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question is what it is that makes the unloading take so long.

One theory is that it is related to metaspace uncommitting being more aggressive. It seemed though like you had tried the more balanced mode and it still was an issue, if I understood you correctly.

The second theory would be that it is something in the code cache unloading that takes a long time. And it seems like when you introspected it, the flame graph suggests that it is the code cache unloading that takes long. So this seems like a more likely explanation. And I can see from the GC logs that you have a *lot* of nmethods being loaded. And many thousands of them get unloaded, which leads to a huge inline cache cleaning task, unlinking all these thousands of nmethods. This inline cache cleaning fills up a buffer (ICBuffer) that is currently 10 KB large (rather small and has not changed since ancient times, perhaps it is time to change it). When this buffer fills up, you need to trigger an IC buffer refilling safepoint, as you have observed in the logs. So the fact that loads of these safepoints incrementally trigger, seems to suggest that for this large number of nmethods being unloaded, and the buffer size is too small. When it is way too small, the code cache walk during the unlinking phase becomes quadratic with the number of nmethods. And you have *loads* of nmethods. So that could explain why it takes so long with the code cache unloading.

In the solution space of scalability of class unloading to these large levels, the long-term fix is probably to remove inline caches completely, which I have worked on a bit, but it is not going to happen over night. A simpler fix that might actually happen over night, is to increase the size of the IC buffer, like this:

diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp
index 1da48f0bd13..382fa8650c5 100644
--- a/src/hotspot/share/code/icBuffer.cpp
+++ b/src/hotspot/share/code/icBuffer.cpp
@@ -140,7 +140,7 @@ void ICStub::print() {

void InlineCacheBuffer::initialize() {
   if (_buffer != NULL) return; // already initialized
-  _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer");
+  _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer");
   assert (_buffer != NULL, "cannot allocate InlineCacheBuffer");
}

That would probably bring down the time spent doing nmethod unlinking by a *lot*. If you know how to build the JDK, that could be an easy patch to try out, to significantly improve the scalability of code cache unlinking.

However, one relevant question is also if this is really a scalability problem in the GC unloading path, or if the real culprit is elsewhere. The fact that you see two orders of magnitude higher allocation rates in metaspace (!!!) when upgrading from 15 to 17, might explain why the unloading code suddenly gets scalability stressed so hard in the first place. And perhaps whatever caused that 2 orders of magnitude increase, is the real culprit here. Having said that, we might want to improve on the class unloading scalabiliy anyway, but I do wonder why this is being stressed at all, and why there are so many dying classes and nmethods. On that note, it isn’t clear to me if the 2 orders of magnitude higher metaspace allocation rate was observed before or after the JSON issues were addressed, as you described.

/Erik

From: Anatoly Deyneka <adeyneka at gmail.com>
Sent: Wednesday, 10 November 2021 14:04
To: Erik Osterlund <erik.osterlund at oracle.com>
Cc: Thomas Stüfe <thomas.stuefe at gmail.com>; zgc-dev at openjdk.java.net
Subject: Re: [External] : Re: ZGC unstable behaviour on java 17

Thank you Erik,

Do you have a understanding of how it can be fixed?
SoftRefLRUPolicyMSPerMB - is a workaround and not a fix.
We have working shenandoah with default settings. It unloads classes without performance regression.
BTW ZGC also can return to normal state if we temporary reduce allocation rate by redirecting new users to another node(old users still stay on the same node).
We did that many times when we saw such behaviour. ZGC returned to normal state quite fast.

I guess ZGC does the slow unloading class job in the same threads that affects overall performance of GC.
It explains why zgc cycles time is dramatically increased.

According to Risks and Assumptions#Uncommit speed (https://openjdk.java.net/jeps/387):
"Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem.
We haven’t observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses."

We observed and fixed class allocation problem in some scenarios of using jackson+blackbird.
So we can try to reproduce this issue.

Regards,
Anatoly Deyneka

On Wed, 10 Nov 2021 at 00:23, Erik Osterlund <erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com>> wrote:
Hi Anatoly,

Thank you for the additional details. Very interesting indeed. Everything seems to run well until you get an allocation stall due to class unloading taking a long time. The very same GC where the first stall happens, nuked quite a lot of soft references, while previous GCs did not. It really looks like a clear pattern wrt soft reference count and the class unloading headache.

This seems to support the original intuition that the lambda form cache might have caused excessive class unloading after lots of soft references get cleared.

The GC itself didn’t explicitly clear all soft references, but they might have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref clearing won’t get triggered. That might make things better for you.

Having said that, it seems like this cache really shouldn’t be using soft references if the result is this horrible when they get cleared. Unless of course that this is a red herring.

Thanks,
/Erik


On 9 Nov 2021, at 17:39, Anatoly Deyneka <adeyneka at gmail.com<mailto:adeyneka at gmail.com>> wrote:

Please find the log here:
https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0<https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0__;!!ACWV5N9M2RV99hQ!a6mYXhBdfLivRQIJkyn10Z2tlh6iDn_TfHQpHBldzNBIJexWK-nR7FeTgxiyA0IE_N4$>

- Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem.
we would like to go with inflation. We use a lot of reflection.
- Does it happen with other GCs?
we tried shenandoah today and it worked fine(just 1 stall for the whole day)
- Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?
no, we didn't use this flag
- Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm.
we can try this
- If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large.
good point. we will analize it.

Regards,
Anatoly Deyneka

On Tue, 9 Nov 2021 at 18:26, Thomas Stüfe <thomas.stuefe at gmail.com<mailto:thomas.stuefe at gmail.com>> wrote:
Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) :

- Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem.

- Does it happen with other GCs?

- Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ?

- Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm.

- If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large.

Thanks, Thomas

On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund <erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com>> wrote:
Hi Anatoly,

Thank you for the additional details. This is very helpful. I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That’s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual.

The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don’t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache.

This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle.

Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a “Metadata GC Clear Soft References” GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn’t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen.

Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring.

Thanks,
/Erik

From: Anatoly Deyneka <adeyneka at gmail.com<mailto:adeyneka at gmail.com>>
Sent: Tuesday, 9 November 2021 14:13
To: Erik Osterlund <erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com>>
Cc: zgc-dev at openjdk.java.net<mailto:zgc-dev at openjdk.java.net>
Subject: [External] : Re: ZGC unstable behaviour on java 17

Hi Erik,

Probably we haven't reached this condition on java 15.
We see a big difference in metaspace allocation starting from java 16.
The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy

Here is all our knowledge about the problem:
https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md<https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!a6mYXhBdfLivRQIJkyn10Z2tlh6iDn_TfHQpHBldzNBIJexWK-nR7FeTgxiy18HZddw$><https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$>

Unfortunately we can reproduce this issue only on our production platform.
If we can provide more details or logs please let me know.

Regards,
Anatoly Deyneka

On Mon, 8 Nov 2021 at 16:46, Erik Osterlund <erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com><mailto:erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com>>> wrote:
Hi Anatoly,

Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email.

It’s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar.

Anyway, is there any way for me to get access to try to reproduce your issue?

Thanks,
/Erik

> On 4 Nov 2021, at 17:00, Anatoly Deyneka <adeyneka at gmail.com<mailto:adeyneka at gmail.com><mailto:adeyneka at gmail.com<mailto:adeyneka at gmail.com>>> wrote:
>
> We are running a GC intensive application with allocation rate up to
> 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G).
> openjdk 17 2021-09-14
> OpenJDK Runtime Environment (build 17+35-2724)
> OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
>
> During migration from java 15 to java 17 we see unstable behavior when zgc
> goes to infinite loop and doesn't return back to the normal state.
> We see the following behavior:
> - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with
> 16% CPU utilization
> - something happens and CPU utilization jumps to 40%, used memory jumps to
> 19GB, zgc cycles time is increased in 10times
> - sometimes app works about 30min before the crash sometimes it's crashed
> instantly
> - if it's still alive memory is not released and the app works with used
> memory 18-20GB
> - we see allocation stalls
> - we see the following GC stats
> [17606.140s][info][gc,phases   ] GC(719) Concurrent Process Non-Strong
> References 25781.928ms
> [17610.181s][info][gc,stats    ] Subphase: Concurrent Classes Unlink
> 14280.772 / 25769.511  1126.563 / 25769.511   217.882 / 68385.750   217.882
> / 68385.750   ms
> -  we see JVM starts massively unload classes (jvm_classes_unloaded_total
> counter keep increasing) .
> - we’ve managed to profile JVM in such ‘unstable’ state and see ZTask
> consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool)
> job. See attached image.
>
> We played with options:
> -XX:ZCollectionInterval=5
> -XX:SoftMaxHeapSize=12G
> -XX:ZAllocationSpikeTolerance=4
> -XX:MetaspaceReclaimPolicy=aggressive
>
> But it doesn't help. it just postpones this event.
> Only if we redirect traffic to another node it goes to normal.
>
> Regards,
> Anatoly Deyneka
>
> [image: image.png]
> [image: image.png]
> [image: image.png]