From adeyneka at gmail.com Thu Nov 4 16:00:33 2021
From: adeyneka at gmail.com (Anatoly Deyneka)
Date: Thu, 4 Nov 2021 19:00:33 +0300
Subject: ZGC unstable behaviour on java 17
Message-ID:

We are running a GC-intensive application with an allocation rate of up to
1.5GB/s (4k requests/sec) on a 32-vcore instance (-Xms20G -Xmx20G).
openjdk 17 2021-09-14
OpenJDK Runtime Environment (build 17+35-2724)
OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)

During migration from Java 15 to Java 17 we see unstable behavior where ZGC
goes into an infinite loop and doesn't return to a normal state.
We see the following behavior:
- it runs perfectly (~30% better than on Java 15) - 5-7GB used memory with
16% CPU utilization
- something happens and CPU utilization jumps to 40%, used memory jumps to
19GB, and ZGC cycle times increase roughly 10x
- sometimes the app runs for about 30 min before the crash, sometimes it
crashes instantly
- if it's still alive, memory is not released and the app keeps running with
18-20GB used
- we see allocation stalls
- we see the following GC stats
[17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong
References 25781.928ms
[17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink
14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882
/ 68385.750 ms
- we see the JVM start to massively unload classes (the
jvm_classes_unloaded_total counter keeps increasing)
- we've managed to profile the JVM in such an "unstable" state and see ZTask
consuming a lot of CPU doing CompiledMethod::unload_nmethod_caches(bool)
work. See attached image.

We played with options:
-XX:ZCollectionInterval=5
-XX:SoftMaxHeapSize=12G
-XX:ZAllocationSpikeTolerance=4
-XX:MetaspaceReclaimPolicy=aggressive

But it doesn't help; it just postpones the event.
Only if we redirect traffic to another node does it return to normal.

Regards,
Anatoly Deyneka

[image: image.png]
[image: image.png]
[image: image.png]

From erik.osterlund at oracle.com Mon Nov 8 13:45:55 2021
From: erik.osterlund at oracle.com (Erik Osterlund)
Date: Mon, 8 Nov 2021 13:45:55 +0000
Subject: ZGC unstable behaviour on java 17
In-Reply-To:
References:
Message-ID: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com>

Hi Anatoly,

Thanks for reporting this. This looks rather odd indeed. Unfortunately,
the attached images have been stripped from your original email.

It's particularly interesting that you see this behaviour when migrating
from JDK 15 to 17. There have been a few bug fixes, but nothing I can
easily imagine causing this. I have been surprised before though. One
notable thing in the class unloading path that has changed, though, is the
elastic metaspace. I see that you use the
-XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK 15.
This makes me wonder whether this also reproduces without that option, i.e.
when the way you run the workload is more similar to how it ran on JDK 15.

Anyway, is there any way for me to get access to try to reproduce your
issue?

Thanks,
/Erik

> On 4 Nov 2021, at 17:00, Anatoly Deyneka wrote:
>
> We are running a GC intensive application with allocation rate up to
> 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G).
> openjdk 17 2021-09-14
> OpenJDK Runtime Environment (build 17+35-2724)
> OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
>
> During migration from java 15 to java 17 we see unstable behavior when zgc
> goes to infinite loop and doesn't return back to the normal state.
> We see the following behavior: > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > 16% CPU utilization > - something happens and CPU utilization jumps to 40%, used memory jumps to > 19GB, zgc cycles time is increased in 10times > - sometimes app works about 30min before the crash sometimes it's crashed > instantly > - if it's still alive memory is not released and the app works with used > memory 18-20GB > - we see allocation stalls > - we see the following GC stats > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > References 25781.928ms > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882 > / 68385.750 ms > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > counter keep increasing) . > - we?ve managed to profile JVM in such ?unstable? state and see ZTask > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > job. See attached image. > > We played with options: > -XX:ZCollectionInterval=5 > -XX:SoftMaxHeapSize=12G > -XX:ZAllocationSpikeTolerance=4 > -XX:MetaspaceReclaimPolicy=aggressive > > But it doesn't help. it just postpones this event. > Only if we redirect traffic to another node it goes to normal. > > Regards, > Anatoly Deyneka > > [image: image.png] > [image: image.png] > [image: image.png] From adeyneka at gmail.com Tue Nov 9 13:12:49 2021 From: adeyneka at gmail.com (Anatoly Deyneka) Date: Tue, 9 Nov 2021 16:12:49 +0300 Subject: ZGC unstable behaviour on java 17 In-Reply-To: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> Message-ID: Hi Erik, Probably we haven't reached this condition on java 15. We see a big difference in metaspace allocation starting from java 16. The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy Here is all our knowledge about the problem: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md Unfortunately we can reproduce this issue only on our production platform. If we can provide more details or logs please let me know. Regards, Anatoly Deyneka On Mon, 8 Nov 2021 at 16:46, Erik Osterlund wrote: > Hi Anatoly, > > Thanks for reporting this. This looks rather odd indeed. Unfortunately, > the attached images have been stripped from your original email. > > It?s particularly interesting that you see this behaviour when migrating > from JDK 15 to 17. There have been a few bug fixes, but nothing I can > easily imagine causing this. I have been surprised before though. One > notable thing in the class unloading path that has changed though, is the > elastic metaspace. I see that you use the > -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. > This makes me wonder if this reproduces also without that option, i.e. the > way that you run the workload is more similar. > > Anyway, is there any way for me to get access to try to reproduce your > issue? > > Thanks, > /Erik > > > On 4 Nov 2021, at 17:00, Anatoly Deyneka wrote: > > > > ?We are running a GC intensive application with allocation rate up to > > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). 
> > openjdk 17 2021-09-14 > > OpenJDK Runtime Environment (build 17+35-2724) > > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) > > > > During migration from java 15 to java 17 we see unstable behavior when > zgc > > goes to infinite loop and doesn't return back to the normal state. > > We see the following behavior: > > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > > 16% CPU utilization > > - something happens and CPU utilization jumps to 40%, used memory jumps > to > > 19GB, zgc cycles time is increased in 10times > > - sometimes app works about 30min before the crash sometimes it's crashed > > instantly > > - if it's still alive memory is not released and the app works with used > > memory 18-20GB > > - we see allocation stalls > > - we see the following GC stats > > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > > References 25781.928ms > > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 > 217.882 > > / 68385.750 ms > > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > > counter keep increasing) . > > - we?ve managed to profile JVM in such ?unstable? state and see ZTask > > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > > job. See attached image. > > > > We played with options: > > -XX:ZCollectionInterval=5 > > -XX:SoftMaxHeapSize=12G > > -XX:ZAllocationSpikeTolerance=4 > > -XX:MetaspaceReclaimPolicy=aggressive > > > > But it doesn't help. it just postpones this event. > > Only if we redirect traffic to another node it goes to normal. > > > > Regards, > > Anatoly Deyneka > > > > [image: image.png] > > [image: image.png] > > [image: image.png] > From erik.osterlund at oracle.com Tue Nov 9 14:44:23 2021 From: erik.osterlund at oracle.com (Erik Osterlund) Date: Tue, 9 Nov 2021 14:44:23 +0000 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> Message-ID: Hi Anatoly, Thank you for the additional details. This is very helpful. I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That?s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual. The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don?t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache. This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. 
So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle. Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a ?Metadata GC Clear Soft References? GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn?t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen. Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring. Thanks, /Erik From: Anatoly Deyneka Sent: Tuesday, 9 November 2021 14:13 To: Erik Osterlund Cc: zgc-dev at openjdk.java.net Subject: [External] : Re: ZGC unstable behaviour on java 17 Hi Erik, Probably we haven't reached this condition on java 15. We see a big difference in metaspace allocation starting from java 16. The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy Here is all our knowledge about the problem: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md Unfortunately we can reproduce this issue only on our production platform. If we can provide more details or logs please let me know. Regards, Anatoly Deyneka On Mon, 8 Nov 2021 at 16:46, Erik Osterlund > wrote: Hi Anatoly, Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email. It?s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar. Anyway, is there any way for me to get access to try to reproduce your issue? Thanks, /Erik > On 4 Nov 2021, at 17:00, Anatoly Deyneka > wrote: > > ?We are running a GC intensive application with allocation rate up to > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). > openjdk 17 2021-09-14 > OpenJDK Runtime Environment (build 17+35-2724) > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) > > During migration from java 15 to java 17 we see unstable behavior when zgc > goes to infinite loop and doesn't return back to the normal state. 
> We see the following behavior: > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > 16% CPU utilization > - something happens and CPU utilization jumps to 40%, used memory jumps to > 19GB, zgc cycles time is increased in 10times > - sometimes app works about 30min before the crash sometimes it's crashed > instantly > - if it's still alive memory is not released and the app works with used > memory 18-20GB > - we see allocation stalls > - we see the following GC stats > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > References 25781.928ms > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882 > / 68385.750 ms > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > counter keep increasing) . > - we?ve managed to profile JVM in such ?unstable? state and see ZTask > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > job. See attached image. > > We played with options: > -XX:ZCollectionInterval=5 > -XX:SoftMaxHeapSize=12G > -XX:ZAllocationSpikeTolerance=4 > -XX:MetaspaceReclaimPolicy=aggressive > > But it doesn't help. it just postpones this event. > Only if we redirect traffic to another node it goes to normal. > > Regards, > Anatoly Deyneka > > [image: image.png] > [image: image.png] > [image: image.png] From thomas.stuefe at gmail.com Tue Nov 9 15:25:54 2021 From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=) Date: Tue, 9 Nov 2021 16:25:54 +0100 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> Message-ID: Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) : - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. - Does it happen with other GCs? - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. Thanks, Thomas On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund wrote: > Hi Anatoly, > > Thank you for the additional details. This is very helpful. I talked to > Per a bit about this, and we feel like understanding what increased the > metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to > lead us to the problem. That?s more than 3 orders of magnitude higher > allocation rate of metaspace memory, which might explain why class > unloading is taking much longer than usual. > > The culprit that is being repeatedly allocated, is rather likely to also > be repeatedly unloaded, or you would hit an out of memory issue. And the > log snippet you showed, shows that there is a bunch of LambdaForms being > unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference > based cache, so that you don?t have to create a new LambdaForm, when there > is one in the cache. 
This soft reference based policy might not work very > well if all SoftReferences are cleared by the GC, because then it would > seem like a lot more LambdaForms would have to be created from scratch, as > opposed to being reused from the cache. > > This might explain why it takes a while before this starts happening. The > interesting thing is figuring out what could have led to all the soft > references being cleared. This could be because of metaspace exhaustion, or > due to allocation stalls. In the charts we can see that the allocation rate > was at the highest point, right before the incident. So without seeing the > log, it seems like you could have gotten an allocation stall, which > triggered a full collection clearing all soft references, which suddenly > leads to class unloading taking much longer. If class unloading takes much > longer, then you are quite likely to reach the same situation repeatedly, > in a bad cycle. > > Is there any way you could share an -Xlog:gc* log with us? If not, finding > out if there was an allocation stall or a ?Metadata GC Clear Soft > References? GC just before everything went bad, might help with figuring > out what went wrong. Something that did happen between JDK 15 and JDK 17 is > dynamic thread selection for the GC. I wonder if that has somehow led to a > stall, which you wouldn?t see in JDK 15 with the more static thread > selection. If so, then maybe setting a SoftMaxHeapSize lower than the > MaxHeapSize might help avoiding stall in the first place, due to > fluctuations in the workload, or alternatively turning off > UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for > allocation stalls not happen. > > Anyway, figuring out what kind of GC triggered right before the incident > would help a lot I think, to figure out if forced soft ref clearing is > indeed what instigated this, or if this is all just a red herring. > > Thanks, > /Erik > > From: Anatoly Deyneka > Sent: Tuesday, 9 November 2021 14:13 > To: Erik Osterlund > Cc: zgc-dev at openjdk.java.net > Subject: [External] : Re: ZGC unstable behaviour on java 17 > > Hi Erik, > > Probably we haven't reached this condition on java 15. > We see a big difference in metaspace allocation starting from java 16. > The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy > > Here is all our knowledge about the problem: > > https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md > < > https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$ > > > > Unfortunately we can reproduce this issue only on our production platform. > If we can provide more details or logs please let me know. > > Regards, > Anatoly Deyneka > > On Mon, 8 Nov 2021 at 16:46, Erik Osterlund > wrote: > Hi Anatoly, > > Thanks for reporting this. This looks rather odd indeed. Unfortunately, > the attached images have been stripped from your original email. > > It?s particularly interesting that you see this behaviour when migrating > from JDK 15 to 17. There have been a few bug fixes, but nothing I can > easily imagine causing this. I have been surprised before though. One > notable thing in the class unloading path that has changed though, is the > elastic metaspace. I see that you use the > -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. 
> This makes me wonder if this reproduces also without that option, i.e. the > way that you run the workload is more similar. > > Anyway, is there any way for me to get access to try to reproduce your > issue? > > Thanks, > /Erik > > > On 4 Nov 2021, at 17:00, Anatoly Deyneka adeyneka at gmail.com>> wrote: > > > > ?We are running a GC intensive application with allocation rate up to > > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). > > openjdk 17 2021-09-14 > > OpenJDK Runtime Environment (build 17+35-2724) > > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) > > > > During migration from java 15 to java 17 we see unstable behavior when > zgc > > goes to infinite loop and doesn't return back to the normal state. > > We see the following behavior: > > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > > 16% CPU utilization > > - something happens and CPU utilization jumps to 40%, used memory jumps > to > > 19GB, zgc cycles time is increased in 10times > > - sometimes app works about 30min before the crash sometimes it's crashed > > instantly > > - if it's still alive memory is not released and the app works with used > > memory 18-20GB > > - we see allocation stalls > > - we see the following GC stats > > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > > References 25781.928ms > > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 > 217.882 > > / 68385.750 ms > > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > > counter keep increasing) . > > - we?ve managed to profile JVM in such ?unstable? state and see ZTask > > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > > job. See attached image. > > > > We played with options: > > -XX:ZCollectionInterval=5 > > -XX:SoftMaxHeapSize=12G > > -XX:ZAllocationSpikeTolerance=4 > > -XX:MetaspaceReclaimPolicy=aggressive > > > > But it doesn't help. it just postpones this event. > > Only if we redirect traffic to another node it goes to normal. > > > > Regards, > > Anatoly Deyneka > > > > [image: image.png] > > [image: image.png] > > [image: image.png] > From adeyneka at gmail.com Tue Nov 9 16:39:29 2021 From: adeyneka at gmail.com (Anatoly Deyneka) Date: Tue, 9 Nov 2021 19:39:29 +0300 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> Message-ID: Please find the log here: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. we would like to go with inflation. We use a lot of reflection. - Does it happen with other GCs? we tried shenandoah today and it worked fine(just 1 stall for the whole day) - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? no, we didn't use this flag - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. we can try this - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? 
Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. good point. we will analize it. Regards, Anatoly Deyneka On Tue, 9 Nov 2021 at 18:26, Thomas St?fe wrote: > Additionally to Eriks proposal, some more remarks from the side (I'll just > lurk in the background otherwise) : > > - Maybe try without reflection inflation? E.g. > `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the > generation of reflection class loaders and rule out that they are the > problem. > > - Does it happen with other GCs? > > - Elastic Metaspace did away with huge pages in Metaspace for several > reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? > > - Can you switch metaspace reclamation off using > `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, > there may be something wrong with the reclamation algorithm. > > - If the VM becomes unresponsive, could you also give us the output of > `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders > show-classes` though with 1M classes output may be large. > > Thanks, Thomas > > On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund > wrote: > >> Hi Anatoly, >> >> Thank you for the additional details. This is very helpful. I talked to >> Per a bit about this, and we feel like understanding what increased the >> metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to >> lead us to the problem. That?s more than 3 orders of magnitude higher >> allocation rate of metaspace memory, which might explain why class >> unloading is taking much longer than usual. >> >> The culprit that is being repeatedly allocated, is rather likely to also >> be repeatedly unloaded, or you would hit an out of memory issue. And the >> log snippet you showed, shows that there is a bunch of LambdaForms being >> unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference >> based cache, so that you don?t have to create a new LambdaForm, when there >> is one in the cache. This soft reference based policy might not work very >> well if all SoftReferences are cleared by the GC, because then it would >> seem like a lot more LambdaForms would have to be created from scratch, as >> opposed to being reused from the cache. >> >> This might explain why it takes a while before this starts happening. The >> interesting thing is figuring out what could have led to all the soft >> references being cleared. This could be because of metaspace exhaustion, or >> due to allocation stalls. In the charts we can see that the allocation rate >> was at the highest point, right before the incident. So without seeing the >> log, it seems like you could have gotten an allocation stall, which >> triggered a full collection clearing all soft references, which suddenly >> leads to class unloading taking much longer. If class unloading takes much >> longer, then you are quite likely to reach the same situation repeatedly, >> in a bad cycle. >> >> Is there any way you could share an -Xlog:gc* log with us? If not, >> finding out if there was an allocation stall or a ?Metadata GC Clear Soft >> References? GC just before everything went bad, might help with figuring >> out what went wrong. Something that did happen between JDK 15 and JDK 17 is >> dynamic thread selection for the GC. I wonder if that has somehow led to a >> stall, which you wouldn?t see in JDK 15 with the more static thread >> selection. 
If so, then maybe setting a SoftMaxHeapSize lower than the >> MaxHeapSize might help avoiding stall in the first place, due to >> fluctuations in the workload, or alternatively turning off >> UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for >> allocation stalls not happen. >> >> Anyway, figuring out what kind of GC triggered right before the incident >> would help a lot I think, to figure out if forced soft ref clearing is >> indeed what instigated this, or if this is all just a red herring. >> >> Thanks, >> /Erik >> >> From: Anatoly Deyneka >> Sent: Tuesday, 9 November 2021 14:13 >> To: Erik Osterlund >> Cc: zgc-dev at openjdk.java.net >> Subject: [External] : Re: ZGC unstable behaviour on java 17 >> >> Hi Erik, >> >> Probably we haven't reached this condition on java 15. >> We see a big difference in metaspace allocation starting from java 16. >> The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy >> >> Here is all our knowledge about the problem: >> >> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md >> < >> https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$ >> > >> >> Unfortunately we can reproduce this issue only on our production platform. >> If we can provide more details or logs please let me know. >> >> Regards, >> Anatoly Deyneka >> >> On Mon, 8 Nov 2021 at 16:46, Erik Osterlund > > wrote: >> Hi Anatoly, >> >> Thanks for reporting this. This looks rather odd indeed. Unfortunately, >> the attached images have been stripped from your original email. >> >> It?s particularly interesting that you see this behaviour when migrating >> from JDK 15 to 17. There have been a few bug fixes, but nothing I can >> easily imagine causing this. I have been surprised before though. One >> notable thing in the class unloading path that has changed though, is the >> elastic metaspace. I see that you use the >> -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. >> This makes me wonder if this reproduces also without that option, i.e. the >> way that you run the workload is more similar. >> >> Anyway, is there any way for me to get access to try to reproduce your >> issue? >> >> Thanks, >> /Erik >> >> > On 4 Nov 2021, at 17:00, Anatoly Deyneka > adeyneka at gmail.com>> wrote: >> > >> > ?We are running a GC intensive application with allocation rate up to >> > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). >> > openjdk 17 2021-09-14 >> > OpenJDK Runtime Environment (build 17+35-2724) >> > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) >> > >> > During migration from java 15 to java 17 we see unstable behavior when >> zgc >> > goes to infinite loop and doesn't return back to the normal state. 
>> > We see the following behavior: >> > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory >> with >> > 16% CPU utilization >> > - something happens and CPU utilization jumps to 40%, used memory jumps >> to >> > 19GB, zgc cycles time is increased in 10times >> > - sometimes app works about 30min before the crash sometimes it's >> crashed >> > instantly >> > - if it's still alive memory is not released and the app works with used >> > memory 18-20GB >> > - we see allocation stalls >> > - we see the following GC stats >> > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong >> > References 25781.928ms >> > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink >> > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 >> 217.882 >> > / 68385.750 ms >> > - we see JVM starts massively unload classes >> (jvm_classes_unloaded_total >> > counter keep increasing) . >> > - we?ve managed to profile JVM in such ?unstable? state and see ZTask >> > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) >> > job. See attached image. >> > >> > We played with options: >> > -XX:ZCollectionInterval=5 >> > -XX:SoftMaxHeapSize=12G >> > -XX:ZAllocationSpikeTolerance=4 >> > -XX:MetaspaceReclaimPolicy=aggressive >> > >> > But it doesn't help. it just postpones this event. >> > Only if we redirect traffic to another node it goes to normal. >> > >> > Regards, >> > Anatoly Deyneka >> > >> > [image: image.png] >> > [image: image.png] >> > [image: image.png] >> > From erik.osterlund at oracle.com Tue Nov 9 21:23:03 2021 From: erik.osterlund at oracle.com (Erik Osterlund) Date: Tue, 9 Nov 2021 21:23:03 +0000 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> Message-ID: <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> Hi Anatoly, Thank you for the additional details. Very interesting indeed. Everything seems to run well until you get an allocation stall due to class unloading taking a long time. The very same GC where the first stall happens, nuked quite a lot of soft references, while previous GCs did not. It really looks like a clear pattern wrt soft reference count and the class unloading headache. This seems to support the original intuition that the lambda form cache might have caused excessive class unloading after lots of soft references get cleared. The GC itself didn?t explicitly clear all soft references, but they might have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref clearing won?t get triggered. That might make things better for you. Having said that, it seems like this cache really shouldn?t be using soft references if the result is this horrible when they get cleared. Unless of course that this is a red herring. Thanks, /Erik On 9 Nov 2021, at 17:39, Anatoly Deyneka wrote: ? Please find the log here: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. we would like to go with inflation. We use a lot of reflection. - Does it happen with other GCs? we tried shenandoah today and it worked fine(just 1 stall for the whole day) - Elastic Metaspace did away with huge pages in Metaspace for several reasons. 
Did the old scenario use -XX:+UseLargePagesInMetaspace ? no, we didn't use this flag - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. we can try this - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. good point. we will analize it. Regards, Anatoly Deyneka On Tue, 9 Nov 2021 at 18:26, Thomas St?fe > wrote: Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) : - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. - Does it happen with other GCs? - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. Thanks, Thomas On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund > wrote: Hi Anatoly, Thank you for the additional details. This is very helpful. I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That?s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual. The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don?t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache. This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle. Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a ?Metadata GC Clear Soft References? GC just before everything went bad, might help with figuring out what went wrong. 
Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn?t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen. Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring. Thanks, /Erik From: Anatoly Deyneka > Sent: Tuesday, 9 November 2021 14:13 To: Erik Osterlund > Cc: zgc-dev at openjdk.java.net Subject: [External] : Re: ZGC unstable behaviour on java 17 Hi Erik, Probably we haven't reached this condition on java 15. We see a big difference in metaspace allocation starting from java 16. The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy Here is all our knowledge about the problem: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md Unfortunately we can reproduce this issue only on our production platform. If we can provide more details or logs please let me know. Regards, Anatoly Deyneka On Mon, 8 Nov 2021 at 16:46, Erik Osterlund >> wrote: Hi Anatoly, Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email. It?s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar. Anyway, is there any way for me to get access to try to reproduce your issue? Thanks, /Erik > On 4 Nov 2021, at 17:00, Anatoly Deyneka >> wrote: > > ?We are running a GC intensive application with allocation rate up to > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). > openjdk 17 2021-09-14 > OpenJDK Runtime Environment (build 17+35-2724) > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) > > During migration from java 15 to java 17 we see unstable behavior when zgc > goes to infinite loop and doesn't return back to the normal state. > We see the following behavior: > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > 16% CPU utilization > - something happens and CPU utilization jumps to 40%, used memory jumps to > 19GB, zgc cycles time is increased in 10times > - sometimes app works about 30min before the crash sometimes it's crashed > instantly > - if it's still alive memory is not released and the app works with used > memory 18-20GB > - we see allocation stalls > - we see the following GC stats > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > References 25781.928ms > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882 > / 68385.750 ms > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > counter keep increasing) . 
> - we?ve managed to profile JVM in such ?unstable? state and see ZTask > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > job. See attached image. > > We played with options: > -XX:ZCollectionInterval=5 > -XX:SoftMaxHeapSize=12G > -XX:ZAllocationSpikeTolerance=4 > -XX:MetaspaceReclaimPolicy=aggressive > > But it doesn't help. it just postpones this event. > Only if we redirect traffic to another node it goes to normal. > > Regards, > Anatoly Deyneka > > [image: image.png] > [image: image.png] > [image: image.png] From adeyneka at gmail.com Wed Nov 10 13:03:51 2021 From: adeyneka at gmail.com (Anatoly Deyneka) Date: Wed, 10 Nov 2021 16:03:51 +0300 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> Message-ID: Thank you Erik, Do you have a understanding of how it can be fixed? SoftRefLRUPolicyMSPerMB - is a workaround and not a fix. We have working shenandoah with default settings. It unloads classes without performance regression. BTW ZGC also can return to normal state if we temporary reduce allocation rate by redirecting new users to another node(old users still stay on the same node). We did that many times when we saw such behaviour. ZGC returned to normal state quite fast. I guess ZGC does the slow unloading class job in the same threads that affects overall performance of GC. It explains why zgc cycles time is dramatically increased. According to *Risks and Assumptions#Uncommit speed* ( https://openjdk.java.net/jeps/387): "Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem. We haven?t observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses." We observed and fixed class allocation problem in some scenarios of using jackson+blackbird. So we can try to reproduce this issue. Regards, Anatoly Deyneka On Wed, 10 Nov 2021 at 00:23, Erik Osterlund wrote: > Hi Anatoly, > > Thank you for the additional details. Very interesting indeed. Everything > seems to run well until you get an allocation stall due to class unloading > taking a long time. The very same GC where the first stall happens, nuked > quite a lot of soft references, while previous GCs did not. It really looks > like a clear pattern wrt soft reference count and the class unloading > headache. > > This seems to support the original intuition that the lambda form cache > might have caused excessive class unloading after lots of soft references > get cleared. > > The GC itself didn?t explicitly clear all soft references, but they might > have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a > *very* high value, the soft ref clearing won?t get triggered. That might > make things better for you. > > Having said that, it seems like this cache really shouldn?t be using soft > references if the result is this horrible when they get cleared. Unless of > course that this is a red herring. > > Thanks, > /Erik > > On 9 Nov 2021, at 17:39, Anatoly Deyneka wrote: > > ? > Please find the log here: > > https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 > > > - Maybe try without reflection inflation? E.g. > `-Dsun.reflect.inflationThreshold=2147483647`. 
Which would inhibit the > generation of reflection class loaders and rule out that they are the > problem. > we would like to go with inflation. We use a lot of reflection. > - Does it happen with other GCs? > we tried shenandoah today and it worked fine(just 1 stall for the whole > day) > - Elastic Metaspace did away with huge pages in Metaspace for several > reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? > no, we didn't use this flag > - Can you switch metaspace reclamation off using > `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, > there may be something wrong with the reclamation algorithm. > we can try this > - If the VM becomes unresponsive, could you also give us the output of > `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders > show-classes` though with 1M classes output may be large. > good point. we will analize it. > > Regards, > Anatoly Deyneka > > On Tue, 9 Nov 2021 at 18:26, Thomas St?fe wrote: > >> Additionally to Eriks proposal, some more remarks from the side (I'll >> just lurk in the background otherwise) : >> >> - Maybe try without reflection inflation? E.g. >> `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the >> generation of reflection class loaders and rule out that they are the >> problem. >> >> - Does it happen with other GCs? >> >> - Elastic Metaspace did away with huge pages in Metaspace for several >> reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? >> >> - Can you switch metaspace reclamation off using >> `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, >> there may be something wrong with the reclamation algorithm. >> >> - If the VM becomes unresponsive, could you also give us the output of >> `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders >> show-classes` though with 1M classes output may be large. >> >> Thanks, Thomas >> >> On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund >> wrote: >> >>> Hi Anatoly, >>> >>> Thank you for the additional details. This is very helpful. I talked to >>> Per a bit about this, and we feel like understanding what increased the >>> metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to >>> lead us to the problem. That?s more than 3 orders of magnitude higher >>> allocation rate of metaspace memory, which might explain why class >>> unloading is taking much longer than usual. >>> >>> The culprit that is being repeatedly allocated, is rather likely to also >>> be repeatedly unloaded, or you would hit an out of memory issue. And the >>> log snippet you showed, shows that there is a bunch of LambdaForms being >>> unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference >>> based cache, so that you don?t have to create a new LambdaForm, when there >>> is one in the cache. This soft reference based policy might not work very >>> well if all SoftReferences are cleared by the GC, because then it would >>> seem like a lot more LambdaForms would have to be created from scratch, as >>> opposed to being reused from the cache. >>> >>> This might explain why it takes a while before this starts happening. >>> The interesting thing is figuring out what could have led to all the soft >>> references being cleared. This could be because of metaspace exhaustion, or >>> due to allocation stalls. In the charts we can see that the allocation rate >>> was at the highest point, right before the incident. 
So without seeing the >>> log, it seems like you could have gotten an allocation stall, which >>> triggered a full collection clearing all soft references, which suddenly >>> leads to class unloading taking much longer. If class unloading takes much >>> longer, then you are quite likely to reach the same situation repeatedly, >>> in a bad cycle. >>> >>> Is there any way you could share an -Xlog:gc* log with us? If not, >>> finding out if there was an allocation stall or a ?Metadata GC Clear Soft >>> References? GC just before everything went bad, might help with figuring >>> out what went wrong. Something that did happen between JDK 15 and JDK 17 is >>> dynamic thread selection for the GC. I wonder if that has somehow led to a >>> stall, which you wouldn?t see in JDK 15 with the more static thread >>> selection. If so, then maybe setting a SoftMaxHeapSize lower than the >>> MaxHeapSize might help avoiding stall in the first place, due to >>> fluctuations in the workload, or alternatively turning off >>> UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for >>> allocation stalls not happen. >>> >>> Anyway, figuring out what kind of GC triggered right before the incident >>> would help a lot I think, to figure out if forced soft ref clearing is >>> indeed what instigated this, or if this is all just a red herring. >>> >>> Thanks, >>> /Erik >>> >>> From: Anatoly Deyneka >>> Sent: Tuesday, 9 November 2021 14:13 >>> To: Erik Osterlund >>> Cc: zgc-dev at openjdk.java.net >>> Subject: [External] : Re: ZGC unstable behaviour on java 17 >>> >>> Hi Erik, >>> >>> Probably we haven't reached this condition on java 15. >>> We see a big difference in metaspace allocation starting from java 16. >>> The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy >>> >>> Here is all our knowledge about the problem: >>> >>> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md >>> >>> < >>> https://urldefense.com/v3/__https:/github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md__;!!ACWV5N9M2RV99hQ!cjW1jni1cEfXVuMQMjBMHiYL-HeLf0WosxB4QltNGKaNiAf1NynZSboJeEbNiPRrUpo$ >>> > >>> >>> Unfortunately we can reproduce this issue only on our production >>> platform. >>> If we can provide more details or logs please let me know. >>> >>> Regards, >>> Anatoly Deyneka >>> >>> On Mon, 8 Nov 2021 at 16:46, Erik Osterlund >> > wrote: >>> Hi Anatoly, >>> >>> Thanks for reporting this. This looks rather odd indeed. Unfortunately, >>> the attached images have been stripped from your original email. >>> >>> It?s particularly interesting that you see this behaviour when migrating >>> from JDK 15 to 17. There have been a few bug fixes, but nothing I can >>> easily imagine causing this. I have been surprised before though. One >>> notable thing in the class unloading path that has changed though, is the >>> elastic metaspace. I see that you use the >>> -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. >>> This makes me wonder if this reproduces also without that option, i.e. the >>> way that you run the workload is more similar. >>> >>> Anyway, is there any way for me to get access to try to reproduce your >>> issue? >>> >>> Thanks, >>> /Erik >>> >>> > On 4 Nov 2021, at 17:00, Anatoly Deyneka >> adeyneka at gmail.com>> wrote: >>> > >>> > ?We are running a GC intensive application with allocation rate up to >>> > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). 
>>> > openjdk 17 2021-09-14 >>> > OpenJDK Runtime Environment (build 17+35-2724) >>> > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) >>> > >>> > During migration from java 15 to java 17 we see unstable behavior when >>> zgc >>> > goes to infinite loop and doesn't return back to the normal state. >>> > We see the following behavior: >>> > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory >>> with >>> > 16% CPU utilization >>> > - something happens and CPU utilization jumps to 40%, used memory >>> jumps to >>> > 19GB, zgc cycles time is increased in 10times >>> > - sometimes app works about 30min before the crash sometimes it's >>> crashed >>> > instantly >>> > - if it's still alive memory is not released and the app works with >>> used >>> > memory 18-20GB >>> > - we see allocation stalls >>> > - we see the following GC stats >>> > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong >>> > References 25781.928ms >>> > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink >>> > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 >>> 217.882 >>> > / 68385.750 ms >>> > - we see JVM starts massively unload classes >>> (jvm_classes_unloaded_total >>> > counter keep increasing) . >>> > - we?ve managed to profile JVM in such ?unstable? state and see ZTask >>> > consuming a lot of cpu doing >>> CompiledMethod::unload_nmethod_caches(bool) >>> > job. See attached image. >>> > >>> > We played with options: >>> > -XX:ZCollectionInterval=5 >>> > -XX:SoftMaxHeapSize=12G >>> > -XX:ZAllocationSpikeTolerance=4 >>> > -XX:MetaspaceReclaimPolicy=aggressive >>> > >>> > But it doesn't help. it just postpones this event. >>> > Only if we redirect traffic to another node it goes to normal. >>> > >>> > Regards, >>> > Anatoly Deyneka >>> > >>> > [image: image.png] >>> > [image: image.png] >>> > [image: image.png] >>> >> From thomas.stuefe at gmail.com Wed Nov 10 13:14:27 2021 From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=) Date: Wed, 10 Nov 2021 14:14:27 +0100 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> Message-ID: > > > > According to *Risks and Assumptions#Uncommit speed* ( > https://openjdk.java.net/jeps/387): > "Metaspace reclamation can happen during a garbage-collection pause, so > this could be a problem. > We haven?t observed this problem so far, but if uncommit times become an > issue then we could offload the uncommitting work to a separate thread so > that it could be done independently of GC pauses." > > This is why I recommended attempting MetaspaceReclaimPolicy=none. If that one mitigates the issue, we have a problem with metaspace uncommit speed. Cheers, Thomas From erik.osterlund at oracle.com Wed Nov 10 16:18:23 2021 From: erik.osterlund at oracle.com (Erik Osterlund) Date: Wed, 10 Nov 2021 16:18:23 +0000 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> Message-ID: Hi Anatoly, I think that we can now see why a large cluster of classes get unloaded, and as you say possibly control that, as a workaround, with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question is what it is that makes the unloading take so long. One theory is that it is related to metaspace uncommitting being more aggressive. 
It seemed, though, like you had tried the more balanced mode and it was still
an issue, if I understood you correctly.

The second theory would be that it is something in the code cache unloading
that takes a long time. And it seems like when you introspected it, the flame
graph suggests that it is the code cache unloading that takes long. So this
seems like a more likely explanation. And I can see from the GC logs that you
have a *lot* of nmethods being loaded. And many thousands of them get
unloaded, which leads to a huge inline cache cleaning task, unlinking all
these thousands of nmethods. This inline cache cleaning fills up a buffer
(ICBuffer) that is currently 10 KB large (rather small, and it has not changed
since ancient times; perhaps it is time to change it). When this buffer fills
up, you need to trigger an IC buffer refilling safepoint, as you have observed
in the logs. So the fact that loads of these safepoints trigger incrementally
seems to suggest that, for this large a number of nmethods being unloaded, the
buffer size is too small. When it is way too small, the code cache walk during
the unlinking phase becomes quadratic in the number of nmethods. And you have
*loads* of nmethods. So that could explain why the code cache unloading takes
so long.

In terms of scaling class unloading to these levels, the long-term fix is
probably to remove inline caches completely, which I have worked on a bit, but
that is not going to happen overnight. A simpler fix that might actually
happen overnight is to increase the size of the IC buffer, like this:

diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp
index 1da48f0bd13..382fa8650c5 100644
--- a/src/hotspot/share/code/icBuffer.cpp
+++ b/src/hotspot/share/code/icBuffer.cpp
@@ -140,7 +140,7 @@ void ICStub::print() {

 void InlineCacheBuffer::initialize() {
   if (_buffer != NULL) return; // already initialized
-  _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer");
+  _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer");
   assert (_buffer != NULL, "cannot allocate InlineCacheBuffer");
 }

That would probably bring down the time spent doing nmethod unlinking by a
*lot*. If you know how to build the JDK, that could be an easy patch to try
out, to significantly improve the scalability of code cache unlinking.

However, one relevant question is also whether this is really a scalability
problem in the GC unloading path, or whether the real culprit is elsewhere.
The fact that you see two orders of magnitude higher allocation rates in
metaspace (!!!) when upgrading from 15 to 17 might explain why the unloading
code suddenly gets stressed so hard in the first place. And perhaps whatever
caused that two-orders-of-magnitude increase is the real culprit here. Having
said that, we might want to improve class unloading scalability anyway, but I
do wonder why this is being stressed at all, and why there are so many dying
classes and nmethods. On that note, it isn't clear to me whether the two
orders of magnitude higher metaspace allocation rate was observed before or
after the JSON issues were addressed, as you described.

/Erik

From: Anatoly Deyneka
Sent: Wednesday, 10 November 2021 14:04
To: Erik Osterlund
Cc: Thomas Stüfe ; zgc-dev at openjdk.java.net
Subject: Re: [External] : Re: ZGC unstable behaviour on java 17

Thank you Erik,

Do you have an understanding of how it can be fixed?
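Erik's earlier suggestion of raising SoftRefLRUPolicyMSPerMB rests on
HotSpot's default soft-reference policy: a softly reachable object becomes
clearable roughly when it has gone unaccessed for longer than
SoftRefLRUPolicyMSPerMB milliseconds per megabyte of free heap. The sketch
below is a rough model of that heuristic, for illustration only, and is not
the VM's actual implementation:

// Rough model of HotSpot's default SoftReference clearing heuristic
// (an approximation for illustration, not the VM's exact code).
final class SoftRefPolicyModel {
    static boolean shouldClear(long msSinceLastAccess,
                               long freeHeapBytes,
                               long softRefLRUPolicyMSPerMB) {
        long freeHeapMB = freeHeapBytes / (1024 * 1024);
        // Idle time a softly reachable object is allowed before it may be cleared.
        long allowedIdleMs = freeHeapMB * softRefLRUPolicyMSPerMB;
        return msSinceLastAccess > allowedIdleMs;
    }

    public static void main(String[] args) {
        long policy = 1000; // default -XX:SoftRefLRUPolicyMSPerMB
        // Plenty of free heap (10 GB): 2 minutes of disuse is well within budget.
        System.out.println(shouldClear(120_000, 10L << 30, policy)); // false -> keep
        // Heap nearly full (64 MB free): the same entry is now clearable.
        System.out.println(shouldClear(120_000, 64L << 20, policy)); // true -> clear
    }
}

With the default value of 1000 and a heap running close to full, the idle
budget shrinks toward zero, so a large batch of soft references can time out
in the same GC cycle; setting the flag very high effectively disables
time-based clearing.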
SoftRefLRUPolicyMSPerMB - is a workaround and not a fix. We have working shenandoah with default settings. It unloads classes without performance regression. BTW ZGC also can return to normal state if we temporary reduce allocation rate by redirecting new users to another node(old users still stay on the same node). We did that many times when we saw such behaviour. ZGC returned to normal state quite fast. I guess ZGC does the slow unloading class job in the same threads that affects overall performance of GC. It explains why zgc cycles time is dramatically increased. According to Risks and Assumptions#Uncommit speed (https://openjdk.java.net/jeps/387): "Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem. We haven?t observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses." We observed and fixed class allocation problem in some scenarios of using jackson+blackbird. So we can try to reproduce this issue. Regards, Anatoly Deyneka On Wed, 10 Nov 2021 at 00:23, Erik Osterlund > wrote: Hi Anatoly, Thank you for the additional details. Very interesting indeed. Everything seems to run well until you get an allocation stall due to class unloading taking a long time. The very same GC where the first stall happens, nuked quite a lot of soft references, while previous GCs did not. It really looks like a clear pattern wrt soft reference count and the class unloading headache. This seems to support the original intuition that the lambda form cache might have caused excessive class unloading after lots of soft references get cleared. The GC itself didn?t explicitly clear all soft references, but they might have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref clearing won?t get triggered. That might make things better for you. Having said that, it seems like this cache really shouldn?t be using soft references if the result is this horrible when they get cleared. Unless of course that this is a red herring. Thanks, /Erik On 9 Nov 2021, at 17:39, Anatoly Deyneka > wrote: ? Please find the log here: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. we would like to go with inflation. We use a lot of reflection. - Does it happen with other GCs? we tried shenandoah today and it worked fine(just 1 stall for the whole day) - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? no, we didn't use this flag - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. we can try this - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. good point. we will analize it. Regards, Anatoly Deyneka On Tue, 9 Nov 2021 at 18:26, Thomas St?fe > wrote: Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) : - Maybe try without reflection inflation? E.g. 
`-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. - Does it happen with other GCs? - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. Thanks, Thomas On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund > wrote: Hi Anatoly, Thank you for the additional details. This is very helpful. I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That?s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual. The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don?t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache. This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle. Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a ?Metadata GC Clear Soft References? GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn?t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen. Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring. 
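To make those two variants concrete, they would look roughly like this (the 14G and the thread count are just placeholders that would have to be tuned for your machine and load):

-Xms20G -Xmx20G -XX:+UseZGC -XX:SoftMaxHeapSize=14G -Xlog:gc*

or

-Xms20G -Xmx20G -XX:+UseZGC -XX:-UseDynamicNumberOfGCThreads -XX:ConcGCThreads=8 -Xlog:gc*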
Thanks, /Erik From: Anatoly Deyneka > Sent: Tuesday, 9 November 2021 14:13 To: Erik Osterlund > Cc: zgc-dev at openjdk.java.net Subject: [External] : Re: ZGC unstable behaviour on java 17 Hi Erik, Probably we haven't reached this condition on java 15. We see a big difference in metaspace allocation starting from java 16. The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy Here is all our knowledge about the problem: https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md Unfortunately we can reproduce this issue only on our production platform. If we can provide more details or logs please let me know. Regards, Anatoly Deyneka On Mon, 8 Nov 2021 at 16:46, Erik Osterlund >> wrote: Hi Anatoly, Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email. It?s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar. Anyway, is there any way for me to get access to try to reproduce your issue? Thanks, /Erik > On 4 Nov 2021, at 17:00, Anatoly Deyneka >> wrote: > > ?We are running a GC intensive application with allocation rate up to > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). > openjdk 17 2021-09-14 > OpenJDK Runtime Environment (build 17+35-2724) > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) > > During migration from java 15 to java 17 we see unstable behavior when zgc > goes to infinite loop and doesn't return back to the normal state. > We see the following behavior: > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > 16% CPU utilization > - something happens and CPU utilization jumps to 40%, used memory jumps to > 19GB, zgc cycles time is increased in 10times > - sometimes app works about 30min before the crash sometimes it's crashed > instantly > - if it's still alive memory is not released and the app works with used > memory 18-20GB > - we see allocation stalls > - we see the following GC stats > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > References 25781.928ms > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882 > / 68385.750 ms > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > counter keep increasing) . > - we?ve managed to profile JVM in such ?unstable? state and see ZTask > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > job. See attached image. > > We played with options: > -XX:ZCollectionInterval=5 > -XX:SoftMaxHeapSize=12G > -XX:ZAllocationSpikeTolerance=4 > -XX:MetaspaceReclaimPolicy=aggressive > > But it doesn't help. it just postpones this event. > Only if we redirect traffic to another node it goes to normal. 
> > Regards, > Anatoly Deyneka > > [image: image.png] > [image: image.png] > [image: image.png] From stefan.karlsson at oracle.com Wed Nov 10 17:06:48 2021 From: stefan.karlsson at oracle.com (Stefan Karlsson) Date: Wed, 10 Nov 2021 18:06:48 +0100 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> Message-ID: <57b16ac3-2083-6544-9822-aacdac934314@oracle.com> Hi, I've managed to reproduce the long Concurrent Classes Unlink phase by running an edited Blackbird microbenchmark. First add an ObjectMapper#copy call to beanArrayDeser: https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java --- diff --git a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java index 3675286..af05d54 100644 --- a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java +++ b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java @@ -80,7 +80,7 @@ public abstract class BaseBenchmark { ???? @Benchmark ???? public SomeBean[] beanArrayDeser() throws Exception { -??????? return mapper.readValue(beansBytes, SomeBean[].class); +??????? return mapper.copy().readValue(beansBytes, SomeBean[].class); ???? } ???? @Benchmark --- Then change the ICBuffer size to 100, and run the following: java -Xlog:gc* -Xmx20g -XX:+UseZGC -XX:MetaspaceReclaimPolicy=none -XX:-ZProactive -jar target/benchmarks.jar First run: [180,168s][info][gc,stats ]??? Subphase: Concurrent Classes Unlink 495,386 / 495,386??? 7002,041 / 24368,855? 7002,041 / 24368,855 7002,041 / 24368,855?? ms Second run: [140,170s][info][gc,stats??? ]??? Subphase: Concurrent Classes Unlink???????????????? 67606,542 / 134428,068 67606,542 / 134428,068 67606,542 / 134428,068 67606,542 / 134428,068?? ms StefanK On 2021-11-10 17:18, Erik Osterlund wrote: > Hi Anatoly, > > I think that we can now see why a large cluster of classes get unloaded, and as you say possibly control that, as a workaround, with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question is what it is that makes the unloading take so long. > > One theory is that it is related to metaspace uncommitting being more aggressive. It seemed though like you had tried the more balanced mode and it still was an issue, if I understood you correctly. > > The second theory would be that it is something in the code cache unloading that takes a long time. And it seems like when you introspected it, the flame graph suggests that it is the code cache unloading that takes long. So this seems like a more likely explanation. And I can see from the GC logs that you have a *lot* of nmethods being loaded. And many thousands of them get unloaded, which leads to a huge inline cache cleaning task, unlinking all these thousands of nmethods. This inline cache cleaning fills up a buffer (ICBuffer) that is currently 10 KB large (rather small and has not changed since ancient times, perhaps it is time to change it). When this buffer fills up, you need to trigger an IC buffer refilling safepoint, as you have observed in the logs. So the fact that loads of these safepoints incrementally trigger, seems to suggest that for this large number of nmethods being unloaded, and the buffer size is too small. 
When it is way too small, the code cache walk during the unlinking phase becomes quadratic with the number of nmethods. And you have *loads* of nmethods. So that could explain why it takes so long with the code cache unloading. > > In the solution space of scalability of class unloading to these large levels, the long-term fix is probably to remove inline caches completely, which I have worked on a bit, but it is not going to happen over night. A simpler fix that might actually happen over night, is to increase the size of the IC buffer, like this: > > diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp > index 1da48f0bd13..382fa8650c5 100644 > --- a/src/hotspot/share/code/icBuffer.cpp > +++ b/src/hotspot/share/code/icBuffer.cpp > @@ -140,7 +140,7 @@ void ICStub::print() { > > void InlineCacheBuffer::initialize() { > if (_buffer != NULL) return; // already initialized > - _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer"); > + _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer"); > assert (_buffer != NULL, "cannot allocate InlineCacheBuffer"); > } > > That would probably bring down the time spent doing nmethod unlinking by a *lot*. If you know how to build the JDK, that could be an easy patch to try out, to significantly improve the scalability of code cache unlinking. > > However, one relevant question is also if this is really a scalability problem in the GC unloading path, or if the real culprit is elsewhere. The fact that you see two orders of magnitude higher allocation rates in metaspace (!!!) when upgrading from 15 to 17, might explain why the unloading code suddenly gets scalability stressed so hard in the first place. And perhaps whatever caused that 2 orders of magnitude increase, is the real culprit here. Having said that, we might want to improve on the class unloading scalabiliy anyway, but I do wonder why this is being stressed at all, and why there are so many dying classes and nmethods. On that note, it isn?t clear to me if the 2 orders of magnitude higher metaspace allocation rate was observed before or after the JSON issues were addressed, as you described. > > /Erik > > From: Anatoly Deyneka > Sent: Wednesday, 10 November 2021 14:04 > To: Erik Osterlund > Cc: Thomas St?fe;zgc-dev at openjdk.java.net > Subject: Re: [External] : Re: ZGC unstable behaviour on java 17 > > Thank you Erik, > > Do you have a understanding of how it can be fixed? > SoftRefLRUPolicyMSPerMB - is a workaround and not a fix. > We have working shenandoah with default settings. It unloads classes without performance regression. > BTW ZGC also can return to normal state if we temporary reduce allocation rate by redirecting new users to another node(old users still stay on the same node). > We did that many times when we saw such behaviour. ZGC returned to normal state quite fast. > > I guess ZGC does the slow unloading class job in the same threads that affects overall performance of GC. > It explains why zgc cycles time is dramatically increased. > > According to Risks and Assumptions#Uncommit speed (https://openjdk.java.net/jeps/387): > "Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem. > We haven?t observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses." 
> > We observed and fixed class allocation problem in some scenarios of using jackson+blackbird. > So we can try to reproduce this issue. > > Regards, > Anatoly Deyneka > > On Wed, 10 Nov 2021 at 00:23, Erik Osterlund > wrote: > Hi Anatoly, > > Thank you for the additional details. Very interesting indeed. Everything seems to run well until you get an allocation stall due to class unloading taking a long time. The very same GC where the first stall happens, nuked quite a lot of soft references, while previous GCs did not. It really looks like a clear pattern wrt soft reference count and the class unloading headache. > > This seems to support the original intuition that the lambda form cache might have caused excessive class unloading after lots of soft references get cleared. > > The GC itself didn?t explicitly clear all soft references, but they might have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref clearing won?t get triggered. That might make things better for you. > > Having said that, it seems like this cache really shouldn?t be using soft references if the result is this horrible when they get cleared. Unless of course that this is a red herring. > > Thanks, > /Erik > > > On 9 Nov 2021, at 17:39, Anatoly Deyneka > wrote: > ? > Please find the log here: > https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 > > - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. > we would like to go with inflation. We use a lot of reflection. > - Does it happen with other GCs? > we tried shenandoah today and it worked fine(just 1 stall for the whole day) > - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? > no, we didn't use this flag > - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. > we can try this > - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. > good point. we will analize it. > > Regards, > Anatoly Deyneka > > On Tue, 9 Nov 2021 at 18:26, Thomas St?fe > wrote: > Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) : > > - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. > > - Does it happen with other GCs? > > - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? > > - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. > > - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. > > Thanks, Thomas > > On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund > wrote: > Hi Anatoly, > > Thank you for the additional details. 
This is very helpful. I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That?s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual. > > The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don?t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache. > > This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle. > > Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a ?Metadata GC Clear Soft References? GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn?t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen. > > Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring. > > Thanks, > /Erik > > From: Anatoly Deyneka > > Sent: Tuesday, 9 November 2021 14:13 > To: Erik Osterlund > > Cc:zgc-dev at openjdk.java.net > Subject: [External] : Re: ZGC unstable behaviour on java 17 > > Hi Erik, > > Probably we haven't reached this condition on java 15. > We see a big difference in metaspace allocation starting from java 16. > The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy > > Here is all our knowledge about the problem: > https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md > > Unfortunately we can reproduce this issue only on our production platform. > If we can provide more details or logs please let me know. > > Regards, > Anatoly Deyneka > > On Mon, 8 Nov 2021 at 16:46, Erik Osterlund >> wrote: > Hi Anatoly, > > Thanks for reporting this. This looks rather odd indeed. 
Unfortunately, the attached images have been stripped from your original email. > > It?s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar. > > Anyway, is there any way for me to get access to try to reproduce your issue? > > Thanks, > /Erik > >> On 4 Nov 2021, at 17:00, Anatoly Deyneka >> wrote: >> >> ?We are running a GC intensive application with allocation rate up to >> 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). >> openjdk 17 2021-09-14 >> OpenJDK Runtime Environment (build 17+35-2724) >> OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) >> >> During migration from java 15 to java 17 we see unstable behavior when zgc >> goes to infinite loop and doesn't return back to the normal state. >> We see the following behavior: >> - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with >> 16% CPU utilization >> - something happens and CPU utilization jumps to 40%, used memory jumps to >> 19GB, zgc cycles time is increased in 10times >> - sometimes app works about 30min before the crash sometimes it's crashed >> instantly >> - if it's still alive memory is not released and the app works with used >> memory 18-20GB >> - we see allocation stalls >> - we see the following GC stats >> [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong >> References 25781.928ms >> [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink >> 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882 >> / 68385.750 ms >> - we see JVM starts massively unload classes (jvm_classes_unloaded_total >> counter keep increasing) . >> - we?ve managed to profile JVM in such ?unstable? state and see ZTask >> consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) >> job. See attached image. >> >> We played with options: >> -XX:ZCollectionInterval=5 >> -XX:SoftMaxHeapSize=12G >> -XX:ZAllocationSpikeTolerance=4 >> -XX:MetaspaceReclaimPolicy=aggressive >> >> But it doesn't help. it just postpones this event. >> Only if we redirect traffic to another node it goes to normal. >> >> Regards, >> Anatoly Deyneka >> >> [image: image.png] >> [image: image.png] >> [image: image.png] From adeyneka at gmail.com Thu Nov 11 10:19:05 2021 From: adeyneka at gmail.com (Anatoly Deyneka) Date: Thu, 11 Nov 2021 13:19:05 +0300 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: <57b16ac3-2083-6544-9822-aacdac934314@oracle.com> References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> <57b16ac3-2083-6544-9822-aacdac934314@oracle.com> Message-ID: It's great you can reproduce it. Is there anything else we can help you with? On that note, it isn?t clear to me if the 2 orders of magnitude higher > metaspace allocation rate was observed before or after the JSON issues were > addressed, as you described. Before the fix of ObjectMapper.copy() we had a 250MB/s allocation rate and OutOfMemory(metaspace). 
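For context, the pattern we fixed was essentially the same one as in Stefan's reproducer. A simplified sketch (SomeBean here just stands in for our real payload types):

---
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.blackbird.BlackbirdModule;

public class Deser {
    static class SomeBean { public int value; }  // stand-in for the real payload

    private static final ObjectMapper MAPPER =
            new ObjectMapper().registerModule(new BlackbirdModule());

    // Before the fix: every copy() starts with fresh (de)serializer caches,
    // so Blackbird regenerates its accessor classes (and the associated
    // LambdaForms) again and again under load.
    static SomeBean[] readBefore(byte[] json) throws Exception {
        return MAPPER.copy().readValue(json, SomeBean[].class);
    }

    // After the fix: the fully configured mapper is reused, so the generated
    // classes are created once and stay cached.
    static SomeBean[] readAfter(byte[] json) throws Exception {
        return MAPPER.readValue(json, SomeBean[].class);
    }
}
---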
BTW with shenandoah we have about 6MB/s with rare spikes up to 100MB/s Regards, Anatoly Deyneka On Wed, 10 Nov 2021 at 20:07, Stefan Karlsson wrote: > Hi, > > I've managed to reproduce the long Concurrent Classes Unlink phase by > running an edited Blackbird microbenchmark. > > First add an ObjectMapper#copy call to beanArrayDeser: > > https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > > --- > diff --git > a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > index 3675286..af05d54 100644 > --- > a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > +++ > b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > @@ -80,7 +80,7 @@ public abstract class BaseBenchmark { > > @Benchmark > public SomeBean[] beanArrayDeser() throws Exception { > - return mapper.readValue(beansBytes, SomeBean[].class); > + return mapper.copy().readValue(beansBytes, SomeBean[].class); > } > > @Benchmark > --- > > Then change the ICBuffer size to 100, and run the following: > java -Xlog:gc* -Xmx20g -XX:+UseZGC -XX:MetaspaceReclaimPolicy=none > -XX:-ZProactive -jar target/benchmarks.jar > > First run: > [180,168s][info][gc,stats ] Subphase: Concurrent Classes > Unlink 495,386 / 495,386 7002,041 / 24368,855 > 7002,041 / 24368,855 7002,041 / 24368,855 ms > > Second run: > [140,170s][info][gc,stats ] Subphase: Concurrent Classes > Unlink 67606,542 / 134428,068 67606,542 / 134428,068 > 67606,542 / 134428,068 67606,542 / 134428,068 ms > > StefanK > > On 2021-11-10 17:18, Erik Osterlund wrote: > > Hi Anatoly, > > I think that we can now see why a large cluster of classes get unloaded, and as you say possibly control that, as a workaround, with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question is what it is that makes the unloading take so long. > > One theory is that it is related to metaspace uncommitting being more aggressive. It seemed though like you had tried the more balanced mode and it still was an issue, if I understood you correctly. > > The second theory would be that it is something in the code cache unloading that takes a long time. And it seems like when you introspected it, the flame graph suggests that it is the code cache unloading that takes long. So this seems like a more likely explanation. And I can see from the GC logs that you have a *lot* of nmethods being loaded. And many thousands of them get unloaded, which leads to a huge inline cache cleaning task, unlinking all these thousands of nmethods. This inline cache cleaning fills up a buffer (ICBuffer) that is currently 10 KB large (rather small and has not changed since ancient times, perhaps it is time to change it). When this buffer fills up, you need to trigger an IC buffer refilling safepoint, as you have observed in the logs. So the fact that loads of these safepoints incrementally trigger, seems to suggest that for this large number of nmethods being unloaded, and the buffer size is too small. When it is way too small, the code cache walk during the unlinking phase becomes quadratic with the number of nmethods. And you have *loads* of nmethods. So that could explain why it takes so long with the code cache unloading. 
> > In the solution space of scalability of class unloading to these large levels, the long-term fix is probably to remove inline caches completely, which I have worked on a bit, but it is not going to happen over night. A simpler fix that might actually happen over night, is to increase the size of the IC buffer, like this: > > diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp > index 1da48f0bd13..382fa8650c5 100644 > --- a/src/hotspot/share/code/icBuffer.cpp > +++ b/src/hotspot/share/code/icBuffer.cpp > @@ -140,7 +140,7 @@ void ICStub::print() { > > void InlineCacheBuffer::initialize() { > if (_buffer != NULL) return; // already initialized > - _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer"); > + _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer"); > assert (_buffer != NULL, "cannot allocate InlineCacheBuffer"); > } > > That would probably bring down the time spent doing nmethod unlinking by a *lot*. If you know how to build the JDK, that could be an easy patch to try out, to significantly improve the scalability of code cache unlinking. > > However, one relevant question is also if this is really a scalability problem in the GC unloading path, or if the real culprit is elsewhere. The fact that you see two orders of magnitude higher allocation rates in metaspace (!!!) when upgrading from 15 to 17, might explain why the unloading code suddenly gets scalability stressed so hard in the first place. And perhaps whatever caused that 2 orders of magnitude increase, is the real culprit here. Having said that, we might want to improve on the class unloading scalabiliy anyway, but I do wonder why this is being stressed at all, and why there are so many dying classes and nmethods. On that note, it isn?t clear to me if the 2 orders of magnitude higher metaspace allocation rate was observed before or after the JSON issues were addressed, as you described. > > /Erik > > From: Anatoly Deyneka > Sent: Wednesday, 10 November 2021 14:04 > To: Erik Osterlund > Cc: Thomas St?fe ; zgc-dev at openjdk.java.net > Subject: Re: [External] : Re: ZGC unstable behaviour on java 17 > > Thank you Erik, > > Do you have a understanding of how it can be fixed? > SoftRefLRUPolicyMSPerMB - is a workaround and not a fix. > We have working shenandoah with default settings. It unloads classes without performance regression. > BTW ZGC also can return to normal state if we temporary reduce allocation rate by redirecting new users to another node(old users still stay on the same node). > We did that many times when we saw such behaviour. ZGC returned to normal state quite fast. > > I guess ZGC does the slow unloading class job in the same threads that affects overall performance of GC. > It explains why zgc cycles time is dramatically increased. > > According to Risks and Assumptions#Uncommit speed (https://openjdk.java.net/jeps/387): > "Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem. > We haven?t observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses." > > We observed and fixed class allocation problem in some scenarios of using jackson+blackbird. > So we can try to reproduce this issue. 
> > Regards, > Anatoly Deyneka > > On Wed, 10 Nov 2021 at 00:23, Erik Osterlund > wrote: > Hi Anatoly, > > Thank you for the additional details. Very interesting indeed. Everything seems to run well until you get an allocation stall due to class unloading taking a long time. The very same GC where the first stall happens, nuked quite a lot of soft references, while previous GCs did not. It really looks like a clear pattern wrt soft reference count and the class unloading headache. > > This seems to support the original intuition that the lambda form cache might have caused excessive class unloading after lots of soft references get cleared. > > The GC itself didn?t explicitly clear all soft references, but they might have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref clearing won?t get triggered. That might make things better for you. > > Having said that, it seems like this cache really shouldn?t be using soft references if the result is this horrible when they get cleared. Unless of course that this is a red herring. > > Thanks, > /Erik > > > On 9 Nov 2021, at 17:39, Anatoly Deyneka > wrote: > ? > Please find the log here:https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 > > - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. > we would like to go with inflation. We use a lot of reflection. > - Does it happen with other GCs? > we tried shenandoah today and it worked fine(just 1 stall for the whole day) > - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? > no, we didn't use this flag > - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. > we can try this > - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. > good point. we will analize it. > > Regards, > Anatoly Deyneka > > On Tue, 9 Nov 2021 at 18:26, Thomas St?fe > wrote: > Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) : > > - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. > > - Does it happen with other GCs? > > - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? > > - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. > > - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. > > Thanks, Thomas > > On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund > wrote: > Hi Anatoly, > > Thank you for the additional details. This is very helpful. 
I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That?s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual. > > The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don?t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache. > > This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle. > > Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a ?Metadata GC Clear Soft References? GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn?t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen. > > Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring. > > Thanks, > /Erik > > From: Anatoly Deyneka > > Sent: Tuesday, 9 November 2021 14:13 > To: Erik Osterlund > > Cc: zgc-dev at openjdk.java.net > Subject: [External] : Re: ZGC unstable behaviour on java 17 > > Hi Erik, > > Probably we haven't reached this condition on java 15. > We see a big difference in metaspace allocation starting from java 16. > The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy > > Here is all our knowledge about the problem:https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md > > Unfortunately we can reproduce this issue only on our production platform. > If we can provide more details or logs please let me know. > > Regards, > Anatoly Deyneka > > On Mon, 8 Nov 2021 at 16:46, Erik Osterlund >> wrote: > Hi Anatoly, > > Thanks for reporting this. This looks rather odd indeed. Unfortunately, the attached images have been stripped from your original email. 
> > It?s particularly interesting that you see this behaviour when migrating from JDK 15 to 17. There have been a few bug fixes, but nothing I can easily imagine causing this. I have been surprised before though. One notable thing in the class unloading path that has changed though, is the elastic metaspace. I see that you use the -XX:MetaspaceReclaimPolicy=aggressive option, which did not exist in JDK15. This makes me wonder if this reproduces also without that option, i.e. the way that you run the workload is more similar. > > Anyway, is there any way for me to get access to try to reproduce your issue? > > Thanks, > /Erik > > > On 4 Nov 2021, at 17:00, Anatoly Deyneka >> wrote: > > ?We are running a GC intensive application with allocation rate up to > 1.5GB/s(4k requests/sec) on 32v cores instance (-Xms20G -Xmx20G). > openjdk 17 2021-09-14 > OpenJDK Runtime Environment (build 17+35-2724) > OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) > > During migration from java 15 to java 17 we see unstable behavior when zgc > goes to infinite loop and doesn't return back to the normal state. > We see the following behavior: > - it runs perfectly(~30% better then on java 15) - 5-7GB used memory with > 16% CPU utilization > - something happens and CPU utilization jumps to 40%, used memory jumps to > 19GB, zgc cycles time is increased in 10times > - sometimes app works about 30min before the crash sometimes it's crashed > instantly > - if it's still alive memory is not released and the app works with used > memory 18-20GB > - we see allocation stalls > - we see the following GC stats > [17606.140s][info][gc,phases ] GC(719) Concurrent Process Non-Strong > References 25781.928ms > [17610.181s][info][gc,stats ] Subphase: Concurrent Classes Unlink > 14280.772 / 25769.511 1126.563 / 25769.511 217.882 / 68385.750 217.882 > / 68385.750 ms > - we see JVM starts massively unload classes (jvm_classes_unloaded_total > counter keep increasing) . > - we?ve managed to profile JVM in such ?unstable? state and see ZTask > consuming a lot of cpu doing CompiledMethod::unload_nmethod_caches(bool) > job. See attached image. > > We played with options: > -XX:ZCollectionInterval=5 > -XX:SoftMaxHeapSize=12G > -XX:ZAllocationSpikeTolerance=4 > -XX:MetaspaceReclaimPolicy=aggressive > > But it doesn't help. it just postpones this event. > Only if we redirect traffic to another node it goes to normal. > > Regards, > Anatoly Deyneka > > [image: image.png] > [image: image.png] > [image: image.png] > > > From stefan.karlsson at oracle.com Fri Nov 12 15:46:20 2021 From: stefan.karlsson at oracle.com (Stefan Karlsson) Date: Fri, 12 Nov 2021 16:46:20 +0100 Subject: [External] : Re: ZGC unstable behaviour on java 17 In-Reply-To: References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com> <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com> <57b16ac3-2083-6544-9822-aacdac934314@oracle.com> Message-ID: Hi Anatoly, On 2021-11-11 11:19, Anatoly Deyneka wrote: > It's great you can reproduce it. > Is there anything else we can help you with? I think we have everything we need for now. The hints about the problem with refill_ic_stubs were great, and that's what we are currently investigating on our side. Thanks, StefanK > > On that note, it isn?t clear to me if the 2 orders of magnitude > higher metaspace allocation rate was observed before or after the > JSON issues were addressed, as you described. > > Before the fix of ObjectMapper.copy() we had a 250MB/s allocation rate > and OutOfMemory(metaspace). 
> BTW with?shenandoah we have about 6MB/s with rare spikes up to 100MB/s > > Regards, > Anatoly Deyneka > > > > On Wed, 10 Nov 2021 at 20:07, Stefan Karlsson > wrote: > > Hi, > > I've managed to reproduce the long Concurrent Classes Unlink phase > by running an edited Blackbird microbenchmark. > > First add an ObjectMapper#copy call to beanArrayDeser: > https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > > > --- > diff --git > a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > index 3675286..af05d54 100644 > --- > a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > +++ > b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java > @@ -80,7 +80,7 @@ public abstract class BaseBenchmark { > > ???? @Benchmark > ???? public SomeBean[] beanArrayDeser() throws Exception { > -??????? return mapper.readValue(beansBytes, SomeBean[].class); > +??????? return mapper.copy().readValue(beansBytes, SomeBean[].class); > ???? } > > ???? @Benchmark > --- > > Then change the ICBuffer size to 100, and run the following: > java -Xlog:gc* -Xmx20g -XX:+UseZGC -XX:MetaspaceReclaimPolicy=none > -XX:-ZProactive -jar target/benchmarks.jar > > First run: > [180,168s][info][gc,stats ]??? Subphase: Concurrent Classes > Unlink?????????????????? 495,386 / 495,386??? 7002,041 / > 24368,855? 7002,041 / 24368,855? 7002,041 / 24368,855 ms > > Second run: > [140,170s][info][gc,stats??? ]??? Subphase: Concurrent Classes > Unlink???????????????? 67606,542 / 134428,068 67606,542 / > 134428,068 67606,542 / 134428,068 67606,542 / 134428,068?? ms > > StefanK > > On 2021-11-10 17:18, Erik Osterlund wrote: >> Hi Anatoly, >> >> I think that we can now see why a large cluster of classes get unloaded, and as you say possibly control that, as a workaround, with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question is what it is that makes the unloading take so long. >> >> One theory is that it is related to metaspace uncommitting being more aggressive. It seemed though like you had tried the more balanced mode and it still was an issue, if I understood you correctly. >> >> The second theory would be that it is something in the code cache unloading that takes a long time. And it seems like when you introspected it, the flame graph suggests that it is the code cache unloading that takes long. So this seems like a more likely explanation. And I can see from the GC logs that you have a *lot* of nmethods being loaded. And many thousands of them get unloaded, which leads to a huge inline cache cleaning task, unlinking all these thousands of nmethods. This inline cache cleaning fills up a buffer (ICBuffer) that is currently 10 KB large (rather small and has not changed since ancient times, perhaps it is time to change it). When this buffer fills up, you need to trigger an IC buffer refilling safepoint, as you have observed in the logs. So the fact that loads of these safepoints incrementally trigger, seems to suggest that for this large number of nmethods being unloaded, and the buffer size is too small. When it is way too small, the code cache walk during the unlinking phase becomes quadratic with the number of nmethods. And you have *loads* of nmethods. 
So that could explain why it takes so long with the code cache unloading. >> >> In the solution space of scalability of class unloading to these large levels, the long-term fix is probably to remove inline caches completely, which I have worked on a bit, but it is not going to happen over night. A simpler fix that might actually happen over night, is to increase the size of the IC buffer, like this: >> >> diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp >> index 1da48f0bd13..382fa8650c5 100644 >> --- a/src/hotspot/share/code/icBuffer.cpp >> +++ b/src/hotspot/share/code/icBuffer.cpp >> @@ -140,7 +140,7 @@ void ICStub::print() { >> >> void InlineCacheBuffer::initialize() { >> if (_buffer != NULL) return; // already initialized >> - _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer"); >> + _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer"); >> assert (_buffer != NULL, "cannot allocate InlineCacheBuffer"); >> } >> >> That would probably bring down the time spent doing nmethod unlinking by a *lot*. If you know how to build the JDK, that could be an easy patch to try out, to significantly improve the scalability of code cache unlinking. >> >> However, one relevant question is also if this is really a scalability problem in the GC unloading path, or if the real culprit is elsewhere. The fact that you see two orders of magnitude higher allocation rates in metaspace (!!!) when upgrading from 15 to 17, might explain why the unloading code suddenly gets scalability stressed so hard in the first place. And perhaps whatever caused that 2 orders of magnitude increase, is the real culprit here. Having said that, we might want to improve on the class unloading scalabiliy anyway, but I do wonder why this is being stressed at all, and why there are so many dying classes and nmethods. On that note, it isn?t clear to me if the 2 orders of magnitude higher metaspace allocation rate was observed before or after the JSON issues were addressed, as you described. >> >> /Erik >> >> From: Anatoly Deyneka >> Sent: Wednesday, 10 November 2021 14:04 >> To: Erik Osterlund >> Cc: Thomas St?fe ;zgc-dev at openjdk.java.net >> Subject: Re: [External] : Re: ZGC unstable behaviour on java 17 >> >> Thank you Erik, >> >> Do you have a understanding of how it can be fixed? >> SoftRefLRUPolicyMSPerMB - is a workaround and not a fix. >> We have working shenandoah with default settings. It unloads classes without performance regression. >> BTW ZGC also can return to normal state if we temporary reduce allocation rate by redirecting new users to another node(old users still stay on the same node). >> We did that many times when we saw such behaviour. ZGC returned to normal state quite fast. >> >> I guess ZGC does the slow unloading class job in the same threads that affects overall performance of GC. >> It explains why zgc cycles time is dramatically increased. >> >> According to Risks and Assumptions#Uncommit speed (https://openjdk.java.net/jeps/387): >> "Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem. >> We haven?t observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses." >> >> We observed and fixed class allocation problem in some scenarios of using jackson+blackbird. >> So we can try to reproduce this issue. 
>> >> Regards, >> Anatoly Deyneka >> >> On Wed, 10 Nov 2021 at 00:23, Erik Osterlund > wrote: >> Hi Anatoly, >> >> Thank you for the additional details. Very interesting indeed. Everything seems to run well until you get an allocation stall due to class unloading taking a long time. The very same GC where the first stall happens, nuked quite a lot of soft references, while previous GCs did not. It really looks like a clear pattern wrt soft reference count and the class unloading headache. >> >> This seems to support the original intuition that the lambda form cache might have caused excessive class unloading after lots of soft references get cleared. >> >> The GC itself didn?t explicitly clear all soft references, but they might have timed out at the same time. By setting SoftRefLRUPolicyMSPerMB to a *very* high value, the soft ref clearing won?t get triggered. That might make things better for you. >> >> Having said that, it seems like this cache really shouldn?t be using soft references if the result is this horrible when they get cleared. Unless of course that this is a red herring. >> >> Thanks, >> /Erik >> >> >> On 9 Nov 2021, at 17:39, Anatoly Deyneka > wrote: >> ? >> Please find the log here: >> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/gc-1-2021-10-28_09-14-36-blue.log.0 >> >> - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. >> we would like to go with inflation. We use a lot of reflection. >> - Does it happen with other GCs? >> we tried shenandoah today and it worked fine(just 1 stall for the whole day) >> - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? >> no, we didn't use this flag >> - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. >> we can try this >> - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. >> good point. we will analize it. >> >> Regards, >> Anatoly Deyneka >> >> On Tue, 9 Nov 2021 at 18:26, Thomas St?fe > wrote: >> Additionally to Eriks proposal, some more remarks from the side (I'll just lurk in the background otherwise) : >> >> - Maybe try without reflection inflation? E.g. `-Dsun.reflect.inflationThreshold=2147483647`. Which would inhibit the generation of reflection class loaders and rule out that they are the problem. >> >> - Does it happen with other GCs? >> >> - Elastic Metaspace did away with huge pages in Metaspace for several reasons. Did the old scenario use -XX:+UseLargePagesInMetaspace ? >> >> - Can you switch metaspace reclamation off using `-XX:MetaspaceReclaimPolicy=none`? I'd be curious if this helps - if yes, there may be something wrong with the reclamation algorithm. >> >> - If the VM becomes unresponsive, could you also give us the output of `jcmd VM.metaspace` ? Also, if possible, `jcmd VM.metaspace show-loaders show-classes` though with 1M classes output may be large. >> >> Thanks, Thomas >> >> On Tue, Nov 9, 2021 at 3:45 PM Erik Osterlund > wrote: >> Hi Anatoly, >> >> Thank you for the additional details. This is very helpful. 
I talked to Per a bit about this, and we feel like understanding what increased the metaspace allocation rate from 1.5 KB/s to 40 MB/s, is rather likely to lead us to the problem. That?s more than 3 orders of magnitude higher allocation rate of metaspace memory, which might explain why class unloading is taking much longer than usual. >> >> The culprit that is being repeatedly allocated, is rather likely to also be repeatedly unloaded, or you would hit an out of memory issue. And the log snippet you showed, shows that there is a bunch of LambdaForms being unloaded. Interestingly enough, the LambdaFormEditor has a SoftReference based cache, so that you don?t have to create a new LambdaForm, when there is one in the cache. This soft reference based policy might not work very well if all SoftReferences are cleared by the GC, because then it would seem like a lot more LambdaForms would have to be created from scratch, as opposed to being reused from the cache. >> >> This might explain why it takes a while before this starts happening. The interesting thing is figuring out what could have led to all the soft references being cleared. This could be because of metaspace exhaustion, or due to allocation stalls. In the charts we can see that the allocation rate was at the highest point, right before the incident. So without seeing the log, it seems like you could have gotten an allocation stall, which triggered a full collection clearing all soft references, which suddenly leads to class unloading taking much longer. If class unloading takes much longer, then you are quite likely to reach the same situation repeatedly, in a bad cycle. >> >> Is there any way you could share an -Xlog:gc* log with us? If not, finding out if there was an allocation stall or a ?Metadata GC Clear Soft References? GC just before everything went bad, might help with figuring out what went wrong. Something that did happen between JDK 15 and JDK 17 is dynamic thread selection for the GC. I wonder if that has somehow led to a stall, which you wouldn?t see in JDK 15 with the more static thread selection. If so, then maybe setting a SoftMaxHeapSize lower than the MaxHeapSize might help avoiding stall in the first place, due to fluctuations in the workload, or alternatively turning off UseDynamicNumberOfGCThreads and setting ConcGCThreads high enough for allocation stalls not happen. >> >> Anyway, figuring out what kind of GC triggered right before the incident would help a lot I think, to figure out if forced soft ref clearing is indeed what instigated this, or if this is all just a red herring. >> >> Thanks, >> /Erik >> >> From: Anatoly Deyneka > >> Sent: Tuesday, 9 November 2021 14:13 >> To: Erik Osterlund > >> Cc:zgc-dev at openjdk.java.net >> Subject: [External] : Re: ZGC unstable behaviour on java 17 >> >> Hi Erik, >> >> Probably we haven't reached this condition on java 15. >> We see a big difference in metaspace allocation starting from java 16. >> The behaviour is not changed for any option of -XX:MetaspaceReclaimPolicy >> >> Here is all our knowledge about the problem: >> https://github.com/adeyneka/ZGC-java-16-17-unstable-behaviour/blob/main/README.md >> >> Unfortunately we can reproduce this issue only on our production platform. >> If we can provide more details or logs please let me know. >> >> Regards, >> Anatoly Deyneka >> >> On Mon, 8 Nov 2021 at 16:46, Erik Osterlund >> wrote: >> Hi Anatoly, >> >> Thanks for reporting this. This looks rather odd indeed. 
From stefan.karlsson at oracle.com  Thu Nov 18 10:30:16 2021
From: stefan.karlsson at oracle.com (Stefan Karlsson)
Date: Thu, 18 Nov 2021 11:30:16 +0100
Subject: [External] : Re: ZGC unstable behaviour on java 17
In-Reply-To: 
References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com>
 <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com>
 <57b16ac3-2083-6544-9822-aacdac934314@oracle.com>
Message-ID: <44682fad-a1de-5ba7-e281-f4d832368977@oracle.com>

Hi Anatoly,

We've identified a 10-year-old bug in the inline cache cleaning code. See this pull request:
https://github.com/openjdk/jdk/pull/6450

It's likely that this is the cause of the problems you are seeing in your workload.
Note that this bug is particularly bad for concurrent class unloading GCs (ZGC, Shenandoah), while the problem seems less severe for stop-the-world class unloading GCs (G1, Parallel, Serial).

The plan right now is to get this into JDK 18, and then backport it to JDK 17.

Thanks again for reporting this!

StefanK

On 2021-11-12 16:46, Stefan Karlsson wrote:
> Hi Anatoly,
>
> On 2021-11-11 11:19, Anatoly Deyneka wrote:
>> It's great you can reproduce it.
>> Is there anything else we can help you with?
>
> I think we have everything we need for now. The hints about the problem with refill_ic_stubs were great, and that's what we are currently investigating on our side.
>
> Thanks,
> StefanK
>
>>     On that note, it isn't clear to me if the 2 orders of magnitude higher metaspace allocation rate was observed before or after the JSON issues were addressed, as you described.
>>
>> Before the fix of ObjectMapper.copy() we had a 250MB/s allocation rate and OutOfMemory(metaspace).
>> BTW with shenandoah we have about 6MB/s with rare spikes up to 100MB/s
>>
>> Regards,
>> Anatoly Deyneka
>>
>> On Wed, 10 Nov 2021 at 20:07, Stefan Karlsson wrote:
>>
>> Hi,
>>
>> I've managed to reproduce the long Concurrent Classes Unlink phase by running an edited Blackbird microbenchmark.
>>
>> First add an ObjectMapper#copy call to beanArrayDeser:
>> https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>>
>> ---
>> diff --git a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>> index 3675286..af05d54 100644
>> --- a/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>> +++ b/blackbird/benchmarks/src/main/java/com/fasterxml/jackson/module/blackbird/BaseBenchmark.java
>> @@ -80,7 +80,7 @@ public abstract class BaseBenchmark {
>>
>>      @Benchmark
>>      public SomeBean[] beanArrayDeser() throws Exception {
>> -        return mapper.readValue(beansBytes, SomeBean[].class);
>> +        return mapper.copy().readValue(beansBytes, SomeBean[].class);
>>      }
>>
>>      @Benchmark
>> ---
>>
>> Then change the ICBuffer size to 100, and run the following:
>> java -Xlog:gc* -Xmx20g -XX:+UseZGC -XX:MetaspaceReclaimPolicy=none -XX:-ZProactive -jar target/benchmarks.jar
>>
>> First run:
>> [180,168s][info][gc,stats ]    Subphase: Concurrent Classes Unlink    495,386 / 495,386    7002,041 / 24368,855    7002,041 / 24368,855    7002,041 / 24368,855  ms
>>
>> Second run:
>> [140,170s][info][gc,stats ]    Subphase: Concurrent Classes Unlink    67606,542 / 134428,068    67606,542 / 134428,068    67606,542 / 134428,068    67606,542 / 134428,068  ms
>>
>> StefanK
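The pattern the edited benchmark exercises - a fresh ObjectMapper copy per operation - can be sketched roughly like this; the handler and bean names are made up, and the point is only that per-copy state (such as the accessor classes a module like Blackbird generates) is rebuilt for every call instead of being reused:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.module.blackbird.BlackbirdModule;

    public class CopyPerRequest {
        // One shared mapper, configured once.
        private static final ObjectMapper MAPPER =
                new ObjectMapper().registerModule(new BlackbirdModule());

        public static class SomeBean { public int id; public String name; }

        // Hypothetical request handler: copying the mapper on every request
        // means each deserialization runs against a brand-new mapper, so any
        // generated support classes are created (and later unloaded) at the
        // request rate rather than once.
        static SomeBean handleWithCopy(byte[] json) throws Exception {
            return MAPPER.copy().readValue(json, SomeBean.class);
        }

        // Reusing the shared mapper avoids the churn.
        static SomeBean handleShared(byte[] json) throws Exception {
            return MAPPER.readValue(json, SomeBean.class);
        }
    }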
>> On 2021-11-10 17:18, Erik Osterlund wrote:
>>> Hi Anatoly,
>>>
>>> I think that we can now see why a large cluster of classes gets unloaded, and as you say possibly control that, as a workaround, with the SoftRefLRUPolicyMSPerMB flag. However, the bigger question is what it is that makes the unloading take so long.
>>>
>>> One theory is that it is related to metaspace uncommitting being more aggressive. It seemed though like you had tried the more balanced mode and it still was an issue, if I understood you correctly.
>>>
>>> The second theory would be that it is something in the code cache unloading that takes a long time. And it seems like when you introspected it, the flame graph suggests that it is the code cache unloading that takes long. So this seems like a more likely explanation. And I can see from the GC logs that you have a *lot* of nmethods being loaded. And many thousands of them get unloaded, which leads to a huge inline cache cleaning task, unlinking all these thousands of nmethods. This inline cache cleaning fills up a buffer (ICBuffer) that is currently 10 KB large (rather small, and it has not changed since ancient times; perhaps it is time to change it). When this buffer fills up, you need to trigger an IC buffer refilling safepoint, as you have observed in the logs. So the fact that loads of these safepoints incrementally trigger seems to suggest that for this large number of nmethods being unloaded, the buffer size is too small. When it is way too small, the code cache walk during the unlinking phase becomes quadratic with the number of nmethods. And you have *loads* of nmethods. So that could explain why it takes so long with the code cache unloading.
>>>
>>> In the solution space for scaling class unloading to these levels, the long-term fix is probably to remove inline caches completely, which I have worked on a bit, but it is not going to happen overnight. A simpler fix that might actually happen overnight is to increase the size of the IC buffer, like this:
>>>
>>> diff --git a/src/hotspot/share/code/icBuffer.cpp b/src/hotspot/share/code/icBuffer.cpp
>>> index 1da48f0bd13..382fa8650c5 100644
>>> --- a/src/hotspot/share/code/icBuffer.cpp
>>> +++ b/src/hotspot/share/code/icBuffer.cpp
>>> @@ -140,7 +140,7 @@ void ICStub::print() {
>>>
>>>  void InlineCacheBuffer::initialize() {
>>>    if (_buffer != NULL) return; // already initialized
>>> -  _buffer = new StubQueue(new ICStubInterface, 10*K, InlineCacheBuffer_lock, "InlineCacheBuffer");
>>> +  _buffer = new StubQueue(new ICStubInterface, 1*M, InlineCacheBuffer_lock, "InlineCacheBuffer");
>>>    assert (_buffer != NULL, "cannot allocate InlineCacheBuffer");
>>>  }
>>>
>>> That would probably bring down the time spent doing nmethod unlinking by a *lot*. If you know how to build the JDK, that could be an easy patch to try out, to significantly improve the scalability of code cache unlinking.
>>>
>>> However, one relevant question is also if this is really a scalability problem in the GC unloading path, or if the real culprit is elsewhere. The fact that you see two orders of magnitude higher allocation rates in metaspace (!!!) when upgrading from 15 to 17 might explain why the unloading code suddenly gets scalability stressed so hard in the first place. And perhaps whatever caused that 2 orders of magnitude increase is the real culprit here. Having said that, we might want to improve on the class unloading scalability anyway, but I do wonder why this is being stressed at all, and why there are so many dying classes and nmethods. On that note, it isn't clear to me if the 2 orders of magnitude higher metaspace allocation rate was observed before or after the JSON issues were addressed, as you described.
>>>
>>> /Erik
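One way to check whether a given run is hitting this wall might be to log safepoints next to the GC log and count IC-buffer refill operations during a long Concurrent Classes Unlink phase, assuming the refill safepoint is reported under the name ICBufferFull:

    java -Xlog:gc* -Xlog:safepoint -XX:+UseZGC ... -jar app.jar

A pile-up of ICBufferFull entries while a single GC cycle's unlinking phase is running would match the too-small-buffer theory described above.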
>>> From: Anatoly Deyneka
>>> Sent: Wednesday, 10 November 2021 14:04
>>> To: Erik Osterlund
>>> Cc: Thomas Stüfe; zgc-dev at openjdk.java.net
>>> Subject: Re: [External] : Re: ZGC unstable behaviour on java 17
>>>
>>> Thank you Erik,
>>>
>>> Do you have an understanding of how it can be fixed? SoftRefLRUPolicyMSPerMB is a workaround and not a fix.
>>> We have working shenandoah with default settings. It unloads classes without performance regression.
>>> BTW ZGC also can return to a normal state if we temporarily reduce the allocation rate by redirecting new users to another node (old users still stay on the same node).
>>> We did that many times when we saw such behaviour. ZGC returned to a normal state quite fast.
>>>
>>> I guess ZGC does the slow class unloading job in the same threads, which affects the overall performance of the GC.
>>> It explains why zgc cycle times are dramatically increased.
>>>
>>> According to Risks and Assumptions#Uncommit speed (https://openjdk.java.net/jeps/387):
>>> "Metaspace reclamation can happen during a garbage-collection pause, so this could be a problem.
>>> We haven't observed this problem so far, but if uncommit times become an issue then we could offload the uncommitting work to a separate thread so that it could be done independently of GC pauses."
>>>
>>> We observed and fixed a class allocation problem in some scenarios of using jackson+blackbird.
>>> So we can try to reproduce this issue.
>>>
>>> Regards,
>>> Anatoly Deyneka
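For context, the workaround referred to here is a single JVM flag; a run along these lines (heap sizes taken from the original setup, the policy value itself just an arbitrarily large placeholder) would keep timed-out soft references from being cleared:

    java -XX:+UseZGC -Xms20G -Xmx20G -XX:SoftRefLRUPolicyMSPerMB=1000000 ...

SoftRefLRUPolicyMSPerMB is the number of milliseconds a softly reachable object is kept alive per free megabyte of heap (the default is 1000), so a very large value effectively disables timeout-based clearing; soft references are then only cleared when the heap is about to run out of memory.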
From adeyneka at gmail.com  Fri Nov 19 10:45:04 2021
From: adeyneka at gmail.com (Anatoly Deyneka)
Date: Fri, 19 Nov 2021 13:45:04 +0300
Subject: [External] : Re: ZGC unstable behaviour on java 17
In-Reply-To: <44682fad-a1de-5ba7-e281-f4d832368977@oracle.com>
References: <7A1AE5B9-04D8-4BD9-8582-5E158141C5A9@oracle.com>
 <048B6073-AEF0-488D-A674-56FB64B6779F@oracle.com>
 <57b16ac3-2083-6544-9822-aacdac934314@oracle.com>
 <44682fad-a1de-5ba7-e281-f4d832368977@oracle.com>
Message-ID: 

Thank you, I'm surprised how fast it has been fixed.
Looking forward to checking it out.

Anatoly

On Thu, 18 Nov 2021 at 13:30, Stefan Karlsson wrote:

> Hi Anatoly,
>
> We've identified a 10-year-old bug in the inline cache cleaning code.
> See this pull request:
> https://github.com/openjdk/jdk/pull/6450
>
> It's likely that this is the cause of the problems you are seeing in
> your workload. Note that this bug is particularly bad for concurrent
> class unloading GCs (ZGC, Shenandoah), while the problem seems less
> severe for stop-the-world class unloading GCs (G1, Parallel, Serial).
>
> The plan right now is to get this into JDK 18, and then backport it to
> JDK 17.
>
> Thanks again for reporting this!
>
> StefanK