zgc-dev

zgc-dev@openjdk.org

March 2025

  • 1 participant
  • 1 discussion
Re: jcmd VM.native_memory extremely large numbers when using ZGC
by Thomas Stüfe 14 Mar '25

Hi Marcel,

the GC log shows a contiguous log spanning times from 10:59 to 11:29. This
does not correspond to your graphs, where the yellow lines indicate that
the pod had been killed at about 11:10. You sure this is the right GC log?

The spikes look strange, and I don't see anything in the gc log that
explains them.

/Thomas

On Tue, Oct 29, 2024 at 3:18 PM Marçal Perapoch Amadó
<marcal.perapoch(a)gmail.com> wrote:

> Hello again,
>
> Thanks a lot for having a look!
>
> The logs I shared earlier were from a testing environment. I initially
> thought we could replicate the issue there, and I wanted to provide more
> insights from our experiments quickly, so I ran the test in that
> environment. However, in hindsight, this may not have been the best
> approach.
>
> Today, we've repeated the experiment with a pod from our live environment.
> I've attached an image that shows four Kubernetes metrics, which I believe
> highlight differences between the pod running ZGC and the one running G1.
>
> As Florian mentioned, the issue might stem from how Kubernetes or the
> container host interprets these metrics, so I'm not sure if anything can
> be done from the application side to address this. I just wanted to share
> this, in case these additional insights ring a bell and help identify any
> potential issues.
>
> Description of the metrics shown in the attached image:
>
> * CLOUD_GKE: Memory Working Set (bytes): corresponds to the k8s
> `container_memory_working_set_bytes`, which represents the amount of
> memory that the container is actively using and cannot be evicted. This
> is what the OOM killer is watching for.
> * CLOUD_GKE: Resident Set Size (bytes): corresponds to the k8s
> `container_memory_rss`, which is the size of RSS in bytes.
> * CLOUD_GKE: Page cache memory (bytes): corresponds to the k8s
> `container_memory_cache` - number of bytes of page cache memory.
> * CLOUD_GKE: Active page cache (bytes): corresponds to the k8s active
> page cache, computed as
> `container_memory_working_set_bytes - container_memory_rss`. It contains
> memory pages that are frequently accessed and currently in use by
> processes.
>
> The yellow line is our canary pod using the following JVM args:
> ```
> -XX:+UseZGC
> -XX:+ZGenerational
> -XX:InitialRAMPercentage=50.0
> -XX:MaxRAMPercentage=50.0
> -XX:NativeMemoryTracking=summary
> -XX:+HeapDumpOnOutOfMemoryError
> ```
>
> The green line corresponds to a regular pod using G1 and the same heap
> size.
>
> Both share the same specs, 12GB RAM, 4 CPU, and `OpenJDK 64-Bit Server VM
> Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)`.
>
> As shown in the attached image, the main difference between the pod
> running with G1 and the pod using ZGC (canary) is that the one with ZGC
> starts with an active page cache of 6GB. This seems to correspond to the
> initial/max heap size of the JVM. As a result, the ZGC pod has a much
> higher baseline for its **Memory Working Set** right from the start.
>
> Over time, as the application continues to run, this higher baseline
> causes Kubernetes to eventually kill and restart the pod due to Out Of
> Memory errors. This occurred twice because the pod exceeded the 12GB
> memory limit.
> I have also attached the gc log and NMT summary for this run.
>
> Cheers,
>
>
> Message from Florian Weimer <fweimer(a)redhat.com> on Mon, 28 Oct 2024 at
> 16:58:
>
>> * Marçal Perapoch Amadó:
>>
>> >> As in, Java OOMEs? OOM killer? Or the pod being killed from the pod
>> >> management?
>>
>> > Our canary pods using ZGC were OOM killed, yes. It's also visible in
>> > our metrics how the "container_memory_working_set_bytes" of the pods
>> > using zgc went above 20GB even though they were set to use a max heap
>> > of 6GB.
>>
>> I think some container hosts kill processes based on RSS alone, so even
>> memory-mapped I/O can trigger this. From the host's perspective, it
>> doesn't matter if the memory is just used for caching and could be
>> discarded at any time because it's a read-only MAP_SHARED mapping from a
>> file.
>>
>> Thanks,
>> Florian
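For readers who want to collect the kind of NMT summary attached in this thread, a minimal sketch of the jcmd invocations involved, assuming the JVM was started with -XX:NativeMemoryTracking=summary as in the flags quoted above (the pid 1234 is a placeholder):

```
# List running JVMs to find the target pid
jcmd -l

# Print the Native Memory Tracking summary, scaled to megabytes
jcmd 1234 VM.native_memory summary scale=MB

# Record a baseline now, then diff against it later to see where usage grew
jcmd 1234 VM.native_memory baseline
jcmd 1234 VM.native_memory summary.diff scale=MB
```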