jcmd VM.native_memory extremely large numbers when using ZGC
Marçal Perapoch Amadó
marcal.perapoch at gmail.com
Fri Mar 14 10:15:21 UTC 2025
Hey everyone!
It's been a while since the last update, but I'm happy to share that we
finally managed to run ZGC in our production clusters!
Here's our story in case it can help someone else.
In our company we run our servers using Java OpenJDK Temurin-21.0.6+7 in
Google Kubernetes Engine.
During our first experiments, we ran canary pods using ZGC in some of our
services, as opposed to the non-canary ones that were using G1GC.
Concretely, we used the following arguments:
```
-XX:+UseZGC
-XX:+ZGenerational
-XX:+DisableExplicitGC
-XX:InitialRAMPercentage=50.0
-XX:MaxRAMPercentage=50.0
-XX:+AlwaysPreTouch
-XX:+UseLargePages
```
Even though we set the `-XX:+UseLargePages` JVM flag, our Kubernetes nodes
were not configured to support huge pages. Therefore, ZGC fell back to
`tmpfs` as the heap backing filesystem, as can be seen in the following
GC log lines:
```
[info ][gc,init] Heap Backing File: /memfd:java_heap
[info ][gc,init] Heap Backing Filesystem: tmpfs (0x1021994)
```
While pause times and CPU usage looked very promising, to our surprise
**our containers were killed** after an hour by the Out Of Memory
(**OOM**) killer.
Tracking down the reason for the kills, we saw that
`container_memory_working_set_bytes` (the value the OOM killer watches) was
reaching the configured limit (12Gi) on our canary pods but not on the
non-canary ones. Digging a bit deeper, we found that two metrics were
driving it: `container_memory_cache` and `container_memory_mapped_file`.
With our canary setup, ZGC was [creating an anonymous mapped
file](
https://github.com/openjdk/zgc/blob/master/src/hotspot/os/linux/gc/z/zPhysicalMemoryBacking_linux.cpp#L208)
on tmpfs, which meant reserving a large contiguous block of memory and
committing it immediately, so the OS marked all those pages as in use right
from the start, even though the application wasn't using all of that space.
As a result, our application started with a high baseline memory
consumption, and normal memory growth quickly pushed it to the limit.
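For anyone curious what this looks like at the syscall level, here is a
minimal, self-contained C sketch of the pattern (this is *not* ZGC's actual
code; the file name and the 64 MiB size are made up for the demo, and it
assumes Linux with glibc 2.27+ for `memfd_create`): an anonymous memfd is
created, grown with `ftruncate`, mapped `MAP_SHARED`, and touched up front,
roughly what `-XX:+AlwaysPreTouch` triggers. Every touched page immediately
becomes tmpfs page cache and a mapped file charged to the container's cgroup.
```
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* 64 MiB for the demo; ZGC would size this up to the max heap size. */
    size_t heap_size = 64UL * 1024 * 1024;

    /* Anonymous tmpfs-backed file; appears in pmap as "memfd:java_heap_demo". */
    int fd = memfd_create("java_heap_demo", MFD_CLOEXEC);
    if (fd < 0) { perror("memfd_create"); return 1; }

    /* Grow the backing file to the full heap size. */
    if (ftruncate(fd, (off_t)heap_size) != 0) { perror("ftruncate"); return 1; }

    /* Map it shared into the virtual address space. */
    void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pre-touch every page (roughly what -XX:+AlwaysPreTouch causes):
     * everything below is now resident tmpfs page cache charged to the cgroup. */
    memset(heap, 0, heap_size);

    printf("mapped and touched %zu MiB; inspect with: pmap -x %d\n",
           heap_size >> 20, (int)getpid());
    pause(); /* keep the mapping alive so it can be inspected */
    return 0;
}
```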
After understanding all this, we decided to enable **huge/large page**
support in our [Kubernetes node pool](
https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/),
and configure our canaries to request them following the [OpenJDK
guidelines](
https://wiki.openjdk.org/pages/viewpage.action?pageId=35455001#Main-EnablingLargePages
).
We configured our nodes to utilize huge pages with a page size of 2048KiB
(2MiB), a significant increase over the standard 4KiB pages. Using the same
JVM arguments as before, our GC logs now display the following entries:
```
[info ][gc,init] Heap Backing File: /memfd:java_heap.hugetlb
[info ][gc,init] Heap Backing Filesystem: hugetlbfs (0x958458f6)
```
So huge pages were finally being used!
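At the syscall level, the difference appears to be essentially the
`MFD_HUGETLB` flag (again only a rough sketch under the same assumptions,
not ZGC's actual code), which puts the backing file on hugetlbfs instead of
tmpfs, so the committed heap no longer shows up as tmpfs page cache:
```
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Must be a multiple of the huge page size (2 MiB here). */
    size_t heap_size = 64UL * 1024 * 1024;

    /* Same anonymous file, but backed by hugetlbfs instead of tmpfs.
     * An explicit page size could be requested with MFD_HUGE_2MB. */
    int fd = memfd_create("java_heap_demo.hugetlb", MFD_CLOEXEC | MFD_HUGETLB);
    if (fd < 0) { perror("memfd_create(MFD_HUGETLB)"); return 1; }

    if (ftruncate(fd, (off_t)heap_size) != 0) { perror("ftruncate"); return 1; }

    /* Reserves 2 MiB huge pages; fails with ENOMEM if the node has none free. */
    void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %zu MiB from hugetlbfs, pid %d\n", heap_size >> 20, (int)getpid());
    pause(); /* keep the mapping alive so it can be inspected */
    return 0;
}
```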
As a result, `container_memory_cache` and `container_memory_mapped_file`
metrics dropped significantly and so did the
`container_memory_working_set_bytes`.
This prevented the OOM killer from terminating our canaries, allowing them
to enjoy **sub-1ms GC pause times** while also showing a notable
**reduction in both JVM hiccups (~99%) and CPU usage (~40%)**.
Hope that can be useful to someone!
Best,
Marçal
Message from Marçal Perapoch Amadó <marcal.perapoch at gmail.com> on Wed,
Oct 30, 2024 at 16:00:
> Hello everyone,
>
> I wanted to provide an update on my recent findings.
>
> First, I’m attaching an updated screenshot of the graphs I previously
> shared, titled "experiment_results_corrected.png." In this updated view:
>
> - The *green line* represents one of our regular pods running G1.
> - The *yellow line* represents our canary pod using ZGC.
>
> One notable difference is in the *Page Cache memory*, which starts at
> approximately the initial/max heap size for ZGC and significantly
> influences the *Memory Working Set* value. Over time, as the application
> continued to run, this higher baseline led Kubernetes to OOM-kill and
> restart the container twice after it exceeded the 12GB memory limit. The first
> occurrence happened around 11:00. You’ll see a large spike there, which is
> simply the aggregation of the previous container (killed by OOM) and the
> newly restarted container, as Kubernetes immediately restarts the pod after
> killing it.
>
> That’s essentially what I reported last time, but I was still curious
> about the source of this high *Active Page Cache*.
>
> After further investigation, I noticed another metric showing a
> significant difference between the canary and regular pods:
> *container_memory_mapped_file*, which reflects the size of memory-mapped
> files. This can be seen in the attached "memory_mapped_files.png," where
> the pod with ZGC shows a notably higher mapped file size than the one with
> G1.
>
> While researching, I came across two relevant quotes that might shed light
> on this behavior:
>
> > On Linux the _physical_ memory is basically an anonymous file that is
> only stored in RAM (and not on disk), ZGC uses [memfd_create](
> https://www.systutorials.com/docs/linux/man/2-memfd_create/) to create
> it. The file can then be extended with [ftruncate](
> https://www.systutorials.com/docs/linux/man/2-ftruncate/), ZGC is allowed
> to extend the physical memory (= the anonymous file) up to the maximum heap
> size. Physical memory is then [mmap](
> https://www.systutorials.com/docs/linux/man/2-mmap/)ed into the virtual
> address space.
> source: https://dinfuehr.github.io/blog/a-first-look-into-zgc/
>
> > As part of the Java memory footprint, we have `MappedByteBuffers` - the
> files mapped to virtual memory of a process. **NMT does not track them**,
> however, MappedByteBuffers can also take physical memory.
> source:
> https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi
>
> So I took a couple of pmap snapshots, one from the canary and one from a
> regular pod, and guess what - the canary one is full of:
> ```
> Address Kbytes RSS Dirty Mode Mapping
> 0000040000000000 6144 6100 6100 rw-s- memfd:java_heap (deleted)
> 0000040000600000 2048 0 0 ----- [ anon ]
> 0000040000800000 2048 2048 2048 rw-s- memfd:java_heap (deleted)
> 0000040000a00000 4096 0 0 ----- [ anon ]
> 0000040000e00000 71680 69212 69212 rw-s- memfd:java_heap (deleted)
> 0000040005400000 2048 0 0 ----- [ anon ]
> 0000040005600000 2048 2048 2048 rw-s- memfd:java_heap (deleted)
> 0000040005800000 2048 0 0 ----- [ anon ]
> 0000040005a00000 14336 14336 14336 rw-s- memfd:java_heap (deleted)
> ```
> The non-canary pod, by contrast, has no such entries.
>
> Summing the RSS values for entries with names starting with
> memfd:java_heap matches the values we see in the
> *container_memory_mapped_file* metric. This metric, in turn, drives the
> active page cache, which is the main difference we observe between ZGC and
> G1.
>
> In short, it seems that Kubernetes includes these memory-mapped files in
> the *memory working set* calculation, impacting the decision to kill the
> container when it exceeds limits.
>
> I hope this summary makes sense and may be helpful to others. Any feedback
> or thoughts would be very welcome, though I understand if this is outside
> the main topic.
>
> Thanks a lot,
>
>
> As a reminder, here's a summary of the metrics shown in the graph (by the
> way, I don't think my previous email is visible in the mailing list):
> * CLOUD_GKE: Memory Working Set (bytes): corresponds to the k8s
> `container_memory_working_set_bytes` which represents the amount of memory
> that the container is actively using and cannot be evicted.
> --> This is what the OOM killer is watching for.
> * CLOUD_GKE: Resident Set Size (bytes): corresponds to the k8s
> `container_memory_rss` which is the size of RSS in bytes. Represents the
> amount of physical memory that a process has in the main memory that
> doesn’t correspond to anything on disk. Typically includes stacks, heaps,
> and anonymous memory maps.
> * CLOUD_GKE: Page cache memory (bytes): corresponds to the k8s
> `container_memory_cache` - number of bytes of page cache memory
> * CLOUD_GKE: Active page cache (bytes): corresponds to the k8s Active page
> cache computed as
> `container_memory_working_set_bytes - container_memory_rss`. It contains
> memory pages that are frequently accessed and currently in use by processes.
>
>
> Marçal
>
>
> Message from Thomas Stüfe <thomas.stuefe at gmail.com> on Tue, Oct 29,
> 2024 at 16:29:
>
>> Hi Marcel,
>>
>> the GC log shows a contiguous log spanning times from 10:59 to 11:29.
>> This does not correspond to your graphs, where the yellow lines indicate
>> that the pod had been killed at about 11:10. You sure this is the right GC
>> log?
>>
>> The spikes look strange, and I don't see anything in the gc log that
>> explains them.
>>
>> /Thomas
>>
>> On Tue, Oct 29, 2024 at 3:18 PM Marçal Perapoch Amadó <
>> marcal.perapoch at gmail.com> wrote:
>>
>>> Hello again,
>>>
>>> Thanks a lot for having a look!
>>>
>>> The logs I shared earlier were from a testing environment. I initially
>>> thought we could replicate the issue there, and I wanted to provide more
>>> insights from our experiments quickly, so I ran the test in that
>>> environment. However, in hindsight, this may not have been the best
>>> approach.
>>>
>>> Today, we've repeated the experiment with a pod from our live
>>> environment. I've attached an image that shows four Kubernetes metrics,
>>> which I believe highlight differences between the pod running ZGC and the
>>> one running G1.
>>>
>>> As Florian mentioned, the issue might stem from how Kubernetes or the
>>> container host interprets these metrics, so I’m not sure if anything can be
>>> done from the application side to address this. I just wanted to share
>>> this, in case these additional insights ring a bell and help identify any
>>> potential issues.
>>>
>>> Description of the metrics shown in the attached image:
>>>
>>> * CLOUD_GKE: Memory Working Set (bytes): corresponds to the k8s
>>> `container_memory_working_set_bytes` which represents the amount of memory
>>> that the container is actively using and cannot be evicted. This is what
>>> the OOM killer is watching for.
>>> * CLOUD_GKE: Resident Set Size (bytes): corresponds to the k8s
>>> `container_memory_rss` which is the size of RSS in bytes.
>>> * CLOUD_GKE: Page cache memory (bytes): corresponds to the k8s
>>> `container_memory_cache` - number of bytes of page cache memory
>>> * CLOUD_GKE: Active page cache (bytes): corresponds to the k8s Active
>>> page cache computed as
>>> `container_memory_working_set_bytes - container_memory_rss`. It contains
>>> memory pages that are frequently accessed and currently in use by processes.
>>>
>>> The yellow line is our canary pod using the following jvm args:
>>> ```
>>> -XX:+UseZGC
>>> -XX:+ZGenerational
>>> -XX:InitialRAMPercentage=50.0
>>> -XX:MaxRAMPercentage=50.0
>>> -XX:NativeMemoryTracking=summary
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> ```
>>>
>>> The green line corresponds to a regular pod using G1 and the same heap
>>> size.
>>>
>>> Both share the same specs: 12GB RAM, 4 CPUs, and `OpenJDK 64-Bit Server
>>> VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)`
>>>
>>> As shown in the attached image, the main difference between the pod
>>> running with G1 and the pod using ZGC (canary) is that the one with ZGC
>>> starts with an active page cache of 6GB. This seems to correspond to the
>>> initial/max heap size of the JVM. As a result, the ZGC pod has a much
>>> higher baseline for its **Memory Working Set** right from the start.
>>>
>>> Over time, as the application continues to run, this higher baseline
>>> causes Kubernetes to eventually kill and restart the pod due to Out Of
>>> Memory errors. This occurred twice because the pod exceeded the 12GB memory
>>> limit.
>>> I have also attached the gc log and NMT summary for this run.
>>>
>>> Cheers,
>>>
>>>
>>> Message from Florian Weimer <fweimer at redhat.com> on Mon, Oct 28, 2024
>>> at 16:58:
>>>
>>>> * Marçal Perapoch Amadó:
>>>>
>>>> >> As in, Java OOMEs? OOM killer? Or the pod being killed from the pod
>>>> management?
>>>>
>>>> > Our canary pods using ZGC were OOM killed, yes. It's also visible in
>>>> > our metrics how the "container_memory_working_set_bytes" of the pods
>>>> > using zgc went above 20GB even though they were set to use a max heap
>>>> > of 6GB.
>>>>
>>>> I think some container hosts kill processes based on RSS alone, so even
>>>> memory-mapped I/O can trigger this. From the host's perspective, it
>>>> doesn't matter if the memory is just used for caching and could be
>>>> discarded any time because it's a read-only MAP_SHARED mapping from a
>>>> file.
>>>>
>>>> Thanks,
>>>> Florian
>>>>
>>>>