jcmd VM.native_memory extremely large numbers when using ZGC
Marçal Perapoch Amadó
marcal.perapoch at gmail.com
Wed Oct 30 15:00:39 UTC 2024
Hello everyone,
I wanted to provide an update on my recent findings.
First, I’m attaching an updated screenshot of the graphs I previously
shared, titled "experiment_results_corrected.png." In this updated view:
- The *green line* represents one of our regular pods running G1.
- The *yellow line* represents our canary pod using ZGC.
One notable difference is in the *Page Cache memory*, which for ZGC starts at
approximately the initial/max heap size and significantly inflates the
*Memory Working Set* value. Over time, as the application continued to run,
this higher baseline caused the container to exceed the 12GB memory limit
twice, and Kubernetes killed and restarted it due to OOM each time. The first
occurrence happened around 11:00. You’ll see a large spike there, which is
simply the aggregation of the previous container (killed by OOM) and the
newly restarted one, since Kubernetes restarts the pod immediately after
killing it.
That’s essentially what I reported last time, but I was still curious about
the source of this high *Active Page Cache*.
After further investigation, I noticed another metric showing a significant
difference between the canary and regular pods:
*container_memory_mapped_file*, which reflects the size of memory-mapped
files. This can be seen in the attached "memory_mapped_files.png," where
the pod with ZGC shows a notably higher mapped file size than the one with
G1.
While researching, I came across two relevant quotes that might shed light
on this behavior:
> On Linux the _physical_ memory is basically an anonymous file that is
only stored in RAM (and not on disk), ZGC uses [memfd_create](
https://www.systutorials.com/docs/linux/man/2-memfd_create/) to create it.
The file can then be extended with [ftruncate](
https://www.systutorials.com/docs/linux/man/2-ftruncate/), ZGC is allowed
to extend the physical memory (= the anonymous file) up to the maximum heap
size. Physical memory is then [mmap](
https://www.systutorials.com/docs/linux/man/2-mmap/)ed into the virtual
address space.
source: https://dinfuehr.github.io/blog/a-first-look-into-zgc/
> As part of the Java memory footprint, we have `MappedByteBuffers` - the
files mapped to virtual memory of a process. **NMT does not track them**,
however, MappedByteBuffers can also take physical memory.
source:
https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi
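To make the first quote a bit more concrete, here is a minimal sketch of the
memfd_create + ftruncate + mmap pattern it describes. This is not ZGC's actual
code, just an illustration in Python on Linux (3.8+ for os.memfd_create), with
a made-up name "demo_heap" and a 64 MiB size; running it and inspecting the
process with pmap shows the same kind of `memfd:` shared mapping:
```
import mmap
import os

SIZE = 64 * 1024 * 1024  # 64 MiB stand-in for the heap

# Create an anonymous, RAM-backed file; pmap shows it as "memfd:demo_heap".
fd = os.memfd_create("demo_heap")

# Grow the file to the desired size ("committing" physical memory, in ZGC terms).
os.ftruncate(fd, SIZE)

# Map it shared into the virtual address space and touch every page so it
# becomes resident; it is then accounted as a mapped file, not anonymous memory.
heap = mmap.mmap(fd, SIZE, flags=mmap.MAP_SHARED,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE)
heap[:] = b"\x00" * SIZE

input(f"Inspect with `pmap -x {os.getpid()}`, then press Enter to exit...")
heap.close()
os.close(fd)
```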
So I took a couple of pmap snapshots, one from the canary and one from a
regular pod, and guess what: the canary one is full of entries like these:
```
Address Kbytes RSS Dirty Mode Mapping
0000040000000000 6144 6100 6100 rw-s- memfd:java_heap (deleted)
0000040000600000 2048 0 0 ----- [ anon ]
0000040000800000 2048 2048 2048 rw-s- memfd:java_heap (deleted)
0000040000a00000 4096 0 0 ----- [ anon ]
0000040000e00000 71680 69212 69212 rw-s- memfd:java_heap (deleted)
0000040005400000 2048 0 0 ----- [ anon ]
0000040005600000 2048 2048 2048 rw-s- memfd:java_heap (deleted)
0000040005800000 2048 0 0 ----- [ anon ]
0000040005a00000 14336 14336 14336 rw-s- memfd:java_heap (deleted)
```
The non-canary one, on the other hand, has no such entries.
Summing the RSS values of the entries whose mapping name starts with
`memfd:java_heap` matches the values we see in the
*container_memory_mapped_file* metric. This metric, in turn, drives the
active page cache, which is the main difference we observe between ZGC and G1.
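In case anyone wants to repeat that check, here's a rough sketch of the
summation. It assumes a saved `pmap -x <pid>` snapshot with the column layout
shown above; the script and file names are just placeholders:
```
import sys

def memfd_heap_rss_kb(pmap_lines):
    """Sum the RSS column of all memfd:java_heap mappings (in KB)."""
    total_kb = 0
    for line in pmap_lines:
        fields = line.split()
        # Expected layout: Address  Kbytes  RSS  Dirty  Mode  Mapping
        if len(fields) >= 6 and fields[5].startswith("memfd:java_heap"):
            total_kb += int(fields[2])
    return total_kb

if __name__ == "__main__":
    # Usage: python sum_memfd_rss.py pmap_snapshot.txt
    with open(sys.argv[1]) as f:
        kb = memfd_heap_rss_kb(f)
    print(f"memfd:java_heap RSS: {kb} KB ({kb / (1024 * 1024):.2f} GiB)")
```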
In short, it seems that Kubernetes includes these memory-mapped files in
the *memory working set* calculation, impacting the decision to kill the
container when it exceeds limits.
I hope this summary makes sense and may be helpful to others. Any feedback
or thoughts would be very welcome, though I understand if this is outside
the main topic.
Thanks a lot,
As a reminder (by the way, I don't think my previous email is visible in the
mailing list), here's a summary of the metrics shown in the graph, followed by
a small sketch of how they relate:
* CLOUD_GKE: Memory Working Set (bytes): corresponds to the k8s
`container_memory_working_set_bytes` which represents the amount of memory
that the container is actively using and cannot be evicted.
--> This is what the OOM killer is watching for.
* CLOUD_GKE: Resident Set Size (bytes): corresponds to the k8s
`container_memory_rss`, which is the size of RSS in bytes: the amount of
physical memory a process has in main memory that doesn’t correspond to
anything on disk. It typically includes stacks, heaps, and anonymous memory
maps.
* CLOUD_GKE: Page cache memory (bytes): corresponds to the k8s
`container_memory_cache` - number of bytes of page cache memory
* CLOUD_GKE: Active page cache (bytes): corresponds to the k8s Active page
cache computed as
`container_memory_working_set_bytes - container_memory_rss`. It contains
memory pages that are frequently accessed and currently in use by processes.
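Here's a small sketch of how these values relate at the cgroup level. It
assumes cgroup v1 counters and the usual cAdvisor definition of the working
set (usage minus inactive file); the paths and field names are assumptions
and differ under cgroup v2:
```
# Assumes cgroup v1; cgroup v2 uses memory.current and other memory.stat keys.
CGROUP = "/sys/fs/cgroup/memory"

def read_memory_stat(path=f"{CGROUP}/memory.stat"):
    stat = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stat[key] = int(value)
    return stat

stat = read_memory_stat()
with open(f"{CGROUP}/memory.usage_in_bytes") as f:
    usage = int(f.read())

working_set = usage - stat["total_inactive_file"]  # ~ container_memory_working_set_bytes
rss = stat["total_rss"]                            # ~ container_memory_rss
cache = stat["total_cache"]                        # ~ container_memory_cache
mapped_file = stat["total_mapped_file"]            # ~ container_memory_mapped_file
active_page_cache = working_set - rss              # the "Active page cache" panel

for name, value in [("working_set", working_set), ("rss", rss), ("cache", cache),
                    ("mapped_file", mapped_file),
                    ("active_page_cache", active_page_cache)]:
    print(f"{name:>18}: {value / 2**30:.2f} GiB")
```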
Marçal
Message from Thomas Stüfe <thomas.stuefe at gmail.com> on Tue, Oct 29,
2024 at 16:29:
> Hi Marcel,
>
> the GC log shows a contiguous log spanning times from 10:59 to 11:29. This
> does not correspond to your graphs, where the yellow lines indicate that
> the pod had been killed at about 11:10. You sure this is the right GC log?
>
> The spikes look strange, and I don't see anything in the gc log that
> explains them.
>
> /Thomas
>
> On Tue, Oct 29, 2024 at 3:18 PM Marçal Perapoch Amadó <
> marcal.perapoch at gmail.com> wrote:
>
>> Hello again,
>>
>> Thanks a lot for having a look!
>>
>> The logs I shared earlier were from a testing environment. I initially
>> thought we could replicate the issue there, and I wanted to provide more
>> insights from our experiments quickly, so I ran the test in that
>> environment. However, in hindsight, this may not have been the best
>> approach.
>>
>> Today, we've repeated the experiment with a pod from our live
>> environment. I've attached an image that shows four Kubernetes metrics,
>> which I believe highlight differences between the pod running ZGC and the
>> one running G1.
>>
>> As Florian mentioned, the issue might stem from how Kubernetes or the
>> container host interprets these metrics, so I’m not sure if anything can be
>> done from the application side to address this. I just wanted to share
>> this, in case these additional insights ring a bell and help identify any
>> potential issues.
>>
>> Description of the metrics shown in the attached image:
>>
>> * CLOUD_GKE: Memory Working Set (bytes): corresponds to the k8s
>> `container_memory_working_set_bytes` which represents the amount of memory
>> that the container is actively using and cannot be evicted. This is what
>> the OOM killer is watching for.
>> * CLOUD_GKE: Resident Set Size (bytes): corresponds to the k8s
>> `container_memory_rss` which is the size of RSS in bytes.
>> * CLOUD_GKE: Page cache memory (bytes): corresponds to the k8s
>> `container_memory_cache` - number of bytes of page cache memory
>> * CLOUD_GKE: Active page cache (bytes): corresponds to the k8s Active
>> page cache computed as
>> `container_memory_working_set_bytes - container_memory_rss`. It contains
>> memory pages that are frequently accessed and currently in use by processes.
>>
>> The yellow line is our canary pod using the following jvm args:
>> ```
>> -XX:+UseZGC
>> -XX:+ZGenerational
>> -XX:InitialRAMPercentage=50.0
>> -XX:MaxRAMPercentage=50.0
>> -XX:NativeMemoryTracking=summary
>> -XX:+HeapDumpOnOutOfMemoryError
>> ```
>>
>> The green line corresponds to a regular pod using G1 and the same heap
>> size.
>>
>> Both share the same specs: 12GB RAM, 4 CPUs, and `OpenJDK 64-Bit Server VM
>> Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)`
>>
>> As shown in the attached image, the main difference between the pod
>> running with G1 and the pod using ZGC (canary) is that the one with ZGC
>> starts with an active page cache of 6GB. This seems to correspond to the
>> initial/max heap size of the JVM. As a result, the ZGC pod has a much
>> higher baseline for its **Memory Working Set** right from the start.
>>
>> Over time, as the application continues to run, this higher baseline
>> causes Kubernetes to eventually kill and restart the pod due to Out Of
>> Memory errors. This occurred twice because the pod exceeded the 12GB memory
>> limit.
>> I have also attached the gc log and NMT summary for this run.
>>
>> Cheers,
>>
>>
>> Message from Florian Weimer <fweimer at redhat.com> on Mon, Oct 28,
>> 2024 at 16:58:
>>
>>> * Marçal Perapoch Amadó:
>>>
>>> >> As in, Java OOMEs? OOM killer? Or the pod being killed from the pod
>>> management?
>>>
>>> > Our canary pods using ZGC were OOM killed, yes. It's also visible in
>>> > our metrics how the "container_memory_working_set_bytes" of the pods
>>> > using zgc went above 20GB even though they were set to use a max heap
>>> > of 6GB.
>>>
>>> I think some container hosts kill processes based on RSS alone, so even
>>> memory-mapped I/O can trigger this. From the host's perspective, it
>>> doesn't matter if the memory is just used for caching and could be
>>> discarded any time because it's a read-only MAP_SHARED mapping from a
>>> file.
>>>
>>> Thanks,
>>> Florian
>>>
>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memory_mapped_files.png
Type: image/png
Size: 82555 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/zgc-dev/attachments/20241030/05c0f59b/memory_mapped_files-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiment_results_corrected.png
Type: image/png
Size: 333244 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/zgc-dev/attachments/20241030/05c0f59b/experiment_results_corrected-0001.png>