jcmd VM.native_memory extremely large numbers when using ZGC
Thomas Stüfe
thomas.stuefe at gmail.com
Tue Oct 29 15:28:52 UTC 2024
Hi Marçal,
the GC log is contiguous and spans 10:59 to 11:29. That does not match your
graphs, where the yellow lines indicate that the pod was killed at about
11:10. Are you sure this is the right GC log?
The spikes look strange, and I don't see anything in the GC log that
explains them.
/Thomas
On Tue, Oct 29, 2024 at 3:18 PM Marçal Perapoch Amadó <
marcal.perapoch at gmail.com> wrote:
> Hello again,
>
> Thanks a lot for having a look!
>
> The logs I shared earlier were from a testing environment. I initially
> thought we could replicate the issue there, and I wanted to provide more
> insights from our experiments quickly, so I ran the test in that
> environment. However, in hindsight, this may not have been the best
> approach.
>
> Today, we've repeated the experiment with a pod from our live environment.
> I've attached an image that shows four Kubernetes metrics, which I believe
> highlight differences between the pod running ZGC and the one running G1.
>
> As Florian mentioned, the issue might stem from how Kubernetes or the
> container host interprets these metrics, so I’m not sure if anything can be
> done from the application side to address this. I just wanted to share
> this, in case these additional insights ring a bell and help identify any
> potential issues.
>
> Description of the metrics shown in the attached image:
>
> * CLOUD_GKE: Memory Working Set (bytes): corresponds to the k8s
> `container_memory_working_set_bytes`, which represents the amount of memory
> the container is actively using and that cannot be evicted. This is what the
> OOM killer watches.
> * CLOUD_GKE: Resident Set Size (bytes): corresponds to the k8s
> `container_memory_rss`, which is the size of RSS in bytes.
> * CLOUD_GKE: Page cache memory (bytes): corresponds to the k8s
> `container_memory_cache`, which is the number of bytes of page cache memory.
> * CLOUD_GKE: Active page cache (bytes): a derived metric computed as
> `container_memory_working_set_bytes - container_memory_rss`. It contains
> memory pages that are frequently accessed and currently in use by processes
> (see the sketch below).
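> 
> For reference, here is a rough sketch of how these numbers can be derived
> from inside the pod, assuming cgroup v2 is mounted at /sys/fs/cgroup and
> cAdvisor's usual definitions (rss ~ anon, cache ~ file, working set =
> usage - inactive_file); the exact accounting may differ per setup:
> ```
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.HashMap;
> import java.util.Map;
> 
> public class CgroupMemSnapshot {
>     public static void main(String[] args) throws IOException {
>         // Total memory charged to the container's cgroup.
>         long usage = Long.parseLong(
>                 Files.readString(Path.of("/sys/fs/cgroup/memory.current")).trim());
> 
>         // Parse "key value" pairs from memory.stat (anon, file, inactive_file, ...).
>         Map<String, Long> stat = new HashMap<>();
>         for (String line : Files.readAllLines(Path.of("/sys/fs/cgroup/memory.stat"))) {
>             String[] kv = line.split(" ");
>             stat.put(kv[0], Long.parseLong(kv[1]));
>         }
> 
>         long rss = stat.getOrDefault("anon", 0L);           // ~ container_memory_rss
>         long cache = stat.getOrDefault("file", 0L);         // ~ container_memory_cache
>         long workingSet = usage - stat.getOrDefault("inactive_file", 0L); // ~ working set
>         long activePageCache = workingSet - rss;            // derived metric above
> 
>         System.out.printf("usage=%d rss=%d cache=%d working_set=%d active_page_cache=%d%n",
>                 usage, rss, cache, workingSet, activePageCache);
>     }
> }
> ```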
>
> The yellow line is our canary pod using the following JVM args:
> ```
> -XX:+UseZGC
> -XX:+ZGenerational
> -XX:InitialRAMPercentage=50.0
> -XX:MaxRAMPercentage=50.0
> -XX:NativeMemoryTracking=summary
> -XX:+HeapDumpOnOutOfMemoryError
> ```
>
> The green line corresponds to a regular pod using G1 and the same heap
> size.
>
> Both share the same specs: 12 GB RAM, 4 CPUs, and `OpenJDK 64-Bit Server VM
> Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)`.
>
> As shown in the attached image, the main difference between the pod running
> G1 and the pod using ZGC (canary) is that the ZGC one starts with an active
> page cache of about 6 GB. This seems to correspond to the initial/max heap
> size of the JVM (50% of the 12 GB limit). As a result, the ZGC pod has a much
> higher baseline for its **Memory Working Set** right from the start.
>
> Over time, as the application keeps running, this higher baseline eventually
> causes Kubernetes to kill and restart the pod due to out-of-memory errors.
> This happened twice during the run, each time because the pod exceeded the
> 12 GB memory limit.
> I have also attached the GC log and NMT summary for this run.
>
> Cheers,
>
>
> On Mon, Oct 28, 2024 at 16:58, Florian Weimer <fweimer at redhat.com> wrote:
>
>> * Marçal Perapoch Amadó:
>>
>> >> As in, Java OOMEs? OOM killer? Or the pod being killed from the pod
>> >> management?
>>
>> > Our canary pods using ZGC were OOM killed, yes. Our metrics also show
>> > that the "container_memory_working_set_bytes" of the pods using ZGC went
>> > above 20 GB even though they were set to use a max heap of 6 GB.
>>
>> I think some container hosts kill processes based on RSS alone, so even
>> memory-mapped I/O can trigger this. From the host's perspective, it doesn't
>> matter that the memory is only used for caching and could be discarded at
>> any time because it is a read-only MAP_SHARED mapping of a file.
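>> 
>> For illustration, a minimal Java sketch of such a mapping (the file path is
>> hypothetical): FileChannel.map(READ_ONLY, ...) creates a read-only shared
>> file mapping, and once its pages have been touched they count toward the
>> process RSS even though the kernel could reclaim them at any time.
>> ```
>> import java.io.IOException;
>> import java.nio.MappedByteBuffer;
>> import java.nio.channels.FileChannel;
>> import java.nio.file.Path;
>> import java.nio.file.StandardOpenOption;
>> 
>> public class MappedReadDemo {
>>     public static void main(String[] args) throws IOException {
>>         Path path = Path.of("/tmp/some-large-file");   // hypothetical input file
>>         try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
>>             // Read-only shared mapping of the whole file (file-backed page cache).
>>             MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
>>             long sum = 0;
>>             // Touch one byte per 4 KiB page; the faulted-in pages now show up
>>             // in this process's RSS, although they remain discardable cache.
>>             for (int i = 0; i < buf.limit(); i += 4096) {
>>                 sum += buf.get(i);
>>             }
>>             System.out.println("checksum=" + sum);
>>         }
>>     }
>> }
>> ```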
>>
>> Thanks,
>> Florian
>>
>>