jcmd VM.native_memory extremely large numbers when using ZGC

Thomas Stüfe thomas.stuefe at gmail.com
Mon Oct 28 10:11:21 UTC 2024


Hi Marçal,

Too little information to say anything - I would need the NMT report,
possibly the jcmd System.map output, and possibly the GC log. I am also not
aware of any sizing recommendations for switching from G1 to ZGC, but they
probably exist, and the ZGC devs who normally frequent this ML know this
stuff better than I do.
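
For reference, gathering those would look roughly like this (with <pid> as
the Java process id, and assuming the process was started with
-XX:NativeMemoryTracking=summary):

  jcmd <pid> VM.native_memory summary scale=MB

and the GC log comes from running with -Xlog:gc*:gc.log.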

Cheers, Thomas

On Mon, Oct 28, 2024 at 10:58 AM Marçal Perapoch Amadó <
marcal.perapoch at gmail.com> wrote:

> Hey Thomas,
>
> Thanks a lot for your answer and the information you provided. I think you
> are right about generational ZGC not using multi-mapping (
> https://openjdk.org/jeps/439 - "No multi-mapped memory"). I also didn't
> know about the max heap size * 16, which does seem to match the numbers
> I was seeing on my computer. Good info, thanks again!
>
> > As in, Java OOMEs? OOM killer? Or the pod being killed from the pod
> > management?
> Our canary pods using ZGC were OOM killed, yes. Our metrics also show that
> the "container_memory_working_set_bytes" of the pods using ZGC went above
> 20GB even though they were set to use a max heap of 6GB.
>
> Also, I forgot to mention (in case it helps) we are running:
> openjdk 21.0.4 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode,
> sharing)
>
> Best,
> Marçal
>
>
> Message from Thomas Stüfe <thomas.stuefe at gmail.com> on Mon, Oct 28,
> 2024 at 10:25:
>
>> Hi Marcal,
>>
>> likely a red herring - "reserved" should not matter unless you
>> artificially limit the address space size of the process (e.g. with ulimit
>> -v). And even then, ZGC should just work around this limit. Reserved is
>> just address space, and modern 64-bit OSes don't penalize you for
>> allocating large swathes of address space. It should not cost any real
>> memory.
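>>
>> (If in doubt, "ulimit -v" in the shell that launches the JVM shows whether
>> such a limit is in place; "unlimited" means the address space is not
>> capped.)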
>>
>> About the large number: AFAIK ZGC in generational mode does not do
>> multi-mapping anymore. Both Generational and Single Gen, however, do
>> over-allocate address space (max heap size * 16) - that number may be
>> smaller if capped by whatever is physically possible on the machine. It
>> does that because it rolls its own variant of physical-to-virtual memory
>> mapping, and needs room to maneuver. This is done to fight fragmentation
>> effects.
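>>
>> As a rough illustration of the 16x factor (ignoring any cap): with -Xmx12g,
>> the heap category alone reserves about 12 GB * 16 = 192 GB of address
>> space - which is exactly the "Java Heap (reserved=192GB, committed=12GB)"
>> line in the NMT summary below - while committed stays at the real heap
>> size.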
>>
>> If you want to know how much memory the process uses, the "committed"
>> numbers in NMT are a lot closer to the truth. They are not the truth,
>> however, since memory can be committed but still untouched and therefore
>> not live, for example when pre-committing with -Xmx==-Xms. In that case,
>> "committed" probably also overreports memory use.
>>
>> We are working on improving NMT; future versions will report the live
>> memory size too, if it can be cheaply obtained. The upcoming Java 24
>> release also contains an improved variant of jcmd System.map, which tells
>> you the live size of each memory segment and, at the end, the total live
>> size of all memory - at least on Linux.
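>>
>> For reference, the invocation on a JDK that ships the command is simply
>>
>>   jcmd <pid> System.map
>>
>> with the per-mapping details followed, in the improved variant, by that
>> total at the end.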
>>
>> > our canary nodes were suddenly killed by OOM
>>
>> As in, Java OOMEs? OOM killer? Or the pod being killed from the pod
>> management?
>>
>> HTH,
>>
>> Cheers, Thomas
>>
>>
>>
>>
>> On Mon, Oct 28, 2024 at 9:11 AM Marçal Perapoch Amadó <
>> marcal.perapoch at gmail.com> wrote:
>>
>>> Hello!
>>> First of all, congratulations on all the hard work with ZGC!
>>>
>>> TLDR: running a simple Java main with generational ZGC, NMT reports
>>> 221GB of reserved memory on a 32GB machine.
>>>
>>> *Context*: at my current company, we're keen on switching from G1GC to
>>> ZGC because of its ability to maintain very low pause times. Our problem
>>> in particular is that when we scale up our application, the new nodes get
>>> so much traffic in so little time that, even though a node is technically
>>> ready to accept traffic, the amount of new allocations ends up putting a
>>> lot of pressure on G1, which translates into multiple pauses of over a
>>> second. So we decided to give ZGC a try, and although the numbers for
>>> those pauses were looking amazing, our canary nodes were suddenly killed
>>> by OOM.
>>>
>>> I've read about the ZGC multi-mapping technique and how it can trick the
>>> Linux kernel. I found this thread from this same mailing list
>>> particularly useful:
>>> https://mail.openjdk.org/pipermail/zgc-dev/2018-November/000511.html
>>> and I also read about using the -XX:+UseLargePages flag. I even saw a
>>> thread about Kubernetes and containers having issues with ZGC here:
>>> https://mail.openjdk.org/pipermail/zgc-dev/2023-August/001259.html.
>>> However, despite this research, I have not been able to find a solution
>>> to the issue. So I decided to reproduce the problem locally for further
>>> investigation. Although my local environment is quite different from our
>>> live setup, I encountered the same high reserved memory behavior.
>>>
>>> I created a very simple Java application (just a Main that loops
>>> forever, waits for a number from the console, and performs some
>>> allocations based on it - but I don't think that matters much).
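>>>
>>> Roughly along these lines (a simplified sketch, not my exact code):
>>>
>>> import java.util.ArrayList;
>>> import java.util.List;
>>> import java.util.Scanner;
>>>
>>> public class Main {
>>>     public static void main(String[] args) {
>>>         // Keep the allocations reachable so the heap actually grows.
>>>         List<byte[]> retained = new ArrayList<>();
>>>         Scanner in = new Scanner(System.in);
>>>         while (true) {
>>>             // Read a number from the console, allocate that many 1 MB arrays.
>>>             int mb = in.nextInt();
>>>             for (int i = 0; i < mb; i++) {
>>>                 retained.add(new byte[1024 * 1024]);
>>>             }
>>>             System.out.println("Retained " + retained.size() + " MB");
>>>         }
>>>     }
>>> }
>>>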
>>> I run my application with the following JVM args:
>>> -XX:+UseZGC
>>> -XX:+ZGenerational
>>> -Xms12g
>>> -Xmx12g
>>> -XX:NativeMemoryTracking=summary
>>> -Xlog:gc*:gc.log
>>>
>>> And that produces the following report on my MacBook Pro M2, 32GB.
>>>
>>> *Native Memory Tracking*:
>>> (Omitting categories weighing less than 1GB)
>>>
>>> Total: reserved=221GB, committed=12GB
>>>        malloc: 0GB #38256
>>>        mmap:   reserved=221GB, committed=12GB
>>>
>>> -                 Java Heap (reserved=192GB, committed=12GB)
>>>                             (mmap: reserved=192GB, committed=12GB, at peak)
>>>
>>> -                     Class (reserved=1GB, committed=0GB)
>>>                             (classes #2376)
>>>                             (  instance classes #2142, array classes #234)
>>>                             (mmap: reserved=1GB, committed=0GB, at peak)
>>>                             (  Metadata:   )
>>>                             (    reserved=0GB, committed=0GB)
>>>                             (    used=0GB)
>>>                             (    waste=0GB =0.79%)
>>>                             (  Class space:)
>>>                             (    reserved=1GB, committed=0GB)
>>>                             (    used=0GB)
>>>                             (    waste=0GB =7.49%)
>>>
>>> -                        GC (reserved=16GB, committed=0GB)
>>>                             (mmap: reserved=16GB, committed=0GB, at peak)
>>>
>>> -                   Unknown (reserved=12GB, committed=0GB)
>>>                             (mmap: reserved=12GB, committed=0GB, peak=0GB)
>>>
>>> As you can see, it reports a total reserved of 221GB, which I find very
>>> confusing. I understand it is related to the multi-mapping technique, but
>>> my question is: how can I be sure how much memory my app is actually
>>> using if even jcmd gives me reports like this one?
>>>
>>> Also, launching the same application with G1 reports Total:
>>> reserved=14GB, committed=12GB.
>>>
>>> Sorry if this has already been reported/answered - I really tried to
>>> inform myself before taking up your time, but I have the impression that
>>> I am missing something here.
>>>
>>> Could you please provide any insights or suggestions on what might be
>>> happening, or how we could mitigate this issue? If not jcmd, which
>>> tool/command would you recommend for measuring memory consumption? We'd
>>> greatly appreciate your advice on how to move forward.
>>>
>>> Thank you very much for your time and help!
>>>
>>>
>>> Marçal
>>>
>>