jcmd VM.native_memory extremely large numbers when using ZGC
Hello! First of all, congratulations on all the hard work with ZGC!

TLDR: Running a simple Java main with generational ZGC, NMT reports 221GB of reserved memory on a 32GB machine.

*Context*: at my current company, we're keen on switching from G1GC to ZGC due to its ability to maintain very low pause times. Our problem in particular is that when we scale up our application, the new nodes receive so much traffic in so little time that, even though a node is technically ready to accept new traffic, the burst of new allocations adds a lot of pressure to G1, and that translates into multiple pauses of over a second. So we decided to give ZGC a try, and although the numbers for those pauses looked amazing, our canary nodes were suddenly killed by OOM.

I've read about the ZGC multi-mapping technique and how it can trick the Linux kernel. I found this topic from this same mailing list particularly useful: https://mail.openjdk.org/pipermail/zgc-dev/2018-November/000511.html, and I also read about using the -XX:+UseLargePages flag. I even saw a mailing-list topic about Kubernetes and containers having issues with ZGC here: https://mail.openjdk.org/pipermail/zgc-dev/2023-August/001259.html. However, despite this research, I have not been able to find a solution to the issue.

So I decided to reproduce the problem locally for further investigation. Although my local environment is quite different from our live setup, I encountered the same high reserved-memory behavior. I created a very simple Java application (just a Main that loops forever waiting for a number from the console and performs some allocations based on it, but I don't think that matters much). I run my application with the following JVM args:

-XX:+UseZGC -XX:+ZGenerational -Xms12g -Xmx12g -XX:NativeMemoryTracking=summary -Xlog:gc*:gc.log

And that produces the following report on my MacBook Pro M2, 32GB.

*Native Memory Tracking*: (omitting categories weighing less than 1GB)

Total: reserved=221GB, committed=12GB
       malloc: 0GB #38256
       mmap:   reserved=221GB, committed=12GB

-            Java Heap (reserved=192GB, committed=12GB)
                       (mmap: reserved=192GB, committed=12GB, at peak)

-                Class (reserved=1GB, committed=0GB)
                       (classes #2376)
                       (  instance classes #2142, array classes #234)
                       (mmap: reserved=1GB, committed=0GB, at peak)
                       (  Metadata: )
                       (    reserved=0GB, committed=0GB)
                       (    used=0GB)
                       (    waste=0GB =0.79%)
                       (  Class space:)
                       (    reserved=1GB, committed=0GB)
                       (    used=0GB)
                       (    waste=0GB =7.49%)

-                   GC (reserved=16GB, committed=0GB)
                       (mmap: reserved=16GB, committed=0GB, at peak)

-              Unknown (reserved=12GB, committed=0GB)
                       (mmap: reserved=12GB, committed=0GB, peak=0GB)

As you can see, it reports a total reserved of 221GB, which I find very confusing. I understand it is related to the multi-mapping technique, but my question is: how can I be sure how much memory my app is actually using, if even with jcmd I get reports like this one? Also, launching the same application with G1 reports Total: reserved=14GB, committed=12GB.

Sorry if this has already been reported/answered; I really tried to inform myself before taking up your time, but I do have the impression that I am missing something here. Could you please provide any insights or suggestions on what might be happening, or how we could mitigate this issue? If not jcmd, which tool/command would you recommend to measure memory consumption? We'd greatly appreciate your advice on how to move forward.

Thank you very much for your time and help!

Marçal
Hi Marçal,

Likely a red herring: "reserved" should not matter unless you artificially limit the address-space size of the process (e.g. with ulimit -v), and even then ZGC should just work around that limit. Reserved is just address space, and modern 64-bit OSes don't penalize you for allocating large swathes of address space. It should not cost any real memory.

About the large number: AFAIK ZGC in generational mode does not do multi-mapping anymore. Both the generational and single-generation modes do, however, over-allocate address space (max heap size * 16); that number may be smaller if capped by whatever is physically possible on the machine. ZGC does this because it rolls its own variant of physical-to-virtual memory mapping and needs room to maneuver. This is done to fight fragmentation effects.

If you want to know how much memory the process uses, the "committed" numbers in NMT are a lot closer to the truth. They are not the truth either, since memory can be committed but still untouched and therefore not live, for example when pre-committing with -Xms == -Xmx. In that case, "committed" probably also over-reports memory use.

We are working on improving NMT; future versions will also report the live memory size, where it can be cheaply obtained. The upcoming Java 24 also contains an improved variant of jcmd System.map, which tells you the live size of each memory segment and, at the end, the actual live size of all memory. At least on Linux.
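[Editorial note: as a throwaway illustration of why reserved address space is essentially free, it behaves much like the apparent size of a sparse file, which costs no real storage until written. A minimal sketch, assuming GNU coreutils on Linux; the file and the 200GB size are arbitrary.]

```shell
# Reserving address space is like creating a sparse file: the apparent size
# is huge, but nothing real is allocated until something is actually written
# (analogous to NMT "reserved" vs "committed"-and-touched memory).
f=$(mktemp)
truncate -s 200G "$f"                  # "reserve" 200 GB of apparent size
apparent=$(stat -c %s "$f")            # 214748364800 bytes of apparent size...
actual_kb=$(du -k "$f" | cut -f1)      # ...but ~0 KB of blocks actually allocated
echo "apparent=${apparent} actual_kb=${actual_kb}"
rm -f "$f"
```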
our canary nodes were suddenly killed by OOM
As in: Java OOMEs? The OOM killer? Or the pod being killed by its pod management?

HTH,

Cheers, Thomas

On Mon, Oct 28, 2024 at 9:11 AM Marçal Perapoch Amadó <marcal.perapoch@gmail.com> wrote:
Hey Thomas,

Thanks a lot for your answer and the information you provided. I think you are right about generational ZGC not using multi-mapping (https://openjdk.org/jeps/439, "No multi-mapped memory"). I also didn't know about the max-heap-size * 16 reservation, which does seem to match the numbers I was seeing on my machine. Good info, thanks again!
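[Editorial note: a quick shell sanity check of that match, using the category sizes from the NMT report earlier in the thread; the 16x factor is the over-reservation described above.]

```shell
# Reconstruct NMT's "reserved" total from the report for -Xmx12g with ZGC.
heap_gb=12
heap_reserved=$((heap_gb * 16))   # ZGC reserves ~16x max heap of address space
gc_reserved=16                    # "GC" category, reserved GB
unknown_reserved=12               # "Unknown" category, reserved GB
class_reserved=1                  # "Class" category, reserved GB
total=$((heap_reserved + gc_reserved + unknown_reserved + class_reserved))
echo "${total}GB"                 # -> 221GB, matching the NMT total
```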
"As in, Java OOMEs? OOM killer? Or the pod being killed from the pod management?"

Our canary pods using ZGC were OOM-killed, yes. It's also visible in our metrics how the container_memory_working_set_bytes of the pods using ZGC went above 20GB, even though they were set to use a max heap of 6GB.
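[Editorial note: for context on that metric, container_memory_working_set_bytes is, as far as I know, computed by cAdvisor as the cgroup's total memory usage minus inactive file-cache pages, so it captures touched JVM memory plus active page cache, not heap alone. A sketch of the arithmetic with invented numbers; the real inputs would be /sys/fs/cgroup/memory.current and the inactive_file row of memory.stat on cgroup v2.]

```shell
# Sketch of how the working-set metric is derived (cgroup v2 file names in
# comments; the byte counts below are invented for illustration only).
usage_bytes=$((21 * 1024 * 1024 * 1024))    # would come from memory.current
inactive_file=$((1 * 1024 * 1024 * 1024))   # "inactive_file" row of memory.stat
working_set=$((usage_bytes - inactive_file))
echo "$((working_set / 1024 / 1024 / 1024))GB"   # -> 20GB
```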
Also, I forgot to mention (in case it helps) that we are running:

openjdk 21.0.4 2024-07-16 LTS
OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)

Best, Marçal

Message from Thomas Stüfe <thomas.stuefe@gmail.com> on Mon, 28 Oct 2024 at 10:25:
Hi Marçal,

Too little information to say anything; I would need the NMT report, possibly a jcmd System.map, and possibly the GC log. I am also not aware of any sizing recommendations for switching from G1 to ZGC, but they probably exist, and the ZGC devs who normally frequent this mailing list know this stuff better than I do.

Cheers, Thomas

On Mon, Oct 28, 2024 at 10:58 AM Marçal Perapoch Amadó <marcal.perapoch@gmail.com> wrote:
Hello again, Thomas.

Attaching the NMT report we got from running our app with -XX:NativeMemoryTracking=detail and extracting it with `jcmd <PID> VM.native_memory detail`, plus the GC log and a screenshot of the `top` command. Our application runs on Kubernetes on Google Cloud Platform, using openjdk version "21.0.4" 2024-07-16 LTS. Unfortunately we could not get the System.map report because we are on Java 21.

Please let me know if you need more information.

Cheers, Marçal

Message from Thomas Stüfe <thomas.stuefe@gmail.com> on Mon, 28 Oct 2024 at 11:11:
I don't see a problem. The process has an RSS of 2.7 GB. The JVM, according to NMT, has ~7 GB committed. That seems to be in line with a heap of 6 GB.

On Mon, Oct 28, 2024 at 4:19 PM Marçal Perapoch Amadó <marcal.perapoch@gmail.com> wrote:
Hello again, Thomas.
Attaching the NMT report we got from running our app with -XX:NativeMemoryTracking=detail and extracted with `jcmd <PID> VM.native_memory detail`, the GC log and a screenshot of the `top` command.
Our application is running on K8s in Google Cloud Platform using openjdk version "21.0.4" 2024-07-16 LTS.
Unfortunately we could not get the System.map report because we are using java 21.
Please let me know if you need more information.
Cheers, Marçal
Missatge de Thomas Stüfe <thomas.stuefe@gmail.com> del dia dl., 28 d’oct. 2024 a les 11:11:
Hi Marcel,
Too little information to say anything - would need NMT report, possible jcmd System.map, and possibly the GC log. I am also not aware of any sizing recommendations when switching from G1 to ZGC, but they probably exist and the ZGC devs that normally frequent this ML know this stuff better than I do.
Cheers, Thomas
On Mon, Oct 28, 2024 at 10:58 AM Marçal Perapoch Amadó < marcal.perapoch@gmail.com> wrote:
Hey Thomas,
Thanks a lot for your answer and the information you provided. I think you are right about generational not using multi-mapping ( https://openjdk.org/jeps/439 - "No multi-mapped memory") also I didn't know about the max heap size * 16, which does seems to match the numbers I was seeing in my computer. Good info, thanks again!
As in, Java OOMEs? OOM killer? Or the pod being killed from the pod management? Our canary pods using ZGC were OOM killed, yes. It's also visible in our metrics how the "container_memory_working_set_bytes" of the pods using zgc went above 20GB even though they were set to use a max heap of 6GB.
Also, I forgot to mention (in case it helps) we are running: openjdk 21.0.4 2024-07-16 LTS OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS) OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)
Best, Marçal
Missatge de Thomas Stüfe <thomas.stuefe@gmail.com> del dia dl., 28 d’oct. 2024 a les 10:25:
Hi Marcal,
likely a red herring - "reserved" should not matter unless you artificially limit the address space size of the process (e.g. with ulimit -v). And even then, ZGC should just work around this limit. Reserved is just address space, and modern 64-bit OSes don't penalize you for allocating large swathes of address space. It should not cost any real memory.
About the large number: AFAIK ZGC in generational mode does not do multi-mapping anymore. Both Generational and Single Gen, however, do over-allocate address space (max heap size * 16) - that number may be smaller if capped by whatever is physically possible on the machine. It does that because it rolls its own variant of physical-to-virtual memory mapping, and needs room to maneuver. This is done to fight fragmentation effects.
If you want to know how much memory the process uses, the "committed" numbers in NMT are a lot closer to the truth. They are not the truth, however, since memory can be committed but still untouched and therefore not live, for example when pre-committing with -Xmx==-Xms. In that case, "committed" probably also overreports memory use.
We are working on improving NMT; future versions will report the live memory size too, if it can be cheaply obtained. The upcoming version of Java 24 also contains an improved variant of jcmd System.map, which tells you the live size for each memory segment, and at the end the actual live size of all memory. At least on Linux.
> our canary nodes were suddenly killed by OOM
As in, Java OOMEs? OOM killer? Or the pod being killed from the pod management?
HTH,
Cheers, Thomas
On Mon, Oct 28, 2024 at 9:11 AM Marçal Perapoch Amadó <marcal.perapoch@gmail.com> wrote:
> [original message quoted in full above; snipped]
* Marçal Perapoch Amadó:
>> As in, Java OOMEs? OOM killer? Or the pod being killed from the pod management?
>
> Our canary pods using ZGC were OOM killed, yes. It's also visible in our metrics how the "container_memory_working_set_bytes" of the pods using ZGC went above 20GB even though they were set to use a max heap of 6GB.
I think some container hosts kill processes based on RSS alone, so even memory-mapped I/O can trigger this. From the host's perspective, it doesn't matter if the memory is only used for caching and could be discarded at any time because it's a read-only MAP_SHARED mapping of a file.

Thanks,
Florian
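The reserved-versus-resident distinction that both replies hinge on can be observed directly on Linux via /proc. A minimal sketch (Linux-only; it reads the current process's own status file, so the values shown are for the JVM running it):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RssVsReserved {
    public static void main(String[] args) throws IOException {
        // VmSize = total mapped address space (NMT's huge "reserved" numbers
        // live here); VmRSS = pages actually resident in physical memory.
        // Container memory accounting and OOM kills typically track residency,
        // not address-space size.
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.startsWith("VmSize:") || line.startsWith("VmRSS:")) {
                System.out.println(line);
            }
        }
    }
}
```

Run under ZGC, VmSize will be enormous while VmRSS stays near the committed heap plus JVM overhead - which is why RSS (or the working-set metric Marçal mentions) is the number to watch, not "reserved".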
participants (3)
- Florian Weimer
- Marçal Perapoch Amadó
- Thomas Stüfe