ZGC Unable to reclaim memory for long time
Hi,
I ran into this issue where ZGC is unable to reclaim memory for hours or even days. It just keeps printing "Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space", and Allocation Stalls are happening on that thread.

Here are the metrics, which show that even though there is garbage, the GC is unable to reclaim it:

....
[2019-11-04T08:39:53.986+0000][1765465.981s][info][gc,heap] GC(112126) Live:      -  6366M (78%)  6366M (78%)  6366M (78%)  -  -
[2019-11-04T08:39:53.986+0000][1765465.981s][info][gc,heap] GC(112126) Garbage:   -  1735M (21%)  1735M (21%)  1731M (21%)  -  -
[2019-11-04T08:39:53.986+0000][1765465.981s][info][gc,heap] GC(112126) Reclaimed: -  -            0M (0%)      4M (0%)
...
[2019-11-04T16:48:53.742+0000][1794805.738s][info][gc,heap] GC(135520) Live:      -  6367M (78%)  6367M (78%)  6367M (78%)  -  -
[2019-11-04T16:48:53.742+0000][1794805.738s][info][gc,heap] GC(135520) Garbage:   -  1730M (21%)  1730M (21%)  1724M (21%)  -  -
[2019-11-04T16:48:53.742+0000][1794805.738s][info][gc,heap] GC(135520) Reclaimed: -  -            0M (0%)      6M (0%)

It was in this state for ~8 hours and it is still happening. The log says there is ~1.7G (21%) of garbage, but the GC is not able to reclaim it; each cycle it reclaims only 4-6M.

Any idea what might be the issue here?

TIA
Sundar
Hi,

When a workload produces a uniformly swiss-cheesy heap, i.e. where all parts of the heap have roughly the same amount of garbage, the GC faces a situation where there are no free lunches: it will have to work hard (compact a lot) to reclaim memory. Therefore, the GC tolerates a certain amount of fragmentation/waste, in the hope that more objects will die soon, making compaction less expensive (at the expense of using more memory for a while). How many CPU cycles to spend on compaction vs. how much memory you can spare is of course a trade-off.

You can use -XX:ZFragmentationLimit to control this. It currently defaults to 25%, and your workload seems to stabilize at 21%. If you want more aggressive compaction/reclamation, set -XX:ZFragmentationLimit to something below 21. This may or may not be a good trade-off in your case. The alternative is to give the GC a larger heap to work with.

cheers,
Per

On 11/4/19 7:56 PM, Sundara Mohan M wrote:
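The trade-off Per describes can be put into a toy model. The sketch below is purely conceptual, not HotSpot's actual relocation-set selection; the page counts, garbage fractions, and the `reclaimable` helper are all invented for illustration. It shows why a heap where every page is ~21% garbage yields almost nothing under a 25% fragmentation limit:

```java
// Conceptual model (NOT HotSpot code): a ZGC-style collector compacts the
// most garbage-rich pages first, and stops once the remaining overall waste
// is within the tolerated fragmentation limit (default 25%).
public class FragmentationModel {
    static double reclaimable(double[] pageGarbage, double fragmentationLimit) {
        double[] pages = pageGarbage.clone();
        java.util.Arrays.sort(pages); // ascending garbage fraction
        double totalGarbage = 0;
        for (double g : pages) totalGarbage += g;
        double reclaimed = 0;
        int n = pages.length;
        for (int i = n - 1; i >= 0; i--) { // most garbage-rich page first
            // Stop compacting once overall waste is within the limit.
            if ((totalGarbage - reclaimed) / n <= fragmentationLimit) break;
            reclaimed += pages[i];
        }
        return reclaimed;
    }

    public static void main(String[] args) {
        // Uniformly "swiss-cheesy" heap: every page ~21% garbage, which is
        // below the 25% default limit, so nothing is selected for relocation.
        double[] uniform = new double[100];
        java.util.Arrays.fill(uniform, 0.21);
        System.out.println(reclaimable(uniform, 0.25)); // 0.0

        // A lower limit (e.g. -XX:ZFragmentationLimit=15) forces compaction
        // until the waste drops to 15% of the heap.
        System.out.println(reclaimable(uniform, 0.15)); // > 0
    }
}
```

Under this model the log lines above are the expected behavior, not a leak: with 21% uniform garbage and a 25% limit, every cycle legitimately reclaims almost nothing.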
Hi Per,
This explains why it didn't reclaim memory; my heap was 8G and 6G was strongly reachable (when I took a heap dump). Agreed, increasing heap memory will help in this case.

Still trying to understand ZGC better:
1. Shouldn't the GC try to be more aggressive and put more effort into reclaiming without additional settings?
2. Is there a reason it shouldn't give more CPU to GC threads and reclaim garbage (say, after X GC runs that could not reclaim memory)? In this case it would be better to reclaim existing garbage instead of hitting Allocation Stalls and failing with a heap out-of-memory error.

Thanks
Sundar

On Mon, Nov 4, 2019 at 12:40 PM Per Liden <per.liden@oracle.com> wrote:
Reading this and similar threads, I am struck by the fact that ZGC users are experiencing things that users of Azul's Zing JVM also go through. I remember the amazement at seeing a JVM run without substantive GC pauses and thinking that it was a free lunch. But the price had two parts: ensuring an adequate heap, and rewiring brains that are accustomed to seeing CPU and memory as independent resources. The second turns out to be much harder.

From experience, I think a lot of pain can be avoided by clearly communicating that an adequate heap is a prerequisite for a healthy JVM. Most Java developers have absorbed the notion that large heaps are bad/risky, and unlearning takes time.
On 11/5/19 4:48 PM, Peter Booth wrote:
The documentation on the ZGC wiki [1] tries to be clear about this, but I'm sure it could be improved.

[1] https://wiki.openjdk.java.net/display/zgc/Main

cheers,
Per
Hi Per,
As per [1] https://wiki.openjdk.java.net/display/zgc/Main, ZGC can handle heaps from a few hundred megabytes to multi-terabyte. So my understanding was that if my application was running fine with 8G before, it should also run without issues on the same heap with ZGC. So far that has not been the case: I always have to increase the heap size to get the same latency/RPS (with heaps ranging from 8-48G, I need to move to a higher value to keep the same RPS and latency). Again, this is my observation and it might vary for different workloads.

Thanks
Sundar

On Wed, Nov 6, 2019 at 2:46 AM Per Liden <per.liden@oracle.com> wrote:
What do you mean by “before”? Thanks.
To be clear, by "before" I was referring to my previous GC (Parallel/G1/CMS). Here is what I was seeing:

Instance1 - 8G heap, ParallelGC, 100 RPS, 200ms latency
Instance2 - 8G heap, ZGC, 100 RPS, 600ms latency
Instance3 - 32G heap, ZGC, 100 RPS, 200ms latency

My expectation was that Instance2 should give me the same result as Instance1, but that is not the case. Instead I had to move to the Instance3 setting to get what I want. This is just my observation and it might not be the same for all workloads. On the other hand, with ZGC my throughput increased from 97% (ParallelGC) to 99.7% (ZGC), and my STW pauses have never crossed 20ms.

Thanks
Sundar

On Wed, Nov 6, 2019 at 12:47 PM Florian Weimer <fw@deneb.enyo.de> wrote:
ZGC and other pauseless/low-pause collectors are designed to let your process continue running while the GC is running, which is why your pause times are low. The problem is that if your process is still running, it's still allocating. If you are filling up the heap faster than ZGC can keep up, you will see degraded performance. Assuming your allocations are mostly not long-lived, you give ZGC more time to collect by giving it a bigger heap, which takes longer for you to fill.

This is the mindset shift that Peter Booth was referring to earlier. A lot of engineers have developed, or been trained into, a mindset where smaller heaps are better, particularly by STW GCs like ParallelGC. That approach is not helpful with concurrent collectors, and it's something you can already see with G1GC: in most cases giving G1GC a larger heap will improve performance, though it does depend on your workload and allocation/lifetime pattern.

So the fact that ZGC can handle heaps that are TBs in size doesn't mean you can expect better performance by moving to ZGC while keeping the heap size the same, even if the heap is only a few GB.

Niall
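Niall's point can be put into rough numbers. The sketch below is a back-of-envelope model only; the allocation rate and cycle time are hypothetical figures, not measurements from Sundar's workload:

```java
// Back-of-envelope model of concurrent-GC headroom: the application must not
// exhaust the free heap before a concurrent GC cycle finishes, or it hits
// allocation stalls. All numbers below are hypothetical.
public class HeadroomModel {
    // Seconds until the free heap is exhausted at a given allocation rate.
    static double secondsToFill(double heapGb, double liveGb, double allocGbPerSec) {
        return (heapGb - liveGb) / allocGbPerSec;
    }

    public static void main(String[] args) {
        double live = 6.0;      // live set (GB), as in Sundar's heap dump
        double allocRate = 0.5; // hypothetical allocation rate (GB/s)
        double gcCycle = 5.0;   // hypothetical concurrent cycle time (s)

        // 8G heap: only 2G of headroom -> 4s to fill, less than the 5s
        // cycle, so allocation stalls before the cycle completes.
        System.out.println(secondsToFill(8, live, allocRate) >= gcCycle);  // false

        // 32G heap: 26G of headroom -> 52s to fill, ample margin.
        System.out.println(secondsToFill(32, live, allocRate) >= gcCycle); // true
    }
}
```

Under these assumed numbers, the jump from Instance2 (8G) to Instance3 (32G) is exactly the move from "stalls every cycle" to "plenty of margin", which matches the latency difference Sundar reported.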
* Sundara Mohan M.:
To be clear "before" i was referring to my previous GC, Parallel/G1/CMS.
Here is what i was seeing:

Instance1 - 8G heap, ParallelGC, 100 RPS, 200ms latency
Instance2 - 8G heap, ZGC, 100 RPS, 600ms latency
Instance3 - 32G heap, ZGC, 100 RPS, 200ms latency
Latency is measured end-to-end, including processing time and stalls etc.? Keep in mind that ZGC does not support compressed pointers, so if your workload is pointer-heavy, the VM has to do more work, and the memory requirements are also higher.
My expectation was Instance2 should give me same result as Instance1 but that is not the case. Instead i had to move to Instance3 setting to get what i want.
It's difficult to beat ParallelGC in terms of overall CPU efficiency. If you do not have CPU cycles to spare and the GC does not have sufficient extra heap to work with beyond the live object set (to some extent, you can trade RAM vs CPU), application performance will suffer. You could also give Shenandoah a try. 8-)
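Florian's compressed-pointers point can be quantified very roughly: ZGC always uses full 64-bit references, while collectors that support compressed oops use 32-bit references on heaps up to ~32G. The object count and reference density below are hypothetical, purely to show the order of magnitude:

```java
// Rough estimate of the extra heap consumed by uncompressed (8-byte)
// references compared to 4-byte compressed oops. The reference density is a
// made-up figure; real values depend entirely on the object graph.
public class PointerOverhead {
    static double extraHeapGb(long objectCount, double avgRefsPerObject) {
        long extraBytesPerRef = 8 - 4; // 64-bit reference vs compressed oop
        return objectCount * avgRefsPerObject * extraBytesPerRef
                / (1024.0 * 1024 * 1024);
    }

    public static void main(String[] args) {
        // E.g. 100M objects averaging 3 references each -> roughly 1.1G of
        // extra footprint, which is significant on an 8G heap whose live set
        // is already 6G.
        System.out.println(extraHeapGb(100_000_000L, 3.0)); // ~1.1 (GB)
    }
}
```

This overhead alone would not explain a 600ms latency figure, but it does eat into exactly the headroom a concurrent collector needs.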
On 11/6/19 10:04 PM, Sundara Mohan M wrote: [...]
But on the other hand, with ZGC my throughput increased from 97% (ParallelGC) to 99.7% (ZGC), and my STW pauses have never crossed 20ms.
Could you please paste the last printout of the "Garbage Collection Statistics" table? If you have pauses close to 20ms, it would be interesting to see where that time is spent. I would assume it's accounted to "Subphase: Pause Roots Threads", but seeing the whole table would tell me more.

thanks,
Per
Unfortunately I don't have those stats. I was running the application with this GC logging option: -Xlog:gc,gc+init,gc+start,gc+phases,gc+heap,gc+cpu,gc+reloc,gc+ref,gc+marking,gc+metaspace=info. I will see if I can reproduce the issue and get that data for you.

Thanks
Sundar

On Wed, Nov 6, 2019 at 3:08 PM Per Liden <per.liden@oracle.com> wrote:
On 11/7/19 2:28 AM, Sundara Mohan M wrote:
Ok, thanks. I'd recommend using -Xlog:gc*; that way you catch pretty much everything you need to tell what's going on, without being super verbose.

/Per
Hi, On 11/5/19 2:27 AM, Sundara Mohan M wrote:
The tricky part is knowing/detecting when to be more aggressive, since it tends to become an exercise in trying to predict the future. Reacting when something bad happens (e.g. an allocation stall) tends to be too late.

However, before thinking too much about heuristics, we might just want to reconsider the ZFragmentationLimit default value, as it is perhaps a bit too generous today. Most apps I've looked at tend to stabilize somewhere between 2-10% fragmentation/waste (i.e. way below 25%), so lowering the default might not hurt most apps, but could help some.

cheers,
Per
Hi Per,
Thanks. Will try changing the ZFragmentationLimit value to see if it helps.

Regards
Sundar

On Wed, Nov 6, 2019 at 2:38 AM Per Liden <per.liden@oracle.com> wrote:
participants (5)
- Connaughton, Niall
- Florian Weimer
- Per Liden
- Peter Booth
- Sundara Mohan M