ZGC Unable to reclaim memory for long time

Wed Nov 6 10:44:38 UTC 2019

On 11/5/19 4:48 PM, Peter Booth wrote:
> Reading this and similar threads I am struck by the fact that ZGC users are experiencing things that users of Azul’s Zing JVM also go through. I remember the amazement at seeing a JVM run without substantive GC pauses and thinking that it was a free lunch. But the price had two parts: ensuring adequate heap, and rewiring brains that are accustomed to seeing CPU and memory as independent resources. The second turns out to be much harder.
> 
> From experience, I think a lot of pain can be avoided by clearly communicating that an adequate heap is a prerequisite for a healthy JVM. Most Java developers have absorbed the notion that large heaps are bad/risky, and unlearning that takes time.

The documentation on the ZGC wiki [1] tries to be clear about this, but 
I'm sure it could be improved.

[1] https://wiki.openjdk.java.net/display/zgc/Main

cheers,
Per

> 
>> On Nov 4, 2019, at 8:28 PM, Sundara Mohan M <m.sundar85 at gmail.com> wrote:
>>
>> Hi Per,
>> This explains why it could not reclaim memory. My heap was 8G, and 6G of
>> it was strongly reachable (when I took a heap dump). Agreed, increasing
>> the heap will help in this case.
>>
>> I am still trying to understand ZGC better:
>> 1. Shouldn't the GC become more aggressive and put more effort into
>> reclaiming memory without additional settings?
>> 2. Is there a reason why it shouldn't give more CPU to the GC threads to
>> reclaim garbage (say, after X GC cycles that could not reclaim memory)? In
>> this case it would be better to reclaim the existing garbage than to hit
>> Allocation Stalls and fail with an OutOfMemoryError.
>>
>>
>> Thanks
>> Sundar
>>
>>> On Mon, Nov 4, 2019 at 12:40 PM Per Liden <per.liden at oracle.com> wrote:
>>>
>>> Hi,
>>>
>>> When a workload produces a uniformly swiss-cheesy heap, i.e. where all
>>> parts of the heap have roughly the same amount of garbage, then the GC
>>> will face a situation where there are no free lunches and it will have
>>> to work hard (compact a lot) to reclaim memory. Therefore, the GC will
>>> tolerate a certain amount of fragmentation/waste, in the hope that more
>>> objects will die soon, making compaction less expensive (at the expense
>>> of using more memory for a while). How many CPU cycles to spend on
>>> compaction vs. how much memory you can spare is of course a trade-off.
>>>
>>> You can use -XX:ZFragmentationLimit to control this. It currently
>>> defaults to 25% and your workload seems to stabilize at 21%. If you want
>>> more aggressive compaction/reclamation, then set the
>>> -XX:ZFragmentationLimit to something below 21. This may or may not be a
>>> good trade-off in your case. The alternative is to give the GC a larger
>>> heap to work with.
>>>
>>> cheers,
>>> Per
>>>
>>>> On 11/4/19 7:56 PM, Sundara Mohan M wrote:
>>>> Hi,
>>>>     I ran into an issue where ZGC is unable to reclaim memory for a few
>>>> hours/days. It just keeps printing "Exception in thread "RMI TCP
>>>> Connection(idle)" java.lang.OutOfMemoryError: Java heap space", and
>>>> Allocation Stalls are happening on that thread.
>>>>
>>>>
>>>> Here are the metrics, which show that even though there is garbage,
>>>> the GC is unable to reclaim it:
>>>>
>>>> ....
>>>> [2019-11-04T08:39:53.986+0000][1765465.981s][info][gc,heap     ] GC(112126)      Live:         -              6366M (78%)        6366M (78%)        6366M (78%)             -                  -
>>>> [2019-11-04T08:39:53.986+0000][1765465.981s][info][gc,heap     ] GC(112126)   Garbage:         -              1735M (21%)        1735M (21%)        1731M (21%)             -                  -
>>>> [2019-11-04T08:39:53.986+0000][1765465.981s][info][gc,heap     ] GC(112126) Reclaimed:         -                  -                 0M (0%)            4M (0%)
>>>> ...
>>>>
>>>> [2019-11-04T16:48:53.742+0000][1794805.738s][info][gc,heap     ] GC(135520)      Live:         -              6367M (78%)        6367M (78%)        6367M (78%)             -                  -
>>>> [2019-11-04T16:48:53.742+0000][1794805.738s][info][gc,heap     ] GC(135520)   Garbage:         -              1730M (21%)        1730M (21%)        1724M (21%)             -                  -
>>>> [2019-11-04T16:48:53.742+0000][1794805.738s][info][gc,heap     ] GC(135520) Reclaimed:         -                  -                 0M (0%)            6M (0%)
>>>>
>>>> It was in this state for ~8 hours and it is still happening. The log
>>>> reports about 1.7G (21%) of garbage, but each cycle reclaims only 4-6M
>>>> of it.
>>>>
>>>> Any idea what might be the issue here?
>>>>
>>>>
>>>> TIA
>>>> Sundar
>>>>
>>>
>
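
Per's point about a uniformly "swiss-cheesy" heap and the fragmentation limit can be illustrated with a deliberately simplified toy model. This is a sketch only, not ZGC's actual relocation-set selection: the class name, the helper pagesWorthCompacting, and the per-page garbage percentages are all made up for illustration. The assumption is simply that a page is worth compacting only when its garbage share exceeds the limit.

```java
import java.util.Arrays;

public class FragmentationLimitSketch {

    // Hypothetical helper (not a HotSpot API): counts the pages whose
    // garbage share, in percent, exceeds the given fragmentation limit.
    static long pagesWorthCompacting(double[] garbagePercentPerPage, double limitPercent) {
        return Arrays.stream(garbagePercentPerPage)
                     .filter(garbage -> garbage > limitPercent)
                     .count();
    }

    public static void main(String[] args) {
        // Uniform heap: every page carries 21% garbage, as in the log above.
        double[] uniform = new double[8];
        Arrays.fill(uniform, 21.0);

        // Skewed heap with the same 21% average: the garbage sits in two pages.
        double[] skewed = {84, 84, 0, 0, 0, 0, 0, 0};

        // With a 25% limit no page of the uniform heap qualifies.
        System.out.println(pagesWorthCompacting(uniform, 25.0)); // prints 0
        // With a limit below 21% every page of the uniform heap qualifies.
        System.out.println(pagesWorthCompacting(uniform, 20.0)); // prints 8
        // The skewed heap offers two cheap pages even at the 25% limit.
        System.out.println(pagesWorthCompacting(skewed, 25.0));  // prints 2
    }
}
```

In this model the uniform heap gives the collector nothing cheap to reclaim until the limit drops below the per-page garbage level, which matches the suggestion above to set -XX:ZFragmentationLimit below 21 or to provide a larger heap.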