Hi,

On 4/12/22 17:59, Sundara Mohan M wrote:
Hi,

Thanks for clarifying SoftRefs. I am already using -XX:-UseDynamicNumberOfGCThreads and a fixed number of concurrent GC threads (-XX:ConcGCThreads=9; the host has 48 threads). I did notice that the live set increases whenever a stall happens. Is there an easy way to get object statistics when a stall occurs? Right now I am trying to collect a heap dump whenever a stall happens.
Another way to get a quick overview of what's on the heap is:

jcmd <pid> GC.class_histogram

/Per
Thanks,
Sundar
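As a side note on the heap-dump idea: a dump can also be triggered programmatically from inside the JVM via the HotSpotDiagnosticMXBean, so a watchdog could capture one the moment a stall is detected. A minimal sketch (the class and file names are made up; note that dumpHeap fails if the target file already exists):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class StallHeapDump {
    public static void main(String[] args) throws Exception {
        // Example output path; dumpHeap refuses to overwrite an existing file,
        // so remove any stale dump first.
        File out = new File("stall.hprof");
        out.delete();

        // Obtain the HotSpot diagnostic bean and write a dump of live objects.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(out.getPath(), true);  // true = only live objects

        System.out.println(out.length() > 0);
    }
}
```

The resulting .hprof file can then be inspected with the usual heap-analysis tools.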
On Tue, Apr 12, 2022 at 2:14 AM Per Liden <per.liden@oracle.com> wrote:
Hi,
When an allocation stall occurs, ZGC will start to clear SoftRefs. In other words, the clearing of SoftRefs you see is likely just a side effect of running into an allocation stall.
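To illustrate the SoftReference behavior described above, here is a minimal, self-contained sketch (the class name is made up, and the manual clear() call stands in for what the collector does under memory pressure, since actual GC clearing is nondeterministic):

```java
import java.lang.ref.SoftReference;

public class SoftRefStallDemo {
    public static void main(String[] args) {
        byte[] payload = new byte[1024];               // strong reference keeps it alive
        SoftReference<byte[]> cache = new SoftReference<>(payload);

        // While the referent is strongly reachable, the soft reference holds it.
        System.out.println(cache.get() != null);

        // Under memory pressure (e.g. an allocation stall) the GC clears
        // softly reachable objects; clear() simulates that deterministically.
        cache.clear();
        System.out.println(cache.get() == null);
    }
}
```

In a real run the referent would only be cleared by the collector itself, which is why the enqueued counts in the logs spike exactly when stalls occur.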
A longer concurrent mark phase can be caused by a number of things. For example, the application might have a larger live set, or the machine might have been overloaded. However, it could also be that ZGC's heuristic for selecting the number of GC threads got it wrong and used too few, which would also prolong the concurrent mark phase. You could try the option -XX:-UseDynamicNumberOfGCThreads to disable this heuristic and instead always use a fixed number of threads.
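The suggestion above can be sketched as a launch line. This is a hypothetical example, not the poster's actual command; the thread counts are placeholders to be tuned per host (the log below shows 9 concurrent and 44 parallel workers):

```shell
# Pin ZGC's thread counts instead of letting the heuristic choose.
java -XX:+UseZGC \
     -XX:-UseDynamicNumberOfGCThreads \
     -XX:ConcGCThreads=9 \
     -XX:ParallelGCThreads=44 \
     -jar app.jar
```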
/Per
On 2022-04-11 22:50, Sundara Mohan M wrote:
> Hi,
> I am running ZGC in production with the following settings:
>
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Initializing The Z Garbage Collector
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Version: 17.0.2+8 (release)
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] NUMA Support: Enabled
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] NUMA Nodes: 2
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] CPUs: 72 total, 72 available
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Memory: 191871M
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Large Page Support: Disabled
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] GC Workers: 9/44 (static)
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Address Space Type: Contiguous/Unrestricted/Complete
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Address Space Size: 1572864M x 3 = 4718592M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Heap Backing File: /memfd:java_heap
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Heap Backing Filesystem: tmpfs (0x1021994)
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Min Capacity: 98304M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Initial Capacity: 98304M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Max Capacity: 98304M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Medium Page Size: 32M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Pre-touch: Disabled
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Available space on backing filesystem: N/A
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Uncommit: Implicitly Disabled (-Xms equals -Xmx)
> [2022-04-05T22:30:48.788+0000][11.672s][info][gc,init] Runtime Workers: 44
>
> In 24 hours I have seen 4 allocation stalls, each happening over 5 mins with >1s stalls on multiple threads.
> Whenever allocation stalls, the following pattern is observed:
> 1. There is an increase in SoftRef enqueued values
> 2. Concurrent Mark time increases from 3s to ~12s
>
> [2022-04-10T19:46:22.380+0000][422145.264s][info][gc,ref ] GC(17874) Soft: 22330 encountered, 19702 discovered, 11571 enqueued
> [2022-04-10T19:46:34.534+0000][422157.418s][info][gc,ref ] GC(17875) Soft: 10781 encountered, 7560 discovered, 19 enqueued
> [2022-04-10T19:46:47.755+0000][422170.639s][info][gc,ref ] GC(17876) Soft: 10863 encountered, 3289 discovered, 44 enqueued
> ...
> [2022-04-11T05:31:55.763+0000][457278.647s][info][gc,ref ] GC(19267) Soft: 9664 encountered, 868 discovered, 12 enqueued
> [2022-04-11T05:32:09.553+0000][457292.437s][info][gc,ref ] GC(19268) Soft: 9846 encountered, 6202 discovered, 98 enqueued
> [2022-04-11T05:32:23.798+0000][457306.682s][info][gc,ref ] GC(19269) Soft: 9859 encountered, 6715 discovered, 132 enqueued
> [2022-04-11T05:32:37.903+0000][457320.787s][info][gc,ref ] GC(19270) Soft: 9873 encountered, 7585 discovered, 42 enqueued
> [2022-04-11T05:32:52.609+0000][457335.493s][info][gc,ref ] GC(19271) Soft: 10073 encountered, 6466 discovered, 57 enqueued
> [2022-04-11T11:13:14.758+0000][477757.642s][info][gc,ref ] GC(20578) Soft: 25760 encountered, 20850 discovered, 15635 enqueued
>
> Can someone share info about what else might cause Concurrent Mark time to go higher?
> Would clearing SoftReferences more quickly via the SoftRefLRUPolicyMSPerMB flag, or not using SoftRefs at all, help in this case?
>
> Can send complete logs if it helps.
>
> Thanks
> Sundar
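Regarding the SoftRefLRUPolicyMSPerMB question above: this flag (default 1000) sets how many milliseconds a softly reachable object is kept alive per MB of free heap since its last access, so lowering it makes SoftReferences expire sooner. A hypothetical launch line, not a recommendation for this workload:

```shell
# Expire soft references ~10x faster than the default of 1000 ms per free MB;
# setting the value to 0 would clear them at every collection.
java -XX:+UseZGC -XX:SoftRefLRUPolicyMSPerMB=100 -jar app.jar
```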