Hi,

On 4/12/22 17:59, Sundara Mohan M wrote:
Hi,

Thanks for clarifying SoftRefs. I am already using -XX:-UseDynamicNumberOfGCThreads and a fixed number of concurrent GC threads (-XX:ConcGCThreads=9; the host has 48 threads). I did notice that the live set increases whenever a stall happens. Is there an easy way to get object statistics when a stall occurs? Right now I am trying to collect a heap dump whenever a stall happens.
Another way to get a quick overview of what's on the heap is:

jcmd <pid> GC.class_histogram

/Per
Thanks,
Sundar
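As a side note on the heap-dump idea: a dump can also be triggered programmatically from inside the JVM via the HotSpotDiagnosticMXBean, so a watchdog could capture one the moment a stall is detected. A minimal sketch (the class and file names are made up; note that dumpHeap fails if the target file already exists):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class StallHeapDump {
    public static void main(String[] args) throws Exception {
        // Example output path; dumpHeap refuses to overwrite an existing file,
        // so remove any stale dump first.
        File out = new File("stall.hprof");
        out.delete();

        // Obtain the HotSpot diagnostic bean and write a dump of live objects.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(out.getPath(), true);  // true = only live objects

        System.out.println(out.length() > 0);
    }
}
```

The resulting .hprof file can then be inspected with the usual heap-analysis tools.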
On Tue, Apr 12, 2022 at 2:14 AM Per Liden <per.liden@oracle.com> wrote:
Hi,
When an allocation stall occurs, ZGC will start to clear SoftRefs. In other words, the clearing of SoftRefs you see is likely just a side effect of running into an allocation stall.
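To illustrate the SoftReference behavior described above, here is a minimal, self-contained sketch (the class name is made up, and the manual clear() call stands in for what the collector does under memory pressure, since actual GC clearing is nondeterministic):

```java
import java.lang.ref.SoftReference;

public class SoftRefStallDemo {
    public static void main(String[] args) {
        byte[] payload = new byte[1024];               // strong reference keeps it alive
        SoftReference<byte[]> cache = new SoftReference<>(payload);

        // While the referent is strongly reachable, the soft reference holds it.
        System.out.println(cache.get() != null);

        // Under memory pressure (e.g. an allocation stall) the GC clears
        // softly reachable objects; clear() simulates that deterministically.
        cache.clear();
        System.out.println(cache.get() == null);
    }
}
```

In a real run the referent would only be cleared by the collector itself, which is why the enqueued counts in the logs spike exactly when stalls occur.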
A longer concurrent mark phase can be caused by a number of things. For example, the application might have a larger live set, or the machine might have been overloaded. However, it could also be that ZGC's heuristic for selecting the number of GC threads got it wrong and used too few, which would also prolong the concurrent mark phase. You could try the option -XX:-UseDynamicNumberOfGCThreads to disable this heuristic and instead always use a fixed number of threads.
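The suggestion above can be sketched as a launch line. This is a hypothetical example, not the poster's actual command; the thread counts are placeholders to be tuned per host (the log below shows 9 concurrent and 44 parallel workers):

```shell
# Pin ZGC's thread counts instead of letting the heuristic choose.
java -XX:+UseZGC \
     -XX:-UseDynamicNumberOfGCThreads \
     -XX:ConcGCThreads=9 \
     -XX:ParallelGCThreads=44 \
     -jar app.jar
```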
/Per
On 2022-04-11 22:50, Sundara Mohan M wrote:
> Hi,
> I am running ZGC in production with the following settings:
>
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Initializing The Z Garbage Collector
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Version: 17.0.2+8 (release)
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] NUMA Support: Enabled
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] NUMA Nodes: 2
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] CPUs: 72 total, 72 available
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Memory: 191871M
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] Large Page Support: Disabled
> [2022-04-05T22:30:37.154+0000][0.038s][info][gc,init] GC Workers: 9/44 (static)
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Address Space Type: Contiguous/Unrestricted/Complete
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Address Space Size: 1572864M x 3 = 4718592M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Heap Backing File: /memfd:java_heap
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Heap Backing Filesystem: tmpfs (0x1021994)
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Min Capacity: 98304M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Initial Capacity: 98304M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Max Capacity: 98304M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Medium Page Size: 32M
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Pre-touch: Disabled
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Available space on backing filesystem: N/A
> [2022-04-05T22:30:37.157+0000][0.040s][info][gc,init] Uncommit: Implicitly Disabled (-Xms equals -Xmx)
> [2022-04-05T22:30:48.788+0000][11.672s][info][gc,init] Runtime Workers: 44
>
> In 24 hours I have seen 4 allocation stalls, each happening over 5 mins with >1s stalls on multiple threads.
> Whenever allocation stalls, the following pattern is observed:
> 1. There is an increase in SoftRef enqueued values
> 2. Concurrent Mark time increases from 3s to ~12s
>
> [2022-04-10T19:46:22.380+0000][422145.264s][info][gc,ref ] GC(17874) Soft: 22330 encountered, 19702 discovered, 11571 enqueued
> [2022-04-10T19:46:34.534+0000][422157.418s][info][gc,ref ] GC(17875) Soft: 10781 encountered, 7560 discovered, 19 enqueued
> [2022-04-10T19:46:47.755+0000][422170.639s][info][gc,ref ] GC(17876) Soft: 10863 encountered, 3289 discovered, 44 enqueued
> ...
> [2022-04-11T05:31:55.763+0000][457278.647s][info][gc,ref ] GC(19267) Soft: 9664 encountered, 868 discovered, 12 enqueued
> [2022-04-11T05:32:09.553+0000][457292.437s][info][gc,ref ] GC(19268) Soft: 9846 encountered, 6202 discovered, 98 enqueued
> [2022-04-11T05:32:23.798+0000][457306.682s][info][gc,ref ] GC(19269) Soft: 9859 encountered, 6715 discovered, 132 enqueued
> [2022-04-11T05:32:37.903+0000][457320.787s][info][gc,ref ] GC(19270) Soft: 9873 encountered, 7585 discovered, 42 enqueued
> [2022-04-11T05:32:52.609+0000][457335.493s][info][gc,ref ] GC(19271) Soft: 10073 encountered, 6466 discovered, 57 enqueued
> [2022-04-11T11:13:14.758+0000][477757.642s][info][gc,ref ] GC(20578) Soft: 25760 encountered, 20850 discovered, 15635 enqueued
>
> Can someone share info about what else might cause Concurrent Mark time to go higher?
> Would clearing SoftReferences more quickly via the SoftRefLRUPolicyMSPerMB flag, or not using SoftRefs at all, help in this case?
>
> Can send complete logs if it helps.
>
> Thanks
> Sundar
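Regarding the SoftRefLRUPolicyMSPerMB question above: this flag (default 1000) sets how many milliseconds a softly reachable object is kept alive per MB of free heap since its last access, so lowering it makes SoftReferences expire sooner. A hypothetical launch line, not a recommendation for this workload:

```shell
# Expire soft references ~10x faster than the default of 1000 ms per free MB;
# setting the value to 0 would clear them at every collection.
java -XX:+UseZGC -XX:SoftRefLRUPolicyMSPerMB=100 -jar app.jar
```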