Information on how to parse/interpret ZGC Logs

Per Liden per.liden at oracle.com
Mon Feb 24 12:32:46 UTC 2020


Hi,

On 2/13/20 8:20 AM, Prabhash Rathore wrote:
> Thank you Per for your help! It's very helpful.
> 
> I started my application GC configuration with default settings, I just 
> had Xmx set but because of memory allocation stalls and Gc pauses, I 
> tunes Concurrent threads and Parallel threads as default options didn't 
> see enough.
> 
> My reasoning for high parallel  thread configuration 
> (-XX:ParallelGCThreads=78) is that application threads are anyway 
> stalled during full pause so having higher threads (for now 80% of OS 
> threads) can work on collection. and keep the GC pause time lower. Again 
> I increased Concurrent threads from default value to keep collection 
> rate on par with the allocation rate.
> 
> You mentioned when I see Allocation Stall, increase heap size. I think I 
> have already heap configured at 80% of RAM size. For such allocation 
> stall, is there anything else I can tune other than heap size, 
> concurrent and parallel thread counts.
> 
> 
> Hi Peter,
> 
> This application runs on Linux RHEL 7.7. OS. Kernel version is 3.10.0-1062
> 
> Output of *egrep "thp|trans" /proc/vmstat:*
> nr_anon_transparent_hugepages 4722
> thp_fault_alloc 51664
> thp_fault_fallback 620147
> thp_collapse_alloc 11462
> thp_collapse_alloc_failed 20085
> thp_split 9350
> thp_zero_page_alloc 1
> thp_zero_page_alloc_failed 0

For best throughput and latency, I'd suggest you disable all use of 
transparent hugepages and instead configure the kernel hugepage pool and 
use the -XX:+UseLargePages JVM option.

cheers,
Per

> 
> 
> 
> Output of *tail -28 /proc/cpuinfo*
> power management:
> 
> processor : 95
> vendor_id : GenuineIntel
> cpu family : 6
> model : 85
> model name : Intel(R) Xeon(R) Gold 6263CY CPU @ 2.60GHz
> stepping : 7
> microcode : 0x5000021
> cpu MHz : 3304.431
> cache size : 33792 KB
> physical id : 1
> siblings : 48
> core id : 29
> cpu cores : 24
> apicid : 123
> initial apicid : 123
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
> xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor 
> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
> lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin 
> intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi 
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms 
> invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt 
> clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc 
> cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp 
> hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear 
> spec_ctrl intel_stibp flush_l1d arch_capabilities
> bogomips : 5206.07
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
> 
> 
> 
> 
> 
> On Mon, Feb 10, 2020 at 9:28 AM Peter Booth <peter_booth at me.com 
> <mailto:peter_booth at me.com>> wrote:
> 
>     Prabhash,
> 
>     What OS version?
>     Is it a vanilla OS install?
>     Can you print the output of the following?
>     (Assuming Linux)
>     egrep ‘thp|trans’ /proc/vmstat
>     tail -28 /proc/cpuinfo
> 
>     Peter
> 
>     Sent from my iPhone
> 
>      > On Feb 9, 2020, at 1:56 AM, Prabhash Rathore
>     <prabhashrathore at gmail.com <mailto:prabhashrathore at gmail.com>> wrote:
>      >
>      > Hi Per,
>      >
>      > Thanks for your reply!
>      >
>      > About ZGC logs, in general I am trying to understand following:
>      >
>      >   - What are the full pause times?
>      >   - How many such pause per unit time?
>      >   - Anything else which helps me eliminate GC as cause for high
>      >   application latency.
>      >
>      > This is how I have configured ZGC logging at JVM level, wondering
>     if I
>      > should add other tags like Safepoint to get more details about GC
>     stats:
>      > -Xlog:gc*=debug:file=gc.log
>      >
>      > All JVM flas used in my application:
>      > -Xms130G -Xmx130G
>      >
>     -Xlog:gc*=debug:file=/somedir/gc.log:time:filecount=10,filesize=67108864
>      > -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError
>      > -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ConcGCThreads=48
>      > -XX:ParallelGCThreads=96 -XX:-OmitStackTraceInFastThrow
>      >
>      > It's a large machine with 96 threads and 196 GB RAM.
>      >
>      > I have -XX:+AlwaysPreTouch configured as one another option. With
>      > AlwaysPreTouch option, I see in Linux top command shows a very
>     high shared
>      > and resident memory. My max heap size is configured as 130 GB but
>     I see
>      > shared memory is shown as 388 GB and Resident memory as 436 GB.
>     On the
>      > other hand, total virtual memory for this process in top is shown
>     as 17.1
>      > tera byte. How is this possible? My whole machine size is 196 GB
>     (is this
>      > accounting for things swapped out to disk). I did see without
>      > AlwaysPretouch, numbers look close to the heap size. Trying to
>     understand
>      > why with PreTouch, process memory is shown was higher than
>     configured size?
>      > I understand shared memory has all shared libs mapped out but how
>     can it be
>      > such. a large size?
>      >
>      > Regarding high GC pause time, I did notice that my machine was low on
>      > memory and it was swapping, hence slowing down everything. For
>     now I have
>      > disabled Swappines completely with Kernel VM tunable but I am
>     still trying
>      > to find the actual cause of why swapping kicked in. This machine
>     only runs
>      > this particular Java applicaion which has 130 GB heap size. Other
>     than
>      > heap, I still have 66 GB memory available on host. Trying to
>     figure out if
>      > there is a native memory leak. If you have any inputs on this
>     then please
>      > share.
>      >
>      > Thanks!
>      > Prabhash Rathore
>      >
>      >> On Mon, Feb 3, 2020 at 2:35 AM Per Liden <per.liden at oracle.com
>     <mailto:per.liden at oracle.com>> wrote:
>      >>
>      >> Hi,
>      >>
>      >>> On 2020-02-03 06:52, Prabhash Rathore wrote:
>      >>> Hello,
>      >>>
>      >>> We have decided to use ZGC Garbage Collector for our Java
>     application
>      >>> running on Java 11. I was wondering if there are any tools or any
>      >>> documenation on how to interpret ZGC logs.
>      >>
>      >> Is there something in particular in the logs you're wondering about?
>      >>
>      >>>
>      >>> I found following statistics in ZGC log which as per my
>     understanding
>      >> shows
>      >>> a very large allocation stall of 3902042.342 milliseconds. It
>     will be
>      >>> really great if I can get some help to understand this further.
>      >>
>      >> I can see that you've had making times that is more than an hour
>     long,
>      >> which suggests that something in your system is seriously wrong
>     (like
>      >> extremely overloaded or an extremely slow disk that you log
>     to... just
>      >> guessing here). I think you need to have a broader look at the
>     health of
>      >> the system before we can draw any conclusion from the GC logs.
>      >>
>      >> cheers,
>      >> Per
>      >>
>      >>>
>      >>> [2020-02-02T22:37:36.883+0000] === Garbage Collection Statistics
>      >>>
>      >>
>     =======================================================================================================================
>      >>> [2020-02-02T22:37:36.883+0000]
>      >>>                  Last 10s              Last 10m             
>     Last 10h
>      >>>          Total
>      >>> [2020-02-02T22:37:36.883+0000]
>      >>>                  Avg / Max             Avg / Max           
>       Avg / Max
>      >>>         Avg / Max
>      >>> [2020-02-02T22:37:36.883+0000]   Collector: Garbage Collection
>     Cycle
>      >>>                0.000 / 0.000      7789.187 / 7789.187  12727.424 /
>      >>> 3903938.012  1265.033 / 3903938.012   ms
>      >>> [2020-02-02T22:37:36.883+0000]  Contention: Mark Segment Reset
>     Contention
>      >>>                   0 / 0                10 / 1084            176
>     / 15122
>      >>>        42 / 15122       ops/s
>      >>> [2020-02-02T22:37:36.883+0000]  Contention: Mark SeqNum Reset
>     Contention
>      >>>                    0 / 0                 0 / 5               
>       0 / 31
>      >>>          0 / 31          ops/s
>      >>> [2020-02-02T22:37:36.883+0000]  Contention: Relocation Contention
>      >>>                   0 / 0                 0 / 3                 1
>     / 708
>      >>>         7 / 890         ops/s
>      >>> [2020-02-02T22:37:36.883+0000]    Critical: Allocation Stall
>      >>>                0.000 / 0.000         0.000 / 0.000      6714.722 /
>      >>> 3902042.342  6714.722 / 3902042.342   ms
>      >>> [2020-02-02T22:37:36.883+0000]    Critical: Allocation Stall
>      >>>                    0 / 0                 0 / 0               
>     12 / 4115
>      >>>          2 / 4115        ops/s
>      >>> [2020-02-02T22:37:36.883+0000]    Critical: GC Locker Stall
>      >>>               0.000 / 0.000         0.000 / 0.000         3.979
>     / 6.561
>      >>>     1.251 / 6.561       ms
>      >>> [2020-02-02T22:37:36.883+0000]    Critical: GC Locker Stall
>      >>>                   0 / 0                 0 / 0                 0 / 1
>      >>>         0 / 1           ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Allocation Rate
>      >>>                   0 / 0                 6 / 822             762
>     / 25306
>      >>>      1548 / 25306       MB/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used After Mark
>      >>>                    0 / 0             92170 / 92170         89632 /
>      >> 132896
>      >>>      30301 / 132896      MB
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used After
>     Relocation
>      >>>                    0 / 0             76376 / 76376         67490 /
>      >> 132928
>      >>>       8047 / 132928      MB
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used Before Mark
>      >>>                   0 / 0             92128 / 92128         84429
>     / 132896
>      >>>      29452 / 132896      MB
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used Before
>     Relocation
>      >>>                   0 / 0             86340 / 86340         76995
>     / 132896
>      >>>      15862 / 132896      MB
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Out Of Memory
>      >>>                   0 / 0                 0 / 0                 0 / 0
>      >>>         0 / 0           ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Flush
>      >>>                    0 / 0                 0 / 0               
>     62 / 2868
>      >>>         16 / 2868        MB/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Hit L1
>      >>>                   0 / 0                 7 / 2233            277
>     / 11553
>      >>>       583 / 11553       ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Hit L2
>      >>>                   0 / 0                 0 / 0                20
>     / 4619
>      >>>         59 / 4619        ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Miss
>      >>>                   0 / 0                 0 / 0                15
>     / 1039
>      >>>          3 / 1297        ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Undo Object
>     Allocation Failed
>      >>>                   0 / 0                 0 / 0                 0
>     / 24
>      >>>          0 / 24          ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Undo Object Allocation
>      >>> Succeeded                  0 / 0                 0 / 3         
>             1
>      >> /
>      >>> 708               7 / 890         ops/s
>      >>> [2020-02-02T22:37:36.883+0000]      Memory: Undo Page Allocation
>      >>>                    0 / 0                 0 / 12             
>       30 / 3464
>      >>>          7 / 3464        ops/s
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Destroy
>     Detached
>      >>> Pages             0.000 / 0.000         0.004 / 0.004       
>     11.675 /
>      >>> 1484.886      1.155 / 1484.886    ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Mark
>      >>>               0.000 / 0.000      7016.569 / 7016.569  11758.365 /
>      >>> 3901893.544  1103.558 / 3901893.544   ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Mark
>     Continue
>      >>>                0.000 / 0.000         0.000 / 0.000      1968.844 /
>      >> 3674.454
>      >>>   1968.844 / 3674.454    ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Prepare
>     Relocation
>      >>> Set             0.000 / 0.000       453.732 / 453.732     364.535 /
>      >>> 7103.720     39.453 / 7103.720    ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Process
>     Non-Strong
>      >>> References      0.000 / 0.000         2.003 / 2.003         2.738 /
>      >> 34.406
>      >>>        2.253 / 34.406      ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Relocate
>      >>>               0.000 / 0.000       261.822 / 261.822     335.954 /
>      >> 2207.669
>      >>>     45.868 / 2207.669    ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Reset
>     Relocation
>      >> Set
>      >>>               0.000 / 0.000         6.083 / 6.083        13.489 /
>      >> 1128.678
>      >>>      3.574 / 1128.678    ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Select
>     Relocation
>      >>> Set              0.000 / 0.000         6.379 / 6.379       
>     97.530 /
>      >>> 1460.679     18.439 / 1460.679    ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Pause Mark End
>      >>>                0.000 / 0.000         4.420 / 4.420         6.219 /
>      >> 26.498
>      >>>      6.474 / 40.883      ms
>      >>> [2020-02-02T22:37:36.883+0000]       Phase: Pause Mark Start
>      >>>                0.000 / 0.000        14.836 / 14.836       11.893 /
>      >> 28.350
>      >>>     11.664 / 41.767      ms
>      >>> [2020-02-02T22:37:36.884+0000]       Phase: Pause Relocate Start
>      >>>                0.000 / 0.000        13.411 / 13.411       30.849 /
>      >> 697.344
>      >>>      11.995 / 697.344     ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark
>      >>>               0.000 / 0.000      7015.793 / 7016.276  18497.265 /
>      >>> 3901893.075  1690.497 / 3901893.075   ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark Idle
>      >>>                0.000 / 0.000         1.127 / 13.510        1.292 /
>      >> 219.999
>      >>>       1.280 / 219.999     ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark Try
>     Flush
>      >>>               0.000 / 0.000         1.295 / 2.029        47.094 /
>      >> 34869.359
>      >>>     4.797 / 34869.359   ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark Try
>     Terminate
>      >>>               0.000 / 0.000         1.212 / 14.847        1.760 /
>      >> 3799.238
>      >>>      1.724 / 3799.238    ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent
>     References Enqueue
>      >>>               0.000 / 0.000         0.009 / 0.009         0.022
>     / 1.930
>      >>>     0.017 / 2.350       ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent
>     References Process
>      >>>               0.000 / 0.000         0.599 / 0.599         0.768
>     / 23.966
>      >>>      0.495 / 23.966      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots
>      >>>               0.000 / 0.000         0.882 / 1.253         1.155
>     / 21.699
>      >>>      1.077 / 23.602      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots
>      >>> JNIWeakHandles          0.000 / 0.000         0.301 / 0.943
>      >> 0.308 /
>      >>> 10.868        0.310 / 23.219      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots
>      >>> StringTable             0.000 / 0.000         0.289 / 0.496
>      >> 0.390 /
>      >>> 12.794        0.363 / 22.907      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots
>      >>> VMWeakHandles           0.000 / 0.000         0.230 / 0.469
>      >> 0.329 /
>      >>> 21.267        0.331 / 23.135      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Mark Try Complete
>      >>>               0.000 / 0.000         0.000 / 0.000         0.501
>     / 4.801
>      >>>     0.480 / 17.208      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Remap TLABS
>      >>>               0.000 / 0.000         0.252 / 0.252         0.195
>     / 0.528
>      >>>     0.226 / 3.451       ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Retire TLABS
>      >>>                0.000 / 0.000         1.195 / 1.195       
>       1.324 / 5.082
>      >>>       1.408 / 11.219      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots
>      >>>               0.000 / 0.000         6.968 / 10.865       12.329 /
>      >> 693.701
>      >>>     6.431 / 1300.994    ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots
>      >>> ClassLoaderDataGraph              0.000 / 0.000         4.819 /
>     8.232
>      >>>   9.635 / 693.405       3.476 / 693.405     ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots CodeCache
>      >>>               0.000 / 0.000         0.842 / 2.731         0.996
>     / 83.553
>      >>>      0.780 / 83.553      ms
>      >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots JNIHandles
>      >>>                0.000 / 0.000         1.171 / 6.314         0.866 /
>      >> 17.875
>      >>>      0.837 / 25.708      ms
>      >>>
>      >>>
>      >>> Thank you!
>      >>> Prabhash Rathore
>      >>>
>      >>
> 


More information about the zgc-dev mailing list