Information on how to parse/interpret ZGC Logs
Per Liden
per.liden at oracle.com
Mon Feb 24 12:32:46 UTC 2020
Hi,
On 2/13/20 8:20 AM, Prabhash Rathore wrote:
> Thank you Per for your help! It's very helpful.
>
> I started my application's GC configuration with default settings; I
> just had -Xmx set. But because of memory allocation stalls and GC
> pauses, I tuned the concurrent and parallel thread counts, as the
> default options didn't seem to be enough.
>
> My reasoning for the high parallel thread configuration
> (-XX:ParallelGCThreads=78) is that application threads are stalled
> during a full pause anyway, so having more threads (for now, 80% of OS
> threads) working on collection keeps the GC pause time lower. Likewise,
> I increased the concurrent thread count from its default value to keep
> the collection rate on par with the allocation rate.
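>
> For reference, a sketch of the kind of command line this tuning leads
> to (the exact counts are workload-dependent; these are just the values
> I am experimenting with):
>
>     java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
>          -XX:ConcGCThreads=48 -XX:ParallelGCThreads=78 \
>          -Xms130G -Xmx130G ...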
>
> You mentioned that when I see an Allocation Stall, I should increase
> the heap size. I think I already have the heap configured at 80% of RAM
> size. For such allocation stalls, is there anything else I can tune
> other than the heap size and the concurrent and parallel thread counts?
>
>
> Hi Peter,
>
> This application runs on RHEL 7.7 Linux; the kernel version is 3.10.0-1062.
>
> Output of *egrep "thp|trans" /proc/vmstat:*
> nr_anon_transparent_hugepages 4722
> thp_fault_alloc 51664
> thp_fault_fallback 620147
> thp_collapse_alloc 11462
> thp_collapse_alloc_failed 20085
> thp_split 9350
> thp_zero_page_alloc 1
> thp_zero_page_alloc_failed 0
For best throughput and latency, I'd suggest you disable all use of
transparent hugepages and instead configure the kernel hugepage pool and
use the -XX:+UseLargePages JVM option.
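
For example, something along these lines (just a sketch, assuming RHEL 7
defaults; size the pool so it at least covers your 130G heap, the exact
number below is arbitrary):

  # Disable transparent hugepages
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag

  # Reserve 2M hugepages for the heap plus some headroom,
  # e.g. 68608 * 2M = ~134G
  echo 68608 > /proc/sys/vm/nr_hugepages

and then start the JVM with -XX:+UseLargePages. Note that on an older
kernel like yours, ZGC may additionally need a hugetlbfs mount point.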
cheers,
Per
>
>
>
> Output of *tail -28 /proc/cpuinfo*
> power management:
>
> processor : 95
> vendor_id : GenuineIntel
> cpu family : 6
> model : 85
> model name : Intel(R) Xeon(R) Gold 6263CY CPU @ 2.60GHz
> stepping : 7
> microcode : 0x5000021
> cpu MHz : 3304.431
> cache size : 33792 KB
> physical id : 1
> siblings : 48
> core id : 29
> cpu cores : 24
> apicid : 123
> initial apicid : 123
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin
> intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
> invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
> clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc
> cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp
> hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear
> spec_ctrl intel_stibp flush_l1d arch_capabilities
> bogomips : 5206.07
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> On Mon, Feb 10, 2020 at 9:28 AM Peter Booth <peter_booth at me.com> wrote:
>
> Prabhash,
>
> What OS version?
> Is it a vanilla OS install?
> Can you print the output of the following?
> (Assuming Linux)
> egrep 'thp|trans' /proc/vmstat
> tail -28 /proc/cpuinfo
>
> Peter
>
> Sent from my iPhone
>
> > On Feb 9, 2020, at 1:56 AM, Prabhash Rathore <prabhashrathore at gmail.com> wrote:
> >
> > Hi Per,
> >
> > Thanks for your reply!
> >
> > About ZGC logs, in general I am trying to understand the following:
> >
> > - What are the full pause times?
> > - How many such pauses occur per unit time?
> > - Anything else which helps me eliminate GC as the cause of high
> > application latency.
> >
> > This is how I have configured ZGC logging at the JVM level; I'm
> > wondering if I should add other tags, like safepoint, to get more
> > details about GC stats:
> > -Xlog:gc*=debug:file=gc.log
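> >
> > For example, something like this is what I have in mind (safepoint is
> > a standard -Xlog tag; the decorator and rotation settings mirror my
> > current config):
> >
> > -Xlog:gc*=debug,safepoint=info:file=gc.log:time:filecount=10,filesize=67108864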
> >
> > All JVM flags used in my application:
> > -Xms130G -Xmx130G
> > -Xlog:gc*=debug:file=/somedir/gc.log:time:filecount=10,filesize=67108864
> > -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError
> > -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ConcGCThreads=48
> > -XX:ParallelGCThreads=96 -XX:-OmitStackTraceInFastThrow
> >
> > It's a large machine with 96 hardware threads and 196 GB of RAM.
> >
> > I have -XX:+AlwaysPreTouch configured as another option. With
> > AlwaysPreTouch, the Linux top command shows very high shared and
> > resident memory. My max heap size is configured as 130 GB, but I see
> > shared memory shown as 388 GB and resident memory as 436 GB. On the
> > other hand, the total virtual memory for this process in top is shown
> > as 17.1 terabytes. How is this possible? My whole machine is 196 GB
> > (is this accounting for things swapped out to disk?). I did see that
> > without AlwaysPreTouch, the numbers look close to the heap size. I'm
> > trying to understand why, with pre-touch, the process memory is shown
> > as higher than the configured size. I understand shared memory has all
> > shared libs mapped, but how can it be such a large size?
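> >
> > (Could this be because ZGC maps the same physical heap at multiple
> > virtual addresses, so tools that sum mapped regions over-count? As a
> > sketch, something like
> >
> > pmap -x <pid> | sort -n -k3 | tail
> >
> > should show which mappings dominate the numbers top reports.)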
> >
> > Regarding the high GC pause times, I did notice that my machine was
> > low on memory and was swapping, hence slowing everything down. For now
> > I have disabled swappiness completely with the kernel VM tunable, but
> > I am still trying to find the actual cause of why swapping kicked in.
> > This machine only runs this particular Java application, which has a
> > 130 GB heap. Other than the heap, I still have 66 GB of memory
> > available on the host. I'm trying to figure out if there is a native
> > memory leak. If you have any inputs on this, then please share.
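> >
> > (Concretely, what I did for swap, as a sketch:
> > sysctl -w vm.swappiness=0, made persistent in /etc/sysctl.conf. For
> > the native-memory question, one option I'm considering is restarting
> > the JVM with -XX:NativeMemoryTracking=summary and then inspecting it
> > with: jcmd <pid> VM.native_memory summary)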
> >
> > Thanks!
> > Prabhash Rathore
> >
> >> On Mon, Feb 3, 2020 at 2:35 AM Per Liden <per.liden at oracle.com> wrote:
> >>
> >> Hi,
> >>
> >>> On 2020-02-03 06:52, Prabhash Rathore wrote:
> >>> Hello,
> >>>
> >>> We have decided to use the ZGC garbage collector for our Java
> >>> application running on Java 11. I was wondering if there are any
> >>> tools or any documentation on how to interpret ZGC logs.
> >>
> >> Is there something in particular in the logs you're wondering about?
> >>
> >>>
> >>> I found the following statistics in the ZGC log, which as per my
> >>> understanding show a very large allocation stall of 3902042.342
> >>> milliseconds. It would be really great if I could get some help to
> >>> understand this further.
> >>
> >> I can see that you've had marking times that are more than an hour
> >> long, which suggests that something in your system is seriously
> >> wrong (like being extremely overloaded, or an extremely slow disk
> >> that you log to... just guessing here). I think you need to take a
> >> broader look at the health of the system before we can draw any
> >> conclusions from the GC logs.
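> >>
> >> (As a sketch of what I mean by a broader look: watching something
> >> like "vmstat 1" for swap activity and run-queue pressure, and
> >> "iostat -x 1" for latency on the device the log is written to,
> >> while one of these long GC cycles is happening.)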
> >>
> >> cheers,
> >> Per
> >>
> >>>
> >>> [2020-02-02T22:37:36.883+0000] === Garbage Collection Statistics =======================================================================================================================
> >>> [2020-02-02T22:37:36.883+0000]                                                      Last 10s                Last 10m                Last 10h                Total
> >>> [2020-02-02T22:37:36.883+0000]                                                      Avg / Max               Avg / Max               Avg / Max               Avg / Max
> >>> [2020-02-02T22:37:36.883+0000] Collector: Garbage Collection Cycle                  0.000 / 0.000           7789.187 / 7789.187     12727.424 / 3903938.012   1265.033 / 3903938.012   ms
> >>> [2020-02-02T22:37:36.883+0000] Contention: Mark Segment Reset Contention            0 / 0                   10 / 1084               176 / 15122             42 / 15122              ops/s
> >>> [2020-02-02T22:37:36.883+0000] Contention: Mark SeqNum Reset Contention             0 / 0                   0 / 5                   0 / 31                  0 / 31                  ops/s
> >>> [2020-02-02T22:37:36.883+0000] Contention: Relocation Contention                    0 / 0                   0 / 3                   1 / 708                 7 / 890                 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Critical: Allocation Stall                           0.000 / 0.000           0.000 / 0.000           6714.722 / 3902042.342   6714.722 / 3902042.342   ms
> >>> [2020-02-02T22:37:36.883+0000] Critical: Allocation Stall                           0 / 0                   0 / 0                   12 / 4115               2 / 4115                ops/s
> >>> [2020-02-02T22:37:36.883+0000] Critical: GC Locker Stall                            0.000 / 0.000           0.000 / 0.000           3.979 / 6.561           1.251 / 6.561           ms
> >>> [2020-02-02T22:37:36.883+0000] Critical: GC Locker Stall                            0 / 0                   0 / 0                   0 / 1                   0 / 1                   ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Allocation Rate                              0 / 0                   6 / 822                 762 / 25306             1548 / 25306            MB/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used After Mark                         0 / 0                   92170 / 92170           89632 / 132896          30301 / 132896          MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used After Relocation                   0 / 0                   76376 / 76376           67490 / 132928          8047 / 132928           MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used Before Mark                        0 / 0                   92128 / 92128           84429 / 132896          29452 / 132896          MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used Before Relocation                  0 / 0                   86340 / 86340           76995 / 132896          15862 / 132896          MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Out Of Memory                                0 / 0                   0 / 0                   0 / 0                   0 / 0                   ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Flush                             0 / 0                   0 / 0                   62 / 2868               16 / 2868               MB/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Hit L1                            0 / 0                   7 / 2233                277 / 11553             583 / 11553             ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Hit L2                            0 / 0                   0 / 0                   20 / 4619               59 / 4619               ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Miss                              0 / 0                   0 / 0                   15 / 1039               3 / 1297                ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Undo Object Allocation Failed                0 / 0                   0 / 0                   0 / 24                  0 / 24                  ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Undo Object Allocation Succeeded             0 / 0                   0 / 3                   1 / 708                 7 / 890                 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Undo Page Allocation                         0 / 0                   0 / 12                  30 / 3464               7 / 3464                ops/s
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Destroy Detached Pages             0.000 / 0.000           0.004 / 0.004           11.675 / 1484.886       1.155 / 1484.886        ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Mark                               0.000 / 0.000           7016.569 / 7016.569     11758.365 / 3901893.544   1103.558 / 3901893.544   ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Mark Continue                      0.000 / 0.000           0.000 / 0.000           1968.844 / 3674.454     1968.844 / 3674.454     ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Prepare Relocation Set             0.000 / 0.000           453.732 / 453.732       364.535 / 7103.720      39.453 / 7103.720       ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Process Non-Strong References      0.000 / 0.000           2.003 / 2.003           2.738 / 34.406          2.253 / 34.406          ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Relocate                           0.000 / 0.000           261.822 / 261.822       335.954 / 2207.669      45.868 / 2207.669       ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Reset Relocation Set               0.000 / 0.000           6.083 / 6.083           13.489 / 1128.678       3.574 / 1128.678        ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Select Relocation Set              0.000 / 0.000           6.379 / 6.379           97.530 / 1460.679       18.439 / 1460.679       ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Pause Mark End                                0.000 / 0.000           4.420 / 4.420           6.219 / 26.498          6.474 / 40.883          ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Pause Mark Start                              0.000 / 0.000           14.836 / 14.836         11.893 / 28.350         11.664 / 41.767         ms
> >>> [2020-02-02T22:37:36.884+0000] Phase: Pause Relocate Start                          0.000 / 0.000           13.411 / 13.411         30.849 / 697.344        11.995 / 697.344        ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark                            0.000 / 0.000           7015.793 / 7016.276     18497.265 / 3901893.075   1690.497 / 3901893.075   ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark Idle                       0.000 / 0.000           1.127 / 13.510          1.292 / 219.999         1.280 / 219.999         ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark Try Flush                  0.000 / 0.000           1.295 / 2.029           47.094 / 34869.359      4.797 / 34869.359       ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark Try Terminate              0.000 / 0.000           1.212 / 14.847          1.760 / 3799.238        1.724 / 3799.238        ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent References Enqueue              0.000 / 0.000           0.009 / 0.009           0.022 / 1.930           0.017 / 2.350           ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent References Process              0.000 / 0.000           0.599 / 0.599           0.768 / 23.966          0.495 / 23.966          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots                      0.000 / 0.000           0.882 / 1.253           1.155 / 21.699          1.077 / 23.602          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots JNIWeakHandles       0.000 / 0.000           0.301 / 0.943           0.308 / 10.868          0.310 / 23.219          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots StringTable          0.000 / 0.000           0.289 / 0.496           0.390 / 12.794          0.363 / 22.907          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots VMWeakHandles        0.000 / 0.000           0.230 / 0.469           0.329 / 21.267          0.331 / 23.135          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Mark Try Complete                    0.000 / 0.000           0.000 / 0.000           0.501 / 4.801           0.480 / 17.208          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Remap TLABS                          0.000 / 0.000           0.252 / 0.252           0.195 / 0.528           0.226 / 3.451           ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Retire TLABS                         0.000 / 0.000           1.195 / 1.195           1.324 / 5.082           1.408 / 11.219          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots                                0.000 / 0.000           6.968 / 10.865          12.329 / 693.701        6.431 / 1300.994        ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots ClassLoaderDataGraph           0.000 / 0.000           4.819 / 8.232           9.635 / 693.405         3.476 / 693.405         ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots CodeCache                      0.000 / 0.000           0.842 / 2.731           0.996 / 83.553          0.780 / 83.553          ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots JNIHandles                     0.000 / 0.000           1.171 / 6.314           0.866 / 17.875          0.837 / 25.708          ms
> >>>
> >>>
> >>> Thank you!
> >>> Prabhash Rathore
> >>>
> >>
>