Information on how to parse/interpret ZGC Logs
Prabhash Rathore
prabhashrathore at gmail.com
Thu Feb 13 07:20:40 UTC 2020
Thank you Per for your help! It's very helpful.
I started my application GC configuration with default settings, I just had
Xmx set but because of memory allocation stalls and Gc pauses, I tunes
Concurrent threads and Parallel threads as default options didn't see
enough.
My reasoning for high parallel thread configuration
(-XX:ParallelGCThreads=78) is that application threads are anyway stalled
during full pause so having higher threads (for now 80% of OS threads) can
work on collection. and keep the GC pause time lower. Again I increased
Concurrent threads from default value to keep collection rate on par with
the allocation rate.
You mentioned when I see Allocation Stall, increase heap size. I think I
have already heap configured at 80% of RAM size. For such allocation stall,
is there anything else I can tune other than heap size, concurrent and
parallel thread counts.
Hi Peter,
This application runs on Linux RHEL 7.7. OS. Kernel version is 3.10.0-1062
Output of *egrep "thp|trans" /proc/vmstat:*
nr_anon_transparent_hugepages 4722
thp_fault_alloc 51664
thp_fault_fallback 620147
thp_collapse_alloc 11462
thp_collapse_alloc_failed 20085
thp_split 9350
thp_zero_page_alloc 1
thp_zero_page_alloc_failed 0
Output of *tail -28 /proc/cpuinfo*
power management:
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6263CY CPU @ 2.60GHz
stepping : 7
microcode : 0x5000021
cpu MHz : 3304.431
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 29
cpu cores : 24
apicid : 123
initial apicid : 123
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe
popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba
ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a
avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl
xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips : 5206.07
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
On Mon, Feb 10, 2020 at 9:28 AM Peter Booth <peter_booth at me.com> wrote:
> Prabhash,
>
> What OS version?
> Is it a vanilla OS install?
> Can you print the output of the following?
> (Assuming Linux)
> egrep ‘thp|trans’ /proc/vmstat
> tail -28 /proc/cpuinfo
>
> Peter
>
> Sent from my iPhone
>
> > On Feb 9, 2020, at 1:56 AM, Prabhash Rathore <prabhashrathore at gmail.com>
> wrote:
> >
> > Hi Per,
> >
> > Thanks for your reply!
> >
> > About ZGC logs, in general I am trying to understand following:
> >
> > - What are the full pause times?
> > - How many such pause per unit time?
> > - Anything else which helps me eliminate GC as cause for high
> > application latency.
> >
> > This is how I have configured ZGC logging at JVM level, wondering if I
> > should add other tags like Safepoint to get more details about GC stats:
> > -Xlog:gc*=debug:file=gc.log
> >
> > All JVM flas used in my application:
> > -Xms130G -Xmx130G
> > -Xlog:gc*=debug:file=/somedir/gc.log:time:filecount=10,filesize=67108864
> > -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError
> > -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ConcGCThreads=48
> > -XX:ParallelGCThreads=96 -XX:-OmitStackTraceInFastThrow
> >
> > It's a large machine with 96 threads and 196 GB RAM.
> >
> > I have -XX:+AlwaysPreTouch configured as one another option. With
> > AlwaysPreTouch option, I see in Linux top command shows a very high
> shared
> > and resident memory. My max heap size is configured as 130 GB but I see
> > shared memory is shown as 388 GB and Resident memory as 436 GB. On the
> > other hand, total virtual memory for this process in top is shown as 17.1
> > tera byte. How is this possible? My whole machine size is 196 GB (is this
> > accounting for things swapped out to disk). I did see without
> > AlwaysPretouch, numbers look close to the heap size. Trying to understand
> > why with PreTouch, process memory is shown was higher than configured
> size?
> > I understand shared memory has all shared libs mapped out but how can it
> be
> > such. a large size?
> >
> > Regarding high GC pause time, I did notice that my machine was low on
> > memory and it was swapping, hence slowing down everything. For now I have
> > disabled Swappines completely with Kernel VM tunable but I am still
> trying
> > to find the actual cause of why swapping kicked in. This machine only
> runs
> > this particular Java applicaion which has 130 GB heap size. Other than
> > heap, I still have 66 GB memory available on host. Trying to figure out
> if
> > there is a native memory leak. If you have any inputs on this then please
> > share.
> >
> > Thanks!
> > Prabhash Rathore
> >
> >> On Mon, Feb 3, 2020 at 2:35 AM Per Liden <per.liden at oracle.com> wrote:
> >>
> >> Hi,
> >>
> >>> On 2020-02-03 06:52, Prabhash Rathore wrote:
> >>> Hello,
> >>>
> >>> We have decided to use ZGC Garbage Collector for our Java application
> >>> running on Java 11. I was wondering if there are any tools or any
> >>> documenation on how to interpret ZGC logs.
> >>
> >> Is there something in particular in the logs you're wondering about?
> >>
> >>>
> >>> I found following statistics in ZGC log which as per my understanding
> >> shows
> >>> a very large allocation stall of 3902042.342 milliseconds. It will be
> >>> really great if I can get some help to understand this further.
> >>
> >> I can see that you've had making times that is more than an hour long,
> >> which suggests that something in your system is seriously wrong (like
> >> extremely overloaded or an extremely slow disk that you log to... just
> >> guessing here). I think you need to have a broader look at the health of
> >> the system before we can draw any conclusion from the GC logs.
> >>
> >> cheers,
> >> Per
> >>
> >>>
> >>> [2020-02-02T22:37:36.883+0000] === Garbage Collection Statistics
> >>>
> >>
> =======================================================================================================================
> >>> [2020-02-02T22:37:36.883+0000]
> >>> Last 10s Last 10m Last 10h
> >>> Total
> >>> [2020-02-02T22:37:36.883+0000]
> >>> Avg / Max Avg / Max Avg / Max
> >>> Avg / Max
> >>> [2020-02-02T22:37:36.883+0000] Collector: Garbage Collection Cycle
> >>> 0.000 / 0.000 7789.187 / 7789.187 12727.424 /
> >>> 3903938.012 1265.033 / 3903938.012 ms
> >>> [2020-02-02T22:37:36.883+0000] Contention: Mark Segment Reset
> Contention
> >>> 0 / 0 10 / 1084 176 / 15122
> >>> 42 / 15122 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Contention: Mark SeqNum Reset
> Contention
> >>> 0 / 0 0 / 5 0 / 31
> >>> 0 / 31 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Contention: Relocation Contention
> >>> 0 / 0 0 / 3 1 / 708
> >>> 7 / 890 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Critical: Allocation Stall
> >>> 0.000 / 0.000 0.000 / 0.000 6714.722 /
> >>> 3902042.342 6714.722 / 3902042.342 ms
> >>> [2020-02-02T22:37:36.883+0000] Critical: Allocation Stall
> >>> 0 / 0 0 / 0 12 / 4115
> >>> 2 / 4115 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Critical: GC Locker Stall
> >>> 0.000 / 0.000 0.000 / 0.000 3.979 / 6.561
> >>> 1.251 / 6.561 ms
> >>> [2020-02-02T22:37:36.883+0000] Critical: GC Locker Stall
> >>> 0 / 0 0 / 0 0 / 1
> >>> 0 / 1 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Allocation Rate
> >>> 0 / 0 6 / 822 762 / 25306
> >>> 1548 / 25306 MB/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used After Mark
> >>> 0 / 0 92170 / 92170 89632 /
> >> 132896
> >>> 30301 / 132896 MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used After Relocation
> >>> 0 / 0 76376 / 76376 67490 /
> >> 132928
> >>> 8047 / 132928 MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used Before Mark
> >>> 0 / 0 92128 / 92128 84429 /
> 132896
> >>> 29452 / 132896 MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Heap Used Before Relocation
> >>> 0 / 0 86340 / 86340 76995 /
> 132896
> >>> 15862 / 132896 MB
> >>> [2020-02-02T22:37:36.883+0000] Memory: Out Of Memory
> >>> 0 / 0 0 / 0 0 / 0
> >>> 0 / 0 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Flush
> >>> 0 / 0 0 / 0 62 / 2868
> >>> 16 / 2868 MB/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Hit L1
> >>> 0 / 0 7 / 2233 277 / 11553
> >>> 583 / 11553 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Hit L2
> >>> 0 / 0 0 / 0 20 / 4619
> >>> 59 / 4619 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Page Cache Miss
> >>> 0 / 0 0 / 0 15 / 1039
> >>> 3 / 1297 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Undo Object Allocation
> Failed
> >>> 0 / 0 0 / 0 0 / 24
> >>> 0 / 24 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Undo Object Allocation
> >>> Succeeded 0 / 0 0 / 3
> 1
> >> /
> >>> 708 7 / 890 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Memory: Undo Page Allocation
> >>> 0 / 0 0 / 12 30 / 3464
> >>> 7 / 3464 ops/s
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Destroy Detached
> >>> Pages 0.000 / 0.000 0.004 / 0.004 11.675 /
> >>> 1484.886 1.155 / 1484.886 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Mark
> >>> 0.000 / 0.000 7016.569 / 7016.569 11758.365 /
> >>> 3901893.544 1103.558 / 3901893.544 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Mark Continue
> >>> 0.000 / 0.000 0.000 / 0.000 1968.844 /
> >> 3674.454
> >>> 1968.844 / 3674.454 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Prepare
> Relocation
> >>> Set 0.000 / 0.000 453.732 / 453.732 364.535 /
> >>> 7103.720 39.453 / 7103.720 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Process
> Non-Strong
> >>> References 0.000 / 0.000 2.003 / 2.003 2.738 /
> >> 34.406
> >>> 2.253 / 34.406 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Relocate
> >>> 0.000 / 0.000 261.822 / 261.822 335.954 /
> >> 2207.669
> >>> 45.868 / 2207.669 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Reset Relocation
> >> Set
> >>> 0.000 / 0.000 6.083 / 6.083 13.489 /
> >> 1128.678
> >>> 3.574 / 1128.678 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Concurrent Select
> Relocation
> >>> Set 0.000 / 0.000 6.379 / 6.379 97.530 /
> >>> 1460.679 18.439 / 1460.679 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Pause Mark End
> >>> 0.000 / 0.000 4.420 / 4.420 6.219 /
> >> 26.498
> >>> 6.474 / 40.883 ms
> >>> [2020-02-02T22:37:36.883+0000] Phase: Pause Mark Start
> >>> 0.000 / 0.000 14.836 / 14.836 11.893 /
> >> 28.350
> >>> 11.664 / 41.767 ms
> >>> [2020-02-02T22:37:36.884+0000] Phase: Pause Relocate Start
> >>> 0.000 / 0.000 13.411 / 13.411 30.849 /
> >> 697.344
> >>> 11.995 / 697.344 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark
> >>> 0.000 / 0.000 7015.793 / 7016.276 18497.265 /
> >>> 3901893.075 1690.497 / 3901893.075 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark Idle
> >>> 0.000 / 0.000 1.127 / 13.510 1.292 /
> >> 219.999
> >>> 1.280 / 219.999 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark Try Flush
> >>> 0.000 / 0.000 1.295 / 2.029 47.094 /
> >> 34869.359
> >>> 4.797 / 34869.359 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Mark Try
> Terminate
> >>> 0.000 / 0.000 1.212 / 14.847 1.760 /
> >> 3799.238
> >>> 1.724 / 3799.238 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent References
> Enqueue
> >>> 0.000 / 0.000 0.009 / 0.009 0.022 / 1.930
> >>> 0.017 / 2.350 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent References
> Process
> >>> 0.000 / 0.000 0.599 / 0.599 0.768 /
> 23.966
> >>> 0.495 / 23.966 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots
> >>> 0.000 / 0.000 0.882 / 1.253 1.155 /
> 21.699
> >>> 1.077 / 23.602 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots
> >>> JNIWeakHandles 0.000 / 0.000 0.301 / 0.943
> >> 0.308 /
> >>> 10.868 0.310 / 23.219 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots
> >>> StringTable 0.000 / 0.000 0.289 / 0.496
> >> 0.390 /
> >>> 12.794 0.363 / 22.907 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Concurrent Weak Roots
> >>> VMWeakHandles 0.000 / 0.000 0.230 / 0.469
> >> 0.329 /
> >>> 21.267 0.331 / 23.135 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Mark Try Complete
> >>> 0.000 / 0.000 0.000 / 0.000 0.501 / 4.801
> >>> 0.480 / 17.208 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Remap TLABS
> >>> 0.000 / 0.000 0.252 / 0.252 0.195 / 0.528
> >>> 0.226 / 3.451 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Retire TLABS
> >>> 0.000 / 0.000 1.195 / 1.195 1.324 /
> 5.082
> >>> 1.408 / 11.219 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots
> >>> 0.000 / 0.000 6.968 / 10.865 12.329 /
> >> 693.701
> >>> 6.431 / 1300.994 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots
> >>> ClassLoaderDataGraph 0.000 / 0.000 4.819 / 8.232
> >>> 9.635 / 693.405 3.476 / 693.405 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots CodeCache
> >>> 0.000 / 0.000 0.842 / 2.731 0.996 /
> 83.553
> >>> 0.780 / 83.553 ms
> >>> [2020-02-02T22:37:36.884+0000] Subphase: Pause Roots JNIHandles
> >>> 0.000 / 0.000 1.171 / 6.314 0.866 /
> >> 17.875
> >>> 0.837 / 25.708 ms
> >>>
> >>>
> >>> Thank you!
> >>> Prabhash Rathore
> >>>
> >>
>
>
More information about the zgc-dev
mailing list