Information on how to parse/interpret ZGC Logs

Peter Booth peter_booth at me.com
Mon Feb 24 21:02:56 UTC 2020


I asked about THP because I imagined it was probably enabled. 

Now, to be frank, I think that right now your host / app is sick, and that it would be easy to waste time trying alternate configurations / tunings. It looks as though there are currently too many “smart ideas” interacting. I think you might want to consider reverting to a simple, unoptimized, transparent configuration and confirming that it works. Then step by step add settings and measure their impact.

By “simple” I mean removing the non-default settings for GC threads, removing PreTouch, and removing the GC debug log. By “transparent” I mean setting a much lower Xms value, say 10G, so that, as your application runs, you will see from the logs when the heap resizes. This isn’t about tuning but about stability and increasing visibility into the behavior of the app.
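
To make that concrete, a stripped-down command line might look something like this (a sketch, not a prescription; keep whatever application flags you need):

   java -Xms10G -Xmx130G \
        -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
        -XX:+HeapDumpOnOutOfMemoryError \
        -Xlog:gc:file=/somedir/gc.log:time \
        ...

That is: no ConcGCThreads / ParallelGCThreads overrides, no AlwaysPreTouch, and plain info-level GC logging instead of gc*=debug.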

Hope this helps

Peter

Sent from my iPhone

> On Feb 24, 2020, at 7:32 AM, Per Liden <per.liden at oracle.com> wrote:
> 
> Hi,
> 
>> On 2/13/20 8:20 AM, Prabhash Rathore wrote:
>> Thank you Per for your help! It's very helpful.
>> I started my application's GC configuration with default settings; I just had Xmx set. But because of memory allocation stalls and GC pauses, I tuned the concurrent and parallel thread counts, as the defaults didn't seem to be enough.
>> My reasoning for the high parallel thread count (-XX:ParallelGCThreads=78) is that application threads are stalled during a full pause anyway, so more GC threads (for now, 80% of OS threads) can work on collection and keep the GC pause time lower. Similarly, I increased the concurrent thread count from its default to keep the collection rate on par with the allocation rate.
>> You mentioned increasing the heap size when I see Allocation Stalls. The heap is already configured at 80% of RAM. For such allocation stalls, is there anything else I can tune besides the heap size and the concurrent and parallel thread counts?
>> Hi Peter,
>> This application runs on Linux RHEL 7.7; the kernel version is 3.10.0-1062.
>> Output of *egrep "thp|trans" /proc/vmstat:*
>> nr_anon_transparent_hugepages 4722
>> thp_fault_alloc 51664
>> thp_fault_fallback 620147
>> thp_collapse_alloc 11462
>> thp_collapse_alloc_failed 20085
>> thp_split 9350
>> thp_zero_page_alloc 1
>> thp_zero_page_alloc_failed 0
> 
> For best throughput and latency, I'd suggest you disable all use of transparent hugepages and instead configure the kernel hugepage pool and use the -XX:+UseLargePages JVM option.
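> 
> Your vmstat output points the same way: thp_fault_fallback (620147) is
> more than ten times thp_fault_alloc (51664), i.e. most transparent
> hugepage allocations are already falling back to 4K pages. As a rough
> sketch of the explicit-hugepage route (sizes are illustrative; a 130G
> heap needs at least 130G / 2M = 66560 2M pages, plus some headroom):
> 
>   echo never > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 70000 > /proc/sys/vm/nr_hugepages
>   java -XX:+UseLargePages ...
> 
> Persist the settings (e.g. via /etc/sysctl.conf or the kernel command
> line) so they survive a reboot, and note that ZGC on JDK 11 also expects
> a hugetlbfs mount the JVM can use; the OpenJDK ZGC wiki describes the
> details.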
> 
> cheers,
> Per
> 
>> Output of *tail -28 /proc/cpuinfo*
>> power management:
>> processor : 95
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 85
>> model name : Intel(R) Xeon(R) Gold 6263CY CPU @ 2.60GHz
>> stepping : 7
>> microcode : 0x5000021
>> cpu MHz : 3304.431
>> cache size : 33792 KB
>> physical id : 1
>> siblings : 48
>> core id : 29
>> cpu cores : 24
>> apicid : 123
>> initial apicid : 123
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 22
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
>> bogomips : 5206.07
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 46 bits physical, 48 bits virtual
>> power management:
>> On Mon, Feb 10, 2020 at 9:28 AM Peter Booth <peter_booth at me.com> wrote:
>>    Prabhash,
>>    What OS version?
>>    Is it a vanilla OS install?
>>    Can you print the output of the following?
>>    (Assuming Linux)
>>    egrep ‘thp|trans’ /proc/vmstat
>>    tail -28 /proc/cpuinfo
>>    Peter
>>    Sent from my iPhone
>>     > On Feb 9, 2020, at 1:56 AM, Prabhash Rathore <prabhashrathore at gmail.com> wrote:
>>     >
>>     > Hi Per,
>>     >
>>     > Thanks for your reply!
>>     >
>>     > About the ZGC logs, in general I am trying to understand the following:
>>     >
>>     >   - What are the full pause times?
>>     >   - How many such pauses occur per unit time?
>>     >   - Anything else that helps me eliminate GC as the cause of high
>>     >     application latency.
>>     >
>>     > This is how I have configured ZGC logging at the JVM level; I'm
>>     > wondering if I should add other tags, like safepoint, to get more
>>     > details about GC stats:
>>     > -Xlog:gc*=debug:file=gc.log
>>     >
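>>     > (If so, I assume the combined form would look something like
>>     > -Xlog:gc*=debug,safepoint=info:file=gc.log:time, and java -Xlog:help
>>     > lists the available tags.)
>>     >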
>>     > All JVM flags used in my application:
>>     > -Xms130G -Xmx130G
>>     > -Xlog:gc*=debug:file=/somedir/gc.log:time:filecount=10,filesize=67108864
>>     > -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError
>>     > -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ConcGCThreads=48
>>     > -XX:ParallelGCThreads=96 -XX:-OmitStackTraceInFastThrow
>>     >
>>     > It's a large machine with 96 threads and 196 GB RAM.
>>     >
>>     > I also have -XX:+AlwaysPreTouch configured. With AlwaysPreTouch,
>>     > the Linux top command shows very high shared and resident memory.
>>     > My max heap size is configured as 130 GB, but shared memory is
>>     > shown as 388 GB and resident memory as 436 GB. On the other hand,
>>     > total virtual memory for this process in top is shown as 17.1
>>     > terabytes. How is this possible? My whole machine is 196 GB (is
>>     > this accounting for things swapped out to disk?). I did see that
>>     > without AlwaysPreTouch the numbers look close to the heap size. I'm
>>     > trying to understand why, with PreTouch, process memory is shown as
>>     > higher than the configured size. I understand shared memory has all
>>     > the shared libraries mapped in, but how can it be such a large
>>     > size?
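>>     >
>>     > (One plausible explanation, though I'd want to confirm it: ZGC on
>>     > JDK 11 multi-maps the heap, so each heap page shows up at several
>>     > virtual addresses, one per colored view, and Linux counts each
>>     > mapping. top's SHR/RES can then approach 3 x 130 GB, about 390 GB,
>>     > for a fully pre-touched 130 GB heap without any leak; pmap -x <pid>
>>     > shows the individual mappings.)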
>>     >
>>     > Regarding the high GC pause times, I did notice that my machine
>>     > was low on memory and was swapping, which slowed everything down.
>>     > For now I have disabled swapping completely with a kernel VM
>>     > tunable, but I am still trying to find the actual cause of why
>>     > swapping kicked in. This machine only runs this particular Java
>>     > application, which has a 130 GB heap. Other than the heap, I still
>>     > have 66 GB of memory available on the host. I'm trying to figure
>>     > out whether there is a native memory leak. If you have any inputs
>>     > on this, please share.
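>>     >
>>     > (Concretely, what I set was along the lines of sysctl -w
>>     > vm.swappiness=0, persisted in /etc/sysctl.conf.)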
>>     >
>>     > Thanks!
>>     > Prabhash Rathore
>>     >
>>     >> On Mon, Feb 3, 2020 at 2:35 AM Per Liden <per.liden at oracle.com> wrote:
>>     >>
>>     >> Hi,
>>     >>
>>     >>> On 2020-02-03 06:52, Prabhash Rathore wrote:
>>     >>> Hello,
>>     >>>
>>     >>> We have decided to use the ZGC garbage collector for our Java
>>     >>> application running on Java 11. I was wondering if there are any
>>     >>> tools or any documentation on how to interpret ZGC logs.
>>     >>
>>     >> Is there something in particular in the logs you're wondering about?
>>     >>
>>     >>>
>>     >>> I found the following statistics in the ZGC log, which as per my
>>     >>> understanding show a very large allocation stall of 3902042.342
>>     >>> milliseconds. It would be really great to get some help
>>     >>> understanding this further.
>>     >>
>>     >> I can see that you've had marking times that are more than an hour
>>     >> long, which suggests that something in your system is seriously
>>     >> wrong (like extremely overloaded, or an extremely slow disk that
>>     >> you log to... just guessing here). I think you need to take a
>>     >> broader look at the health of the system before we can draw any
>>     >> conclusions from the GC logs.
>>     >>
>>     >> cheers,
>>     >> Per
>>     >>
>>     >>>
>>     >>> [2020-02-02T22:37:36.883+0000] === Garbage Collection Statistics =======================================================================================================================
>>     >>> [2020-02-02T22:37:36.883+0000]                                                              Last 10s                Last 10m                  Last 10h                  Total
>>     >>> [2020-02-02T22:37:36.883+0000]                                                              Avg / Max               Avg / Max                 Avg / Max                 Avg / Max
>>     >>> [2020-02-02T22:37:36.883+0000]   Collector: Garbage Collection Cycle                    0.000 / 0.000      7789.187 / 7789.187     12727.424 / 3903938.012    1265.033 / 3903938.012   ms
>>     >>> [2020-02-02T22:37:36.883+0000]  Contention: Mark Segment Reset Contention                  0 / 0                10 / 1084              176 / 15122               42 / 15122          ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]  Contention: Mark SeqNum Reset Contention                   0 / 0                 0 / 5                    0 / 31                   0 / 31             ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]  Contention: Relocation Contention                          0 / 0                 0 / 3                    1 / 708                  7 / 890            ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]    Critical: Allocation Stall                           0.000 / 0.000         0.000 / 0.000        6714.722 / 3902042.342    6714.722 / 3902042.342   ms
>>     >>> [2020-02-02T22:37:36.883+0000]    Critical: Allocation Stall                               0 / 0                 0 / 0                   12 / 4115                2 / 4115           ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]    Critical: GC Locker Stall                            0.000 / 0.000         0.000 / 0.000           3.979 / 6.561            1.251 / 6.561          ms
>>     >>> [2020-02-02T22:37:36.883+0000]    Critical: GC Locker Stall                                0 / 0                 0 / 0                    0 / 1                    0 / 1              ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Allocation Rate                                0 / 0                 6 / 822                762 / 25306             1548 / 25306          MB/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used After Mark                           0 / 0             92170 / 92170            89632 / 132896           30301 / 132896         MB
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used After Relocation                     0 / 0             76376 / 76376            67490 / 132928            8047 / 132928         MB
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used Before Mark                          0 / 0             92128 / 92128            84429 / 132896           29452 / 132896         MB
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Heap Used Before Relocation                    0 / 0             86340 / 86340            76995 / 132896           15862 / 132896         MB
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Out Of Memory                                  0 / 0                 0 / 0                    0 / 0                    0 / 0              ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Flush                               0 / 0                 0 / 0                   62 / 2868               16 / 2868           MB/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Hit L1                              0 / 0                 7 / 2233               277 / 11553             583 / 11553          ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Hit L2                              0 / 0                 0 / 0                   20 / 4619               59 / 4619           ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Page Cache Miss                                0 / 0                 0 / 0                   15 / 1039                3 / 1297           ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Undo Object Allocation Failed                  0 / 0                 0 / 0                    0 / 24                   0 / 24             ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Undo Object Allocation Succeeded               0 / 0                 0 / 3                    1 / 708                  7 / 890            ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]      Memory: Undo Page Allocation                           0 / 0                 0 / 12                  30 / 3464                7 / 3464           ops/s
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Destroy Detached Pages           0.000 / 0.000         0.004 / 0.004          11.675 / 1484.886         1.155 / 1484.886       ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Mark                             0.000 / 0.000      7016.569 / 7016.569     11758.365 / 3901893.544    1103.558 / 3901893.544   ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Mark Continue                    0.000 / 0.000         0.000 / 0.000        1968.844 / 3674.454       1968.844 / 3674.454      ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Prepare Relocation Set           0.000 / 0.000       453.732 / 453.732       364.535 / 7103.720        39.453 / 7103.720      ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Process Non-Strong References    0.000 / 0.000         2.003 / 2.003           2.738 / 34.406           2.253 / 34.406        ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Relocate                         0.000 / 0.000       261.822 / 261.822       335.954 / 2207.669        45.868 / 2207.669      ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Reset Relocation Set             0.000 / 0.000         6.083 / 6.083          13.489 / 1128.678         3.574 / 1128.678      ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Concurrent Select Relocation Set            0.000 / 0.000         6.379 / 6.379          97.530 / 1460.679        18.439 / 1460.679      ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Pause Mark End                              0.000 / 0.000         4.420 / 4.420           6.219 / 26.498           6.474 / 40.883        ms
>>     >>> [2020-02-02T22:37:36.883+0000]       Phase: Pause Mark Start                            0.000 / 0.000        14.836 / 14.836         11.893 / 28.350          11.664 / 41.767        ms
>>     >>> [2020-02-02T22:37:36.884+0000]       Phase: Pause Relocate Start                        0.000 / 0.000        13.411 / 13.411         30.849 / 697.344         11.995 / 697.344       ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark                             0.000 / 0.000      7015.793 / 7016.276     18497.265 / 3901893.075    1690.497 / 3901893.075   ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark Idle                        0.000 / 0.000         1.127 / 13.510          1.292 / 219.999          1.280 / 219.999       ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark Try Flush                   0.000 / 0.000         1.295 / 2.029          47.094 / 34869.359        4.797 / 34869.359     ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Mark Try Terminate               0.000 / 0.000         1.212 / 14.847          1.760 / 3799.238         1.724 / 3799.238      ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent References Enqueue               0.000 / 0.000         0.009 / 0.009           0.022 / 1.930            0.017 / 2.350         ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent References Process               0.000 / 0.000         0.599 / 0.599           0.768 / 23.966           0.495 / 23.966        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots                       0.000 / 0.000         0.882 / 1.253           1.155 / 21.699           1.077 / 23.602        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots JNIWeakHandles        0.000 / 0.000         0.301 / 0.943           0.308 / 10.868           0.310 / 23.219        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots StringTable           0.000 / 0.000         0.289 / 0.496           0.390 / 12.794           0.363 / 22.907        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Concurrent Weak Roots VMWeakHandles         0.000 / 0.000         0.230 / 0.469           0.329 / 21.267           0.331 / 23.135        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Mark Try Complete                     0.000 / 0.000         0.000 / 0.000           0.501 / 4.801            0.480 / 17.208        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Remap TLABS                           0.000 / 0.000         0.252 / 0.252           0.195 / 0.528            0.226 / 3.451         ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Retire TLABS                          0.000 / 0.000         1.195 / 1.195           1.324 / 5.082            1.408 / 11.219        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots                                 0.000 / 0.000         6.968 / 10.865         12.329 / 693.701          6.431 / 1300.994      ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots ClassLoaderDataGraph            0.000 / 0.000         4.819 / 8.232           9.635 / 693.405          3.476 / 693.405       ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots CodeCache                       0.000 / 0.000         0.842 / 2.731           0.996 / 83.553           0.780 / 83.553        ms
>>     >>> [2020-02-02T22:37:36.884+0000]    Subphase: Pause Roots JNIHandles                      0.000 / 0.000         1.171 / 6.314           0.866 / 17.875           0.837 / 25.708        ms
>>     >>>
>>     >>>
>>     >>> Thank you!
>>     >>> Prabhash Rathore
>>     >>>
>>     >>

