zgc-dev Digest, Vol 26, Issue 4
Per Liden
per.liden at oracle.com
Mon Mar 2 10:20:09 UTC 2020
Hi Pierre,
On 2/13/20 2:58 PM, Pierre Mevel wrote:
> Good morning,
>
> Following on "Information on how to parse/interpret ZGC Logs",
> I did get the same issues back in October. (
> http://mail.openjdk.java.net/pipermail/zgc-dev/2019-October/000779.html for
> the curious).
>
> Basically, our application runs on relatively big servers, and allocates
> memory at a very high pace.
>
> We get enormous allocation stalls with ZGC, and increasing the number of
> threads running will simply delay the first allocation stalls, not resolve
> the issue.
> Because ZGC is almost entirely concurrent, the application still allocates
> memory during the Concurrent Relocation phase.
> We have two root issues that clash with each other:
> 1. The allocation rate can be much higher than the reclamation rate (which
> makes us want to give more OS resources to the GC).
> 2. The allocation rate can vary greatly (when it is low, we do not want many
> threads running concurrent phases; when it is high, it is because clients
> want answers from the system, and we cannot afford a long Allocation Stall).
>
> As I tried to suggest in my previous mail in October, the worker count could
> be boosted to a higher value (Max(ParallelGCThreads, ConcGCThreads), maybe?)
> as soon as an Allocation Stall is triggered, and restored to normal at the
> end of the phase (maybe?).
Changing the number of worker threads in the middle of a GC cycle is
non-trivial (and I'm not even sure it would be the right solution for
this). If I understand your problem correctly, the root of it is that ZGC's
heuristics don't have a long enough memory. I.e. after a long period of low
allocation rate, they will not remember (and take into account) that huge
allocation spikes happen now and then. As I mentioned earlier, you can try
using -XX:SoftMaxHeapSize to tell ZGC to collect earlier, which increases
the safety margin and helps avoid allocation stalls. Did you try that?
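
For example, something along these lines (the numbers are purely
illustrative, "MyApp" is just a placeholder, and note that
-XX:SoftMaxHeapSize is only available in JDK 13 and later):

  # Keep -Xmx as the hard limit; the soft limit makes ZGC start collecting
  # earlier, leaving headroom for allocation spikes.
  java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
       -Xms16G -Xmx16G \
       -XX:SoftMaxHeapSize=12G \
       -Xlog:gc*:file=gc.log:time \
       MyApp
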
cheers,
Per
>
> Apologies for having taken up your time,
> Best Regards,
> Pierre Mével
>
>
>
> On Thu, Feb 13, 2020 at 1:03 PM <zgc-dev-request at openjdk.java.net> wrote:
>
>>
>> Today's Topics:
>>
>> 1. Re: Information on how to parse/interpret ZGC Logs
>> (Prabhash Rathore)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 12 Feb 2020 23:20:40 -0800
>> From: Prabhash Rathore <prabhashrathore at gmail.com>
>> To: Peter Booth <peter_booth at me.com>
>> Cc: zgc-dev at openjdk.java.net
>> Subject: Re: Information on how to parse/interpret ZGC Logs
>>
>> Thank you Per for your help! It's very helpful.
>>
>> I started my application's GC configuration with default settings; I just
>> had Xmx set. But because of memory allocation stalls and GC pauses, I tuned
>> the concurrent and parallel thread counts, as the defaults didn't seem to
>> be enough.
>>
>> My reasoning for the high parallel thread configuration
>> (-XX:ParallelGCThreads=78) is that application threads are stalled anyway
>> during a full pause, so having more threads (for now 80% of OS threads)
>> work on collection should keep the GC pause time lower. Similarly, I
>> increased the concurrent thread count from its default value to keep the
>> collection rate on par with the allocation rate.
>>
>> You mentioned that when I see an Allocation Stall, I should increase the
>> heap size. I already have the heap configured at 80% of RAM. For such
>> allocation stalls, is there anything else I can tune other than the heap
>> size and the concurrent and parallel thread counts?
>>
>>
>> Hi Peter,
>>
>> This application runs on Linux RHEL 7.7. The kernel version is 3.10.0-1062.
>>
>> Output of *egrep "thp|trans" /proc/vmstat:*
>> nr_anon_transparent_hugepages 4722
>> thp_fault_alloc 51664
>> thp_fault_fallback 620147
>> thp_collapse_alloc 11462
>> thp_collapse_alloc_failed 20085
>> thp_split 9350
>> thp_zero_page_alloc 1
>> thp_zero_page_alloc_failed 0
>>
>>
>>
>> Output of *tail -28 /proc/cpuinfo*
>> power management:
>>
>> processor : 95
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 85
>> model name : Intel(R) Xeon(R) Gold 6263CY CPU @ 2.60GHz
>> stepping : 7
>> microcode : 0x5000021
>> cpu MHz : 3304.431
>> cache size : 33792 KB
>> physical id : 1
>> siblings : 48
>> core id : 29
>> cpu cores : 24
>> apicid : 123
>> initial apicid : 123
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 22
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
>> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
>> rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
>> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
>> est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe
>> popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
>> 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba
>> ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
>> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a
>> avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl
>> xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
>> dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
>> avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
>> bogomips : 5206.07
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 46 bits physical, 48 bits virtual
>> power management:
>>
>>
>>
>>
>>
>> On Mon, Feb 10, 2020 at 9:28 AM Peter Booth <peter_booth at me.com> wrote:
>>
>>> Prabhash,
>>>
>>> What OS version?
>>> Is it a vanilla OS install?
>>> Can you print the output of the following?
>>> (Assuming Linux)
>>> egrep "thp|trans" /proc/vmstat
>>> tail -28 /proc/cpuinfo
>>>
>>> Peter
>>>
>>> Sent from my iPhone
>>>
>>>> On Feb 9, 2020, at 1:56 AM, Prabhash Rathore <prabhashrathore at gmail.com> wrote:
>>>>
>>>> Hi Per,
>>>>
>>>> Thanks for your reply!
>>>>
>>>> About the ZGC logs, in general I am trying to understand the following:
>>>>
>>>> - What are the full pause times?
>>>> - How many such pauses are there per unit of time?
>>>> - Anything else which helps me eliminate GC as the cause of high
>>>>   application latency.
>>>>
>>>> This is how I have configured ZGC logging at the JVM level, and I am
>>>> wondering if I should add other tags like safepoint to get more details
>>>> about GC stats:
>>>> -Xlog:gc*=debug:file=gc.log
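>>>>
>>>> For example, I am guessing something along these lines would also capture
>>>> safepoint times (the syntax is just my reading of the unified logging
>>>> options, so treat it as a sketch rather than something I have verified):
>>>>
>>>> # GC details plus safepoint timings, timestamped, into a rotating file
>>>> -Xlog:gc*=debug,safepoint=info:file=gc.log:time,uptime,level,tags:filecount=10,filesize=67108864
>>>>
>>>> # ZGC pause phases could then be pulled out of the log with, e.g.:
>>>> grep -E "Pause (Mark Start|Mark End|Relocate Start)" gc.log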
>>>>
>>>> All JVM flags used in my application:
>>>> -Xms130G -Xmx130G
>>>> -Xlog:gc*=debug:file=/somedir/gc.log:time:filecount=10,filesize=67108864
>>>> -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError
>>>> -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ConcGCThreads=48
>>>> -XX:ParallelGCThreads=96 -XX:-OmitStackTraceInFastThrow
>>>>
>>>> It's a large machine with 96 threads and 196 GB RAM.
>>>>
>>>> I also have -XX:+AlwaysPreTouch configured. With the AlwaysPreTouch
>>>> option, the Linux top command shows very high shared and resident memory.
>>>> My max heap size is configured as 130 GB, but shared memory is shown as
>>>> 388 GB and resident memory as 436 GB. On the other hand, the total virtual
>>>> memory for this process in top is shown as 17.1 terabytes. How is this
>>>> possible? My whole machine is 196 GB (is this accounting for things
>>>> swapped out to disk?). I did see that without AlwaysPreTouch, the numbers
>>>> look close to the heap size. I am trying to understand why, with
>>>> pre-touch, the process memory is shown as higher than the configured size.
>>>> I understand that shared memory has all the shared libraries mapped, but
>>>> how can it be such a large size?
>>>>
>>>> Regarding the high GC pause times, I did notice that my machine was low on
>>>> memory and it was swapping, which slowed everything down. For now I have
>>>> disabled swapping completely with the kernel VM swappiness tunable, but I
>>>> am still trying to find the actual cause of why swapping kicked in. This
>>>> machine only runs this particular Java application, which has a 130 GB
>>>> heap. Other than the heap, I still have 66 GB of memory available on the
>>>> host. I am trying to figure out if there is a native memory leak. If you
>>>> have any inputs on this then please share.
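>>>>
>>>> By the kernel VM tunable I mean the standard swappiness setting, roughly
>>>> as follows (commands shown only as an illustration):
>>>>
>>>> # discourage the kernel from swapping out anonymous memory
>>>> sysctl -w vm.swappiness=0
>>>>
>>>> # watch the si/so columns to confirm swap activity has stopped
>>>> vmstat 1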
>>>>
>>>> Thanks!
>>>> Prabhash Rathore
>>>>
>>>>> On Mon, Feb 3, 2020 at 2:35 AM Per Liden <per.liden at oracle.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> On 2020-02-03 06:52, Prabhash Rathore wrote:
>>>>>> Hello,
>>>>>>
>>>>>> We have decided to use the ZGC garbage collector for our Java application
>>>>>> running on Java 11. I was wondering if there are any tools or any
>>>>>> documentation on how to interpret ZGC logs.
>>>>>
>>>>> Is there something in particular in the logs you're wondering about?
>>>>>
>>>>>>
>>>>>> I found the following statistics in the ZGC log, which as per my
>>>>>> understanding show a very large allocation stall of 3902042.342
>>>>>> milliseconds. It would be really great if I could get some help to
>>>>>> understand this further.
>>>>>
>>>>> I can see that you've had marking times that are more than an hour long,
>>>>> which suggests that something in your system is seriously wrong (like it
>>>>> being extremely overloaded, or an extremely slow disk that you log to...
>>>>> just guessing here). I think you need to have a broader look at the
>>>>> health of the system before we can draw any conclusions from the GC logs.
>>>>>
>>>>> cheers,
>>>>> Per
>>>>>
>>>>>>
>>>>>> [2020-02-02T22:37:36.883+0000] === Garbage Collection Statistics =====================================================================================================
>>>>>> [2020-02-02T22:37:36.883+0000]                                                        Last 10s             Last 10m               Last 10h                   Total
>>>>>> [2020-02-02T22:37:36.883+0000]                                                        Avg / Max            Avg / Max              Avg / Max                  Avg / Max
>>>>>> [2020-02-02T22:37:36.883+0000]   Collector: Garbage Collection Cycle                  0.000 / 0.000        7789.187 / 7789.187    12727.424 / 3903938.012    1265.033 / 3903938.012    ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Contention: Mark Segment Reset Contention            0 / 0                10 / 1084              176 / 15122                42 / 15122                ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Contention: Mark SeqNum Reset Contention             0 / 0                0 / 5                  0 / 31                     0 / 31                    ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Contention: Relocation Contention                    0 / 0                0 / 3                  1 / 708                    7 / 890                   ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Critical: Allocation Stall                           0.000 / 0.000        0.000 / 0.000          6714.722 / 3902042.342    6714.722 / 3902042.342    ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Critical: Allocation Stall                           0 / 0                0 / 0                  12 / 4115                  2 / 4115                  ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Critical: GC Locker Stall                            0.000 / 0.000        0.000 / 0.000          3.979 / 6.561              1.251 / 6.561             ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Critical: GC Locker Stall                            0 / 0                0 / 0                  0 / 1                      0 / 1                     ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Allocation Rate                              0 / 0                6 / 822                762 / 25306                1548 / 25306              MB/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Heap Used After Mark                         0 / 0                92170 / 92170          89632 / 132896             30301 / 132896            MB
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Heap Used After Relocation                   0 / 0                76376 / 76376          67490 / 132928             8047 / 132928             MB
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Heap Used Before Mark                        0 / 0                92128 / 92128          84429 / 132896             29452 / 132896            MB
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Heap Used Before Relocation                  0 / 0                86340 / 86340          76995 / 132896             15862 / 132896            MB
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Out Of Memory                                0 / 0                0 / 0                  0 / 0                      0 / 0                     ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Page Cache Flush                             0 / 0                0 / 0                  62 / 2868                  16 / 2868                 MB/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Page Cache Hit L1                            0 / 0                7 / 2233               277 / 11553                583 / 11553               ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Page Cache Hit L2                            0 / 0                0 / 0                  20 / 4619                  59 / 4619                 ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Page Cache Miss                              0 / 0                0 / 0                  15 / 1039                  3 / 1297                  ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Undo Object Allocation Failed                0 / 0                0 / 0                  0 / 24                     0 / 24                    ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Undo Object Allocation Succeeded             0 / 0                0 / 3                  1 / 708                    7 / 890                   ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Memory: Undo Page Allocation                         0 / 0                0 / 12                 30 / 3464                  7 / 3464                  ops/s
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Destroy Detached Pages             0.000 / 0.000        0.004 / 0.004          11.675 / 1484.886          1.155 / 1484.886          ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Mark                               0.000 / 0.000        7016.569 / 7016.569    11758.365 / 3901893.544    1103.558 / 3901893.544    ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Mark Continue                      0.000 / 0.000        0.000 / 0.000          1968.844 / 3674.454        1968.844 / 3674.454       ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Prepare Relocation Set             0.000 / 0.000        453.732 / 453.732      364.535 / 7103.720         39.453 / 7103.720         ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Process Non-Strong References      0.000 / 0.000        2.003 / 2.003          2.738 / 34.406             2.253 / 34.406            ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Relocate                           0.000 / 0.000        261.822 / 261.822      335.954 / 2207.669         45.868 / 2207.669         ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Reset Relocation Set               0.000 / 0.000        6.083 / 6.083          13.489 / 1128.678          3.574 / 1128.678          ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Concurrent Select Relocation Set              0.000 / 0.000        6.379 / 6.379          97.530 / 1460.679          18.439 / 1460.679         ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Pause Mark End                                0.000 / 0.000        4.420 / 4.420          6.219 / 26.498             6.474 / 40.883            ms
>>>>>> [2020-02-02T22:37:36.883+0000]   Phase: Pause Mark Start                              0.000 / 0.000        14.836 / 14.836        11.893 / 28.350            11.664 / 41.767           ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Phase: Pause Relocate Start                          0.000 / 0.000        13.411 / 13.411        30.849 / 697.344           11.995 / 697.344          ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Mark                            0.000 / 0.000        7015.793 / 7016.276    18497.265 / 3901893.075    1690.497 / 3901893.075    ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Mark Idle                       0.000 / 0.000        1.127 / 13.510         1.292 / 219.999            1.280 / 219.999           ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Mark Try Flush                  0.000 / 0.000        1.295 / 2.029          47.094 / 34869.359         4.797 / 34869.359         ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Mark Try Terminate              0.000 / 0.000        1.212 / 14.847         1.760 / 3799.238           1.724 / 3799.238          ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent References Enqueue              0.000 / 0.000        0.009 / 0.009          0.022 / 1.930              0.017 / 2.350             ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent References Process              0.000 / 0.000        0.599 / 0.599          0.768 / 23.966             0.495 / 23.966            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Weak Roots                      0.000 / 0.000        0.882 / 1.253          1.155 / 21.699             1.077 / 23.602            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Weak Roots JNIWeakHandles       0.000 / 0.000        0.301 / 0.943          0.308 / 10.868             0.310 / 23.219            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Weak Roots StringTable          0.000 / 0.000        0.289 / 0.496          0.390 / 12.794             0.363 / 22.907            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Concurrent Weak Roots VMWeakHandles        0.000 / 0.000        0.230 / 0.469          0.329 / 21.267             0.331 / 23.135            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Mark Try Complete                    0.000 / 0.000        0.000 / 0.000          0.501 / 4.801              0.480 / 17.208            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Remap TLABS                          0.000 / 0.000        0.252 / 0.252          0.195 / 0.528              0.226 / 3.451             ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Retire TLABS                         0.000 / 0.000        1.195 / 1.195          1.324 / 5.082              1.408 / 11.219            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Roots                                0.000 / 0.000        6.968 / 10.865         12.329 / 693.701           6.431 / 1300.994          ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Roots ClassLoaderDataGraph           0.000 / 0.000        4.819 / 8.232          9.635 / 693.405            3.476 / 693.405           ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Roots CodeCache                      0.000 / 0.000        0.842 / 2.731          0.996 / 83.553             0.780 / 83.553            ms
>>>>>> [2020-02-02T22:37:36.884+0000]   Subphase: Pause Roots JNIHandles                     0.000 / 0.000        1.171 / 6.314          0.866 / 17.875             0.837 / 25.708            ms
>>>>>>
>>>>>>
>>>>>> Thank you!
>>>>>> Prabhash Rathore
>>>>>>
>>>>>
>>>
>>>
>>
>>