Big hiccups with ZGC
Per Liden
per.liden at oracle.com
Fri Nov 9 09:32:37 UTC 2018
Hi,
On 11/8/18 6:22 PM, charlie hunt wrote:
> Oh, a couple other quick things I noticed in the GC logs ...
>
> You should consider making the following suggested system configuration
> change:
>
> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] ***** WARNING!
> INCORRECT SYSTEM CONFIGURATION DETECTED! *****
> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] The system limit on
> number of memory mappings per process might be too low for the given
> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] max Java heap size
> (51200M). Please adjust /proc/sys/vm/max_map_count to allow for at
> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] least 92160 mappings
> (current limit is 65530). Continuing execution with the current
> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] limit could lead to
> a fatal error, due to failure to map memory.
>
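As a side note on that warning: the mapping limit can be raised with sysctl.
A minimal sketch, assuming root access and using the 92160 value suggested
in the warning itself (the exact value depends on your max heap size):

  # Raise the limit for the running system (not persistent across reboots)
  sysctl -w vm.max_map_count=92160

  # Or make it persistent across reboots
  echo "vm.max_map_count = 92160" >> /etc/sysctl.conf
  sysctl -p
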
> Large pages are disabled as indicated by:
> [2018-11-08T12:09:55.059+0000][0.005s][17][gc,init] Large Page Support:
> Disabled
>
> ZGC tends to perform better with huge pages enabled. It is not required
> to run ZGC, but it should help. Enabling huge pages can be done by
> setting Linux transparent huge pages to "madvise" for both transparent
> huge pages "enabled" and "defrag", and then adding
> -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM command line options.
Note that huge pages (aka large pages) come in two different "modes",
explicit and transparent. Explicit huge pages will give you the best
performance, but require you to actively configure the kernel's huge
page pool. With transparent huge pages you don't need to reserve memory
in the kernel's huge page pool up front, but they can cause latency
issues (the kernel will be doing extra work). See the ZGC wiki for more
information on this:
https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingLargePages
https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingTransparentHugePages
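
As a rough sketch of the transparent huge page route (see the wiki pages
above for the authoritative steps; the JVM flags below are the ones from
charlie's mail):

  # Let applications opt in to transparent huge pages via madvise()
  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

  # Then run the JVM with:
  java ... -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch ...

For the explicit route you would instead reserve pages in the kernel's huge
page pool (e.g. something on the order of 25600 2M pages for a 50G heap,
plus a bit extra) and run with -XX:+UseLargePages. Again, the wiki has the
details.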
cheers,
Per
>
> hths,
>
> charlie
>
> On 11/8/18 11:12 AM, charlie hunt wrote:
>> Hi Alex,
>>
>> Did a quick look at the first two GC logs. Haven't had a chance to
>> look at the 3rd.
>>
>> A couple of tips that may help you as you continue looking at ZGC.
>>
>> - If you see "Allocation Stall" in the GC log, such as "Allocation
>> Stall (qtp1059634518-72) 15.108ms", this means that ZGC has slowed
>> down the application thread(s) because you are running out of
>> available heap space. In other words, GC lost the race to reclaim
>> space against the application's allocation rate.
>>
>> When you see these "Allocation Stall" messages in the GC log, there
>> are a couple of options (one of these, or a combination of them, should
>> resolve what you are seeing):
>> a.) Increase the number of concurrent GC threads. This will help ZGC
>> win the race. In your first GC log, there are 8 concurrent GC threads.
>> It probably needs 10 or 12 concurrent GC threads in the absence of
>> other changes.
>> b.) Increase the size of the Java heap to offer ZGC additional headroom.
>> c.) Make changes to the application to either reduce the amount of
>> live data, or reduce the allocation rate.
>>
>> If you reduce cache sizes as you mentioned, this should help avoid the
>> "Allocation Stalls".
>>
>> hths,
>>
>> charlie
>>
>> On 11/8/18 9:57 AM, Alex Yakushev wrote:
>>> A quick follow-up. I think we figured out what's going on: there is
>>> not enough free heap to deal with the allocation rate. You see, we have
>>> a cache inside the program whose size was tuned with G1 enabled.
>>> Apparently, ZGC (and Shenandoah too; we got the same problems with it
>>> today) inflates the size of the cache in bytes (because of the
>>> overhead), which leaves less breathing room for ZGC/Shenandoah to
>>> work. We will try to reduce the cache size and come back with the results.