Big hiccups with ZGC

Fri Nov 9 12:55:18 UTC 2018

On 11/9/18 11:49 AM, Peter Booth wrote:
> As I read this it all sounds very familiar. I wonder to what extent
> the design of ZGC was influenced by Azul’s Zing JVM, and specifically
> by the collector described seven or eight years ago in the paper
> https://www.azul.com/files/c4_paper_acm.pdf?
> 
> In 2011 I started a job at a high-traffic retail web site that ran on 
> Azul Vega hardware. I was surprised to see apps run with
> 64GB heaps with negligible GC pauses. A couple of years later I took a 
> contract at a financial firm that was standardizing
> on Azul Zing’s software JVM. When I left I was running 380GB heaps with 
> peak GC pauses of about 0.14ms.
> 
> But the thing that Zing and ZGC seem to share is that what you know
> about CMS and G1 is unhelpful with a different collector.
> What I learned with Zing was that Milton Friedman was correct - there’s
> no such thing as a free lunch. So if I want predictable
> latencies with high throughput then the price I need to pay is a little
> more physical memory and additional CPU resources -
> to do the continuous compacting. As Per suggests, using THP can cause
> enormous latency issues. The lesson is that
> your experience with other collectors can make it harder to make
> progress with a completely different collector. After spending
> six years working with an atypical collector my fervent advice is to
> have an open mind, don’t be attached to prior
> understanding, pay attention to evidence, and be willing to create
> new mental models of how your JVM operates.

Yes, you're right. Tuning a concurrent collector requires a different 
mindset compared to tuning a traditional collector. With a concurrent 
collector, like ZGC, you're essentially tuning to avoid allocation 
stalls, i.e. tuning so that garbage is collected at least as fast as 
it's created. The two main options to play with are -Xmx (give ZGC more 
heap headroom) and -XX:ConcGCThreads (give ZGC more CPU time).
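
To make the race concrete, here is a minimal back-of-the-envelope sketch in Python. It is a deliberate simplification and not how ZGC's heuristics actually work: it assumes a constant allocation rate and that no memory is reclaimed until a GC cycle finishes. The function names and all numbers are made up for illustration.

```python
def seconds_until_exhaustion(free_heap_mb, alloc_rate_mb_per_s):
    """Time until the free heap is used up at a constant allocation rate."""
    if alloc_rate_mb_per_s <= 0:
        return float("inf")
    return free_heap_mb / alloc_rate_mb_per_s


def gc_wins_race(free_heap_mb, alloc_rate_mb_per_s, gc_cycle_s):
    """True if a GC cycle of the given length finishes before the heap runs out.

    Simplifying assumption: nothing is reclaimed until the cycle completes.
    """
    return gc_cycle_s <= seconds_until_exhaustion(free_heap_mb, alloc_rate_mb_per_s)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 MB/s allocation rate, 5 s GC cycle.
    print(gc_wins_race(4000, 1000, 5))  # too little headroom
    print(gc_wins_race(8000, 1000, 5))  # more headroom (think: larger -Xmx)
    print(gc_wins_race(4000, 1000, 3))  # shorter cycle (think: more GC threads)
```

In this model a larger -Xmx shows up as a larger free_heap_mb, and more -XX:ConcGCThreads is what one would hope shortens gc_cycle_s. Either change can turn a lost race into a won one.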

It's a bit of an educational challenge, but I'm hopeful this knowledge 
will spread as the use of concurrent collectors becomes more and more 
common.

cheers,
Per

> 
> There was a good paper written by a group at IBM at least ten years ago
> that described how poorly performing Java apps were frequently the
> result of teams following “best practices”. The lesson is to read blogs
> like Mechanical Sympathy, and the writing of Jeremy Eder, Cliff Click,
> Gil Tene, Martin Thompson, Neil Gunther, Nitsan Wakart, Heinz Kabutz,
> Charlie Hunt and others ...
> 
> Peter Booth
> 
> 
>> On 9 Nov 2018, at 4:32 AM, Per Liden <per.liden at oracle.com> wrote:
>>
>> Hi,
>>
>> On 11/8/18 6:22 PM, charlie hunt wrote:
>>> Oh, a couple other quick things I noticed in the GC logs ...
>>> You should consider making the following suggested system 
>>> configuration change:
>>> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] ***** WARNING! INCORRECT SYSTEM CONFIGURATION DETECTED! *****
>>> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] The system limit on number of memory mappings per process might be too low for the given
>>> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] max Java heap size (51200M). Please adjust /proc/sys/vm/max_map_count to allow for at
>>> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] least 92160 mappings (current limit is 65530). Continuing execution with the current
>>> [2018-11-08T12:09:55.060+0000][0.006s][17][gc,init] limit could lead to a fatal error, due to failure to map memory.
>>> Large pages are disabled as indicated by:
>>> [2018-11-08T12:09:55.059+0000][0.005s][17][gc,init] Large Page Support: Disabled
>>> ZGC tends to perform better with huge pages enabled. It is not 
>>> required to run ZGC, but it should help. Huge pages can be enabled 
>>> by setting both the "enabled" and "defrag" settings of Linux 
>>> transparent huge pages to "madvise", and then adding the 
>>> -XX:+UseTransparentHugePages and -XX:+AlwaysPreTouch JVM command 
>>> line options.
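
The max_map_count figure in the warning quoted above can be reproduced with a little arithmetic. The sketch below is in Python and rests on an assumption that is not stated anywhere in this thread: 2 MB heap granules, three mappings per granule, and 20% headroom. That formula happens to give exactly 92160 for a 51200M heap, but treat it as illustrative. The function and constant names are made up for the example.

```python
# Assumed constants (not taken from the thread): they reproduce the
# "at least 92160 mappings" figure for a 51200M max heap.
GRANULE_MB = 2
MAPPINGS_PER_GRANULE = 3


def required_max_map_count(max_heap_mb):
    """Estimate the mappings needed for a heap of max_heap_mb megabytes."""
    granules = max_heap_mb // GRANULE_MB
    # 20% headroom, done in integer arithmetic to avoid float rounding.
    return granules * MAPPINGS_PER_GRANULE * 6 // 5


def current_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current limit, or return None if it cannot be read (e.g. not on Linux)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    needed = required_max_map_count(51200)
    current = current_max_map_count()
    print(f"needed >= {needed}, current = {current}")
    if current is not None and current < needed:
        print(f"limit too low; as root, e.g.: sysctl -w vm.max_map_count={needed}")
```

With the default limit of 65530 reported in the log, the check fails for a 51200M heap, which matches the warning.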
>>
>> Note that huge pages (aka large pages) come in two different "modes", 
>> explicit and transparent. Explicit huge pages will give you the best 
>> performance, but require you to actively configure the kernel's huge 
>> page pool. With transparent huge pages you don't need to reserve memory 
>> in the kernel's huge page pool up front, but they can cause latency 
>> issues (the kernel will be doing extra work). See the ZGC wiki for 
>> more information on this:
>>
>> https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingLargePages
>>
>> https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingTransparentHugePages
>>
>> cheers,
>> Per
>>
>>> hths,
>>> charlie
>>> On 11/8/18 11:12 AM, charlie hunt wrote:
>>>> Hi Alex,
>>>>
>>>> Did a quick look at the first two GC logs. Haven't had a chance to 
>>>> look at the 3rd.
>>>>
>>>> A couple tips that may help you as you continue looking at ZGC.
>>>>
>>>> - If you see "Allocation Stall" in the GC log, such as "Allocation 
>>>> Stall (qtp1059634518-72) 15.108ms", this means that ZGC has slowed 
>>>> down the application thread(s) because you are running out of 
>>>> available heap space. In other words, GC lost the race of reclaiming 
>>>> space with the allocation rate.
>>>>
>>>> When you see these "Allocation Stall" messages in the GC log, there 
>>>> are a couple options, (one of these or a combination should resolve 
>>>> what you are seeing):
>>>> a.) Increase the number of concurrent GC threads. This will help ZGC 
>>>> win the race. In your first GC log, there are 8 concurrent GC 
>>>> threads. It probably needs 10 or 12 concurrent GC threads in the 
>>>> absence of making other changes.
>>>> b.) Increase the size of the Java heap to offer ZGC additional head 
>>>> room.
>>>> c.) Make changes to the application to either reduce the amount of 
>>>> live data, or reduce the allocation rate.
>>>>
>>>> If you reduce cache sizes as you mentioned, this should help avoid 
>>>> the "Allocation Stalls".
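
For anyone wanting to count these stalls across a whole GC log, here is a minimal sketch in Python. It assumes the stall message looks like the example quoted above, "Allocation Stall (qtp1059634518-72) 15.108ms"; the exact format may differ between JDK versions, so adjust the pattern as needed. The function name is made up for the example.

```python
import re

# Matches the message format quoted in this thread, e.g.
# "Allocation Stall (qtp1059634518-72) 15.108ms"
STALL_RE = re.compile(
    r"Allocation Stall \((?P<thread>[^)]+)\) (?P<ms>\d+(?:\.\d+)?)ms"
)


def summarize_stalls(lines):
    """Return count, total and worst stall time (ms) found in the given log lines."""
    durations = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            durations.append(float(match.group("ms")))
    return {
        "count": len(durations),
        "total_ms": sum(durations),
        "max_ms": max(durations, default=0.0),
    }


if __name__ == "__main__":
    # Both sample lines are taken from this thread; only the first is a stall.
    sample = [
        "Allocation Stall (qtp1059634518-72) 15.108ms",
        "[2018-11-08T12:09:55.059+0000][0.005s][17][gc,init] Large Page Support: Disabled",
    ]
    print(summarize_stalls(sample))
```

To run it over a real log, pass an open file instead of the sample list, for example summarize_stalls(open("gc.log")), where gc.log is a placeholder file name. A count that drops to zero after a tuning change suggests the GC is now winning the race.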
>>>>
>>>> hths,
>>>>
>>>> charlie
>>>>
>>>> On 11/8/18 9:57 AM, Alex Yakushev wrote:
>>>>> A quick follow up. I think we figured what's going on – there is 
>>>>> not enough free heap to deal with the allocation rate. You see, we 
>>>>> have a cache inside the program the size of which was tuned with G1 
>>>>> enabled. Apparently, ZGC (and Shenandoah too, got the same problems 
>>>>> with it today) inflates the size of the cache in bytes (because of 
>>>>> the overhead) which leaves less breathing room for ZGC/Shenandoah 
>>>>> to work. Will try to reduce the cache size and come back with the 
>>>>> results.
>