Big hiccups with ZGC
peter_booth at me.com
Fri Nov 9 10:49:14 UTC 2018
As I read this it all sounds very familiar. I wonder to what extent the design of ZGC was influenced by Azul’s Zing JVM,
and specifically by the C4 collector described seven or eight years ago in the paper https://www.azul.com/files/c4_paper_acm.pdf?
In 2011 I started a job at a high-traffic retail web site that ran on Azul Vega hardware. I was surprised to see apps run with
64GB heaps with negligible GC pauses. A couple of years later I took a contract at a financial firm that was standardizing
on Azul Zing’s software JVM. When I left I was running 380GB heaps with peak GC pauses of about 0.14ms.
One thing that Zing and ZGC seem to share is that what you know about CMS and G1 is unhelpful with a different collector.
What I learned with Zing was that Milton Friedman was correct - there’s no such thing as a free lunch. So if I want predictable
latencies with high throughput then the price I need to pay is a little more physical memory and additional CPU resources -
to do the continuous compacting. As Per suggests, using THP can have enormous latency issues. The lesson is that
your experience with other collectors can make it harder to make progress with a completely different collector. After spending
six years working with an atypical collector, my fervent advice is to have an open mind, don’t be attached to prior
understanding, pay attention to evidence, and be willing to build new mental models of how your JVM operates.
There was a good paper written by a group at IBM about ten years ago that described how poorly performing
Java apps were frequently the result of teams following “best practices”. The lesson is to read blogs like Mechanical Sympathy,
and the writing of Jeremy Eder, Cliff Click, Gil Tene, Martin Thompson, Neil Gunther, Nitsan Wakart, Heinz Kabutz, Charlie Hunt
and others...
> On 9 Nov 2018, at 4:32 AM, Per Liden <per.liden at oracle.com> wrote:
> On 11/8/18 6:22 PM, charlie hunt wrote:
>> Oh, a couple other quick things I noticed in the GC logs ...
>> You should consider making the following suggested system configuration change:
>> [2018-11-08T12:09:55.060+0000][0.006s][gc,init] ***** WARNING! INCORRECT SYSTEM CONFIGURATION DETECTED! *****
>> [2018-11-08T12:09:55.060+0000][0.006s][gc,init] The system limit on number of memory mappings per process might be too low for the given
>> [2018-11-08T12:09:55.060+0000][0.006s][gc,init] max Java heap size (51200M). Please adjust /proc/sys/vm/max_map_count to allow for at
>> [2018-11-08T12:09:55.060+0000][0.006s][gc,init] least 92160 mappings (current limit is 65530). Continuing execution with the current
>> [2018-11-08T12:09:55.060+0000][0.006s][gc,init] limit could lead to a fatal error, due to failure to map memory.
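For reference, that warning can be addressed with sysctl on Linux. A minimal sketch, using the minimum value the log itself asks for (the drop-in file path is just the conventional location):

```shell
# Raise the per-process memory-mapping limit to what the GC log requests.
# ZGC multi-maps the heap, so it needs far more mappings than other collectors.
sudo sysctl -w vm.max_map_count=92160

# Make the change persist across reboots (conventional sysctl drop-in path):
echo 'vm.max_map_count = 92160' | sudo tee /etc/sysctl.d/99-zgc.conf
```
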
>> Large pages are disabled as indicated by:
>> [2018-11-08T12:09:55.059+0000][0.005s][gc,init] Large Page Support: Disabled
>> ZGC tends to perform better with huge pages enabled. They are not required to run ZGC, but they should help. Enabling huge pages can be done by setting Linux transparent huge page support to "madvise" for both "enabled" and "defrag", and then adding the -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM command line options.
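Concretely, that might look like the following on a typical Linux box (sysfs paths as on mainline kernels; the JVM flags are the ones Charlie lists):

```shell
# Allow huge pages only where the application madvise()s for them,
# which is what -XX:+UseTransparentHugePages makes the JVM do.
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Then start the JVM with THP enabled and the heap pre-touched up front:
# java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
#      -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch ...
```
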
> Note that huge pages (aka large pages) come in two different "modes", explicit and transparent. Explicit huge pages will give you best performance, but requires you to actively configure the kernel's huge page pool. With transparent huge page you don't need to reserve memory in the kernel's huge page pool up front, but it can cause latency issues (the kernel will be doing extra work). See the ZGC wiki for more information on this:
> https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingLargePages <https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingLargePages>
> https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingTransparentHugePages <https://wiki.openjdk.java.net/display/zgc/Main#Main-EnablingTransparentHugePages>
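For the explicit variant Per mentions, a minimal sketch (the page count here is illustrative, sized for a hypothetical 16 GB heap of 2 MB pages; see the wiki links above for the full procedure):

```shell
# Reserve 8192 x 2 MB = 16 GB of explicit huge pages in the kernel's pool.
echo 8192 | sudo tee /proc/sys/vm/nr_hugepages

# Then run with explicit large pages instead of THP:
# java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseLargePages -Xmx16g ...
```
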
>> On 11/8/18 11:12 AM, charlie hunt wrote:
>>> Hi Alex,
>>> Did a quick look at the first two GC logs. Haven't had a chance to look at the 3rd.
>>> A couple tips that may help you as you continue your looking at ZGC.
>>> - If you see "Allocation Stall" in the GC log, such as "Allocation Stall (qtp1059634518-72) 15.108ms", this means that ZGC has slowed down the application thread(s) because you are running out of available heap space. In other words, GC lost the race of reclaiming space with the allocation rate.
>>> When you see these "Allocation Stall" messages in the GC log, there are a couple options, (one of these or a combination should resolve what you are seeing):
>>> a.) Increase the number of concurrent GC threads. This will help ZGC win the race. In your first GC log, there are 8 concurrent GC threads. It probably needs 10 or 12 concurrent GC threads in the absence of making other changes.
>>> b.) Increase the size of the Java heap to offer ZGC additional head room.
>>> c.) Make changes to the application to either reduce the amount of live data, or reduce the allocation rate.
>>> If you reduce cache sizes as you mentioned, this should help avoid the "Allocation Stalls".
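Putting options (a) and (b) on a command line, as a sketch (the thread count and heap size below are made-up illustrations, not recommendations for this workload):

```shell
# Option (a): more concurrent GC threads, so ZGC can keep up with the
# allocation rate. Option (b): a larger heap for extra head room.
# Keep GC logging on to watch whether "Allocation Stall" messages disappear.
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
     -XX:ConcGCThreads=12 \
     -Xmx64g \
     -Xlog:gc \
     -jar app.jar
```
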
>>> On 11/8/18 9:57 AM, Alex Yakushev wrote:
>>>> A quick follow up. I think we figured out what's going on – there is not enough free heap to deal with the allocation rate. You see, we have a cache inside the program whose size was tuned with G1 enabled. Apparently, ZGC (and Shenandoah too, got the same problems with it today) inflates the size of the cache in bytes (because of the overhead), which leaves less breathing room for ZGC/Shenandoah to work. Will try to reduce the cache size and come back with the results.