Big hiccups with ZGC

Fri Nov 9 09:23:44 UTC 2018

Hi,

On 11/8/18 6:12 PM, charlie hunt wrote:
> Hi Alex,
> 
> Did a quick look at the first two GC logs. Haven't had a chance to look 
> at the 3rd.
> 
> A couple tips that may help you as you continue looking at ZGC.
> 
> - If you see "Allocation Stall" in the GC log, such as "Allocation Stall 
> (qtp1059634518-72) 15.108ms", this means that ZGC has slowed down the 
> application thread(s) because you are running out of available heap 
> space. In other words, GC lost the race of reclaiming space with the 
> allocation rate.
> 
> When you see these "Allocation Stall" messages in the GC log, there are 
> a couple of options (one of these, or a combination, should resolve what 
> you are seeing):
> a.) Increase the number of concurrent GC threads. This will help ZGC win 
> the race. In your first GC log, there are 8 concurrent GC threads. It 
> probably needs 10 or 12 concurrent GC threads in the absence of making 
> other changes.
> b.) Increase the size of the Java heap to offer ZGC additional head room.
> c.) Make changes to the application to either reduce the amount of live 
> data, or reduce the allocation rate.
> 
> If you reduce cache sizes as you mentioned, this should help avoid the 
> "Allocation Stalls".

I think Charlie summarized it very well and I don't have much to add, 
other than that I noticed the live set seems to grow and grow throughout 
the run (see the "Live:" column in the heap stats). Maybe this is the 
"cache" you mentioned that is growing?

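To make the effect of a growing live set a bit more concrete, here is a 
rough sketch of the "race" Charlie describes. The class name, the method 
names and all the numbers are made up for illustration (they are not taken 
from your logs), and the model is deliberately simplistic: it assumes a 
constant allocation rate and a fixed GC cycle time, which is not how ZGC 
actually decides anything.

```java
public class HeadroomSketch {
    // Space left for new allocations once the live set is accounted for.
    static long headroomBytes(long maxHeapBytes, long liveBytes) {
        return Math.max(0L, maxHeapBytes - liveBytes);
    }

    // Simplified model: a stall is likely when the application can fill
    // the remaining headroom before one GC cycle has time to complete.
    static boolean stallLikely(long maxHeapBytes, long liveBytes,
                               long allocBytesPerSec, double gcCycleSeconds) {
        double secondsToFill =
            (double) headroomBytes(maxHeapBytes, liveBytes) / allocBytesPerSec;
        return secondsToFill < gcCycleSeconds;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // Hypothetical: 16G heap, 2G/s allocation rate, 2s per GC cycle.
        System.out.println(stallLikely(16 * gb, 8 * gb, 2 * gb, 2.0));  // false
        System.out.println(stallLikely(16 * gb, 14 * gb, 2 * gb, 2.0)); // true
    }
}
```

With the same heap size and allocation rate, growing the live set from 8G 
to 14G is enough to flip the result, which would be consistent with 
stalls showing up later in the run as the cache fills.
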
The only other thing that sticks out from the logs is this:

[2018-11-07T16:28:14.753+0000][0.007s][16][gc,init] CPUs: 36 total, 1 available

I.e. HotSpot thinks it only has a single core to play with (at least when 
the VM is starting up). Is this workload running in a container or in 
some other constrained environment (e.g. numactl)?
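
A quick way to cross-check this from inside the same environment is to 
ask the JVM how many processors it believes it can use. The small class 
below is only a sketch (the class and method names are mine); it relies 
on the standard Runtime.availableProcessors() call, which as far as I 
know reflects the same kind of CPU restriction as the "available" number 
in the log line, although the log reports the value at startup and the 
two could differ if the restriction changes later.

```java
public class CpuCheck {
    // Number of processors the JVM currently believes it may use.
    static int availableCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("JVM sees " + availableCpus() + " available CPU(s)");
    }
}
```

Run it the same way the application is launched (same container, same 
numactl or taskset wrapper). If it also reports 1, the restriction comes 
from the environment and not from ZGC. If the environment cannot be 
changed, -XX:ActiveProcessorCount=<n> can be used to override what the VM 
detects.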

cheers,
Per

> 
> hths,
> 
> charlie
> 
> On 11/8/18 9:57 AM, Alex Yakushev wrote:
>> A quick follow-up. I think we figured out what's going on – there is not 
>> enough free heap to deal with the allocation rate. You see, we have a 
>> cache inside the program the size of which was tuned with G1 enabled. 
>> Apparently, ZGC (and Shenandoah too, got the same problems with it 
>> today) inflates the size of the cache in bytes (because of the 
>> overhead) which leaves less breathing room for ZGC/Shenandoah to work. 
>> Will try to reduce the cache size and come back with the results.