Feedback on Shenandoah GC on b118 private build (2)

Roman Kennke rkennke at redhat.com
Wed Jun 6 07:20:51 UTC 2018


Hello Pedro,

I like both suggestions. Thank you! This is very useful.

Zhengyu implemented dynamic worker configuration a while back.
Zhengyu, can you explain how this is supposed to work, and if it could
be useful to Pedro? What do you think of
-XX:ConcToParallelGCThreadsRatio=X ? Does that make sense? Would you
like to implement it?

Aleksey, do you have time to improve the MXBeans stuff like Pedro suggested?

Best regards, Roman


> Thanks for the answers and clarifications.
> 
> If you allow me a couple of extra comments and suggestions, here they are:
> 
> 5.- In my tests at high load for "my" application, I have noticed an impact on the number of req/sec "our" application can handle during GC cycles, due to vCPU contention when ConcGCThreads=ParallelGCThreads (which in my case equals the vCPU count of 8).
> I know that the answer to that is setting ConcGCThreads to a lower value.
> However, that property takes an absolute value, so it requires knowing in advance the number of vCPUs the server (or virtual machine) is going to have.
> In virtualization environments and cloud deployments, where the number of vCPUs can be scaled up or down, setting an absolute value is not desirable.
> 
> I think it would be better to have a new property (e.g. ConcToParallelGCThreadsRatio, or whatever the name) that would allow setting the number of concurrent GC threads as a fraction of ParallelGCThreads.
> For instance, a value of 4 would represent 1/4 of ParallelGCThreads (= 25%), and a value of 2 would be 50%.
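> 
> As a quick illustration of the intended arithmetic (the flag name and the sketch are purely hypothetical, nothing like this exists today):
> 
>         // ConcToParallelGCThreadsRatio is a hypothetical flag; this only shows how the
>         // concurrent thread count would be derived from ParallelGCThreads.
>         class ConcRatioSketch {
>             static int concGCThreads(int parallelGCThreads, int ratio) {
>                 return Math.max(1, parallelGCThreads / ratio); // ratio=4 -> 25%, ratio=2 -> 50%, never below 1
>             }
>             public static void main(String[] args) {
>                 System.out.println(concGCThreads(8, 4)); // 2 threads on my 8 vCPU server
>                 System.out.println(concGCThreads(8, 2)); // 4 threads
>             }
>         }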
> 
> 
> 6.- I've observed that the values exported via MXBeans for the GC pauses look like this (a view from our application CLI, which we also use to generate graphs):
> 
> 	[CMD] java gc stats
> 	Collector          Count  Time(ms)
> 	-----------------  -----  --------
> 	Shenandoah Pauses    371      9864
> 	Shenandoah Cycles     98    456409
> 
> For our purposes, the interesting line is the one for the pauses. However, it doesn't allow distinguishing whether the counted pauses were normal pauses, Degenerated ones, or even Full GCs (in the output above, I had several Full GC and Degenerated pauses).
> I think it would be more useful if the pauses were divided into those 3 categories.
> That way, if we connect to one of our systems that has been running in a production network for several months, we could easily find out whether all pauses have been "normal", or whether there has been any Degenerated pause or even a Full GC (which would require action on our side to better tune the Shenandoah properties).
> And we could also find out the average pause duration for each category.
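> 
> For reference, those figures are read from the standard GarbageCollectorMXBean API; a minimal sketch of how such a table can be produced (illustrative only, not our exact CLI code):
> 
>         import java.lang.management.GarbageCollectorMXBean;
>         import java.lang.management.ManagementFactory;
> 
>         public class GcStats {
>             public static void main(String[] args) {
>                 System.out.printf("%-22s %6s %9s%n", "Collector", "Count", "Time(ms)");
>                 for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
>                     // With Shenandoah this currently yields the "Shenandoah Pauses" and "Shenandoah Cycles" beans
>                     System.out.printf("%-22s %6d %9d%n",
>                             gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
>                 }
>             }
>         }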
> 
> 
> Traditionally, CMS divides the pauses into those of the Young Generation and those of the Old Generation
> 	[CMD] java gc stats
> 	collector            count  time(ms)
> 	-------------------  -----  --------
> 	ParNew                 559     27196
> 	ConcurrentMarkSweep      4       660
> 
> And similarly for G1, split between minor+mixed pauses and Full GCs (IMHO, incorrectly called Old Generation):
> 	[CMD] java gc stats
> 	collector            count  time(ms)
> 	-------------------  -----  --------
> 	G1 Young Generation    147     13689
> 	G1 Old Generation        0         0
> 
> Additionally, given that traditional collectors only provide information about pauses, our application simply sums up all the "lines" in the MXBean output to determine the total pause time due to GC activity.
> However, with Shenandoah that is no longer possible, due to the "Shenandoah Cycles" line, which is not really a pause indicator.
> After all, I don't see that it provides much useful information, given that under normal conditions the cycle count should be pauses / 4.
> So I would propose removing it (though you probably have good reasons to keep it).
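> 
> For context, this is the kind of workaround our summation needs with Shenandoah today: skip the "Shenandoah Cycles" bean and sum only the pause entries (a sketch, reusing the imports from the snippet above):
> 
>         long totalPauseMs = ManagementFactory.getGarbageCollectorMXBeans().stream()
>                 .filter(gc -> !gc.getName().contains("Cycles"))   // "Shenandoah Cycles" is not a pause
>                 .mapToLong(GarbageCollectorMXBean::getCollectionTime)
>                 .sum();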
> 
> Therefore, my proposal would be to have something like:
> 	[CMD] java gc stats
> 	Collector               Count  Time(ms)
> 	----------------------  -----  --------
> 	Shenandoah Minor          371       605
> 	Shenandoah Degenerated      1       627
> 	Shenandoah Full             0         0
> 
> With the 3 types of pauses and without the line associated with the cycle count.
> 
> Thanks for your time and patience,
> 	Pedro
> 
> 
> PEDRO VITON 
> NOKIA
> AAA Specialist
> C/ Maria Tubau 9 (28050 Madrid - SPAIN)
> M: +34 690 964740 (ext. 2411 5746)
> pedro.viton at nokia.com
> 
> -----Original Message-----
> From: Roman Kennke [mailto:rkennke at redhat.com] 
> Sent: Tuesday, June 05, 2018 3:16 PM
> To: Viton, Pedro (Nokia - ES/Madrid) <pedro.viton at nokia.com>; Aleksey Shipilev <shade at redhat.com>
> Cc: shenandoah-dev at openjdk.java.net
> Subject: Re: Feedback on Shenandoah GC on b118 private build
> 
> Hi Pedro,
> 
>> Around 2 months ago, you took into consideration a suggestion I made on MXBeans and the pause time reported.
>> I had been waiting for that feature to be backported to the OpenJDK 8 shipped with RHEL 7.x, but since as of 8u171 it was still not included, I decided to give your private build a try.
> 
> Thanks! We're almost ready to ship a new version in RPMs. Not quite there yet, though...
> 
> 
>> I have some feedback on the b118 build I've used for my tests, which I include here in case you find it useful for improving the current implementation.
>>
>>
>> 1.- I've noticed that when the collector is running, the CPU usage of my server (8 vCPUs) goes up to 100%.
>> This surprised me because when the collector is not running, it is around 40% (see attached image), and with G1 and CMS I hadn't observed that behavior.
>>
>> It seems that G1 considers a ParallelGCThreads of 8 and a ConcGCThreads of 2 on my 8 vCPU server:
>>        $ java -XX:+UseG1GC -XX:+PrintFlagsFinal -version
>>        uintx ParallelGCThreads  = 8 {product}
>>        uintx ConcGCThreads     := 2 {product} <<<<<<<<<<<<<<
>>
>> However, Shenandoah considers 8 vCPUs for the concurrent phase, as well as for the parallel threads:
>>        $ java -XX:+UseShenandoahGC -XX:+PrintFlagsFinal -version
>>        uintx ConcGCThreads      = 8 {product}
>>        uintx ParallelGCThreads  = 8 {product} <<<<<<<<<<<<<<
>>
>> Is this the expected behavior?
>> I suppose not, since if the GC takes all vCPUs during the concurrent phase, it would greatly impact the Java application.
>> Additionally, https://wiki.openjdk.java.net/display/shenandoah/Main says the following about Degenerated pauses:
>> "The Degenerated cycle might be faster than the concurrent one, because it does not contend with the application over the resources, and it uses -XX:ParallelGCThreads, not -XX:ConcGCThreads for thread pool sizing"
>> I suppose that if the intention were to have the same # of threads for the pauses and the concurrent phases, that sentence wouldn't be there.
>> Personally, I would prefer ConcGCThreads < ParallelGCThreads, so as not to interfere with my application, or even with other processes that might be running on the server.
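>>
>> For what it's worth, the same values can also be read from inside a running JVM via the HotSpotDiagnosticMXBean, in addition to -XX:+PrintFlagsFinal (a minimal sketch):
>>
>>        import java.lang.management.ManagementFactory;
>>        import com.sun.management.HotSpotDiagnosticMXBean;
>>
>>        public class GcThreadFlags {
>>            public static void main(String[] args) {
>>                HotSpotDiagnosticMXBean hs =
>>                        ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
>>                for (String flag : new String[] {"ConcGCThreads", "ParallelGCThreads"}) {
>>                    // with Shenandoah on an 8 vCPU server, both currently print 8
>>                    System.out.println(flag + " = " + hs.getVMOption(flag).getValue());
>>                }
>>            }
>>        }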
> 
> It is currently the expected behaviour. In our measurements so far it does not impact application performance as much as we'd expect, and it has often proven beneficial to finish the concurrent phase as soon as possible. For example, longer concurrent phases also mean more memory usage and waste, e.g. through SATB.
> 
> 
>> 2.- I've noticed a very sharp spike in memory usage when a cycle starts (see attached image).
>> I think the default heuristics decide to start the cycle at around 70% of utilization (= 30% free heap), which in my case is around 14 GB.
>> But during the collection cycle memory usage rises to almost 18.5 GB, an extra 4.5 GB, which is much more than what my application is requesting to allocate for new objects.
>> Could this be because, during the evacuation phase, regions that have already been evacuated are not reclaimed until all regions have been evacuated and the Final Update Refs phase has completed?
>>
>> And I would suppose that regions that have already been evacuated could be reclaimed, as they are left without live objects, couldn't they?
>> Is it necessary to wait until all regions have been evacuated and for the final pause?
> 
> We cannot reclaim the regions until all references to old objects have been updated. Until that happens, yes, we need more memory because evacuated objects occupy both their old and new regions.
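> 
> In back-of-envelope terms (a rough model only, using the numbers from your description, not a Shenandoah formula):
> 
>         // Until references are updated and the last Concurrent cleanup runs, evacuated objects
>         // exist twice (old + new copy) and the application keeps allocating, so roughly:
>         //   peak ~= occupancy at cycle start + copies of cset live data + allocations during the cycle
>         long triggerMb = 14336;               // ~70% of the 20480 MB heap, where the cycle starts
>         long peakMb    = 18944;               // the ~18.5 GB top of the spike you observed
>         long extraMb   = peakMb - triggerMb;  // ~4.5 GB of copies + in-cycle allocations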
> 
>> 3.- I'm asking this because during my test run (less than 30 minutes), I've had a couple of Full GCs and several Degenerated pauses. I suppose these were due to this, or maybe due to the default heuristics:
>>
>> $ grep -i degenerated gc.log.0.current
>> [1077.149s][info][gc] GC(14) Cannot finish degeneration, upgrading to Full GC
>> [1082.720s][info][gc] GC(14) Pause Degenerated GC (Evacuation) 18634M->6227M(20480M) 5570.967ms
>> [1996.804s][info][gc] GC(58) Cannot finish degeneration, upgrading to Full GC
>> [2003.362s][info][gc] GC(58) Pause Degenerated GC (Evacuation) 18914M->6657M(20480M) 6557.576ms
>> [2232.686s][info][gc] GC(71) Pause Degenerated GC (Update Refs) 18839M->8434M(20480M) 136.747ms
>> [2705.913s][info][gc] GC(98) Pause Degenerated GC (Update Refs) 18983M->8164M(20480M) 429.783ms
>> [2959.040s][info][gc] GC(113) Pause Degenerated GC (Update Refs) 19138M->8705M(20480M) 117.288ms
>> [3209.396s][info][gc] GC(129) Pause Degenerated GC (Update Refs) 18947M->8762M(20480M) 724.390ms
>>
>>
>> I've managed to eliminate those Degenerated pauses by configuring the cycles to start at 50% of heap utilization:
>> -XX:+UnlockExperimentalVMOptions -XX:ShenandoahMinFreeThreshold=50 -XX:ShenandoahMaxFreeThreshold=50
>>
>> But somehow I feel that having to double the memory because of this can be a high price to pay for using Shenandoah.
>> It also implies that the default heuristics are not valid for my application (which might also be the case for other people with other applications). One of the advantages of G1 is that people don't have to worry about much JVM tuning.
>>
>> Obviously, this is not the end of the world, considering that RAM is cheap nowadays, and I was used to running CMS with the CMS cycle starting at 65% of OG occupancy, trying to prevent Full GCs due to fragmentation.
>> But I was wondering if something could be improved here, even if it implies extra micro-pauses during the evacuation phase.
>> I think that G1 works in a way similar to what I describe, as regions are reclaimed (and can be used for new allocations) as soon as they have been evacuated of live objects.
> 
> You might want to try the 'compact' heuristics, which trades some throughput for a lower footprint. If you can run with JDK 10 or JDK 11, you might also want to try a jdk11 build with the 'traversal' heuristics, which has only a single concurrent phase and can reclaim memory right after finishing that phase. It should also show better footprint characteristics.
> 
> 
>> Besides, and linking this to my previous comment: if ConcGCThreads ends up being 2 instead of 8 (on my 8 vCPU server), the evacuation phase might take longer to complete (4x longer), so new allocations during the evacuation phase would imply higher memory usage.
>> In my case, each cycle is fairly fast and only takes about 5s to complete (I'm using an old HP 360G7 server with DDR3 RAM, which is not comparable to newer HP G9 servers with DDR4 RAM). But if a cycle required 20s to complete, cycles should be started earlier and/or more frequently.
>>
>> [2574.325s][info][gc] GC(91) Pause Init Mark 2.788ms
>> [2577.122s][info][gc] GC(91) Concurrent marking 13846M->14542M(20480M) 2797.284ms
>> [2577.133s][info][gc] GC(91) Pause Final Mark 4.637ms
>> [2577.133s][info][gc] GC(91) Concurrent cleanup 14577M->14578M(20480M) 0.154ms
>> [2577.588s][info][gc] GC(91) Concurrent evacuation 14578M->16991M(20480M) 454.667ms
>> [2577.589s][info][gc] GC(91) Pause Init Update Refs 0.083ms
>> [2579.008s][info][gc] GC(91) Concurrent update references 16991M->17363M(20480M) 1419.574ms
>> [2579.013s][info][gc] GC(91) Pause Final Update Refs 1.776ms
>> [2579.019s][info][gc] GC(91) Concurrent cleanup 17363M->8060M(20480M) 5.905ms
>>
>>
>> 4.- Having said all this, I have to say that I'm really impressed with the low pauses I've observed with the Shenandoah GC.
>> I was expecting to see lower throughput in my application than with G1, or much higher memory usage due to the forwarding pointer added to all objects (and the fact that pointers are not compressed for heaps below 32 GB).
>> However, what I've observed is similar throughput and CPU usage (except during the cycles, when CPU spikes to 100%), and memory usage that is only around 10% higher (or even less) on static objects.
> 
> Very nice! Thank you! :-)
> 
> Let us know how it goes!
> 
> Roman
> 



