Feedback on Shenandoah GC on b118 private build
Viton, Pedro (Nokia - ES/Madrid)
pedro.viton at nokia.com
Mon Jun 4 10:49:35 UTC 2018
Hi Aleksey (and others):
Around 2 months ago, you took into consideration a suggestion I made about MXBeans and the pause time they report.
I had been waiting for that feature to be backported to the OpenJDK 8 shipped with RHEL 7.x, but since it was still not included as of 8u171, I've decided to give your private build a try.
I have some feedback on the b118 build I've used for my tests, which I include here in case you find it useful for improving the current implementation.
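In case it's useful context, this is roughly how I read the numbers, just iterating the standard GarbageCollectorMXBean API (a minimal sketch; I'm not assuming any particular bean names, since those differ between collectors and builds):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseReport {
    public static void main(String[] args) {
        // Print the accumulated collection count and time for every collector
        // bean the JVM exposes; with the fix, the Shenandoah bean should account
        // only stop-the-world pause time rather than the whole concurrent cycle.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-30s count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}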
1.- I've noticed that when the collector is running, the CPU usage of my server (8 vCPUs) goes up to 100%.
This surprised me, because when the collector is not running it is around 40% (see attached image),
and I hadn't observed that behavior with G1 or CMS.
It seems that G1 chooses ParallelGCThreads=8 and ConcGCThreads=2 on my 8-vCPU server:
$ java -XX:+UseG1GC -XX:+PrintFlagsFinal -version
…
uintx ParallelGCThreads = 8 {product}
uintx ConcGCThreads := 2 {product} <<<<<<<<<<<<<<
However, Shenandoah sizes both the concurrent and the parallel (pause) thread pools to all 8 vCPUs:
$ java -XX:+UseShenandoahGC -XX:+PrintFlagsFinal -version
…
uintx ConcGCThreads = 8 {product}
uintx ParallelGCThreads = 8 {product} <<<<<<<<<<<<<<
Is this the expected behavior?
I suppose not: if the GC takes all vCPUs during the concurrent phases, it greatly impacts the Java application.
Additionally, https://wiki.openjdk.java.net/display/shenandoah/Main says the following about Degenerated pauses:
" The Degenerated cycle might be faster than the concurrent one, because it does not contend with the application over the resources, and it uses -XX:ParallelGCThreads, not -XX:ConcCGThreads for thread pool sizing"
I suppose that if the intention were to use the same number of threads for the pauses and for the concurrent phases, that sentence wouldn't be there.
Personally, I would prefer ConcGCThreads to be lower than ParallelGCThreads, so that the GC does not interfere with my application, or even with other processes that might be running on the server.
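In the meantime, I suppose I could cap the concurrent workers explicitly with something like the line below (I haven't verified that this build honors the flag for Shenandoah, and the value 2 is just what G1 picks on this machine):
$ java -XX:+UseShenandoahGC -XX:ConcGCThreads=2 -XX:ParallelGCThreads=8 ...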
2.- I've noticed a very sharp spike in memory usage once a cycle has started (see attached image).
I think the default heuristic decides to start the cycle at around 70% heap utilization (= 30% free heap), which in my case is around 14 GB.
But during the collection cycle memory usage rises to almost 18.5 GB, an extra ~4.5 GB, which is much more than what my application is allocating for new objects.
Could this be because, during the evacuation phase, regions that have already been evacuated are not reclaimed until all regions have been evacuated and the Final Update Refs phase has completed?
And I suppose regions that have already been evacuated could be reclaimed, since they are left without live objects, couldn't they?
Is it necessary to wait until all regions have been evacuated and the final pause has run?
3.- I'm asking this because during my test run (less than 30 minutes) I've had a couple of Full GCs and several Degenerated pauses, which I suppose are due to this, or maybe to the default heuristics:
$ grep -i degenerated gc.log.0.current
[1077.149s][info][gc] GC(14) Cannot finish degeneration, upgrading to Full GC
[1082.720s][info][gc] GC(14) Pause Degenerated GC (Evacuation) 18634M->6227M(20480M) 5570.967ms
[1996.804s][info][gc] GC(58) Cannot finish degeneration, upgrading to Full GC
[2003.362s][info][gc] GC(58) Pause Degenerated GC (Evacuation) 18914M->6657M(20480M) 6557.576ms
[2232.686s][info][gc] GC(71) Pause Degenerated GC (Update Refs) 18839M->8434M(20480M) 136.747ms
[2705.913s][info][gc] GC(98) Pause Degenerated GC (Update Refs) 18983M->8164M(20480M) 429.783ms
[2959.040s][info][gc] GC(113) Pause Degenerated GC (Update Refs) 19138M->8705M(20480M) 117.288ms
[3209.396s][info][gc] GC(129) Pause Degenerated GC (Update Refs) 18947M->8762M(20480M) 724.390ms
I've managed to eliminate those Degenerated pauses by configuring the cycles to start at 50% of heap utilization:
-XX:+UnlockExperimentalVMOptions -XX:ShenandoahMinFreeThreshold=50 -XX:ShenandoahMaxFreeThreshold=50
But somehow I feel that having to double the memory because of this can be a high price to pay for using Shenandoah.
It also implies that the default heuristics are not valid for my application (and the same might be true for other people with other applications). One of the advantages of G1 is that people don't have to worry about much JVM tuning.
Obviously this is not the end of the world, considering that RAM is cheap nowadays, and I was already used to starting the CMS cycle at 65% of old generation occupancy to try to prevent Full GCs due to fragmentation.
But I was wondering if something could be improved here, even if it implies extra micro-pauses during the evacuation phase.
I think G1 works in the way I describe: regions are reclaimed (and can be used for new allocations) as soon as their live objects have been evacuated.
Besides, linking this to my previous comment: if ConcGCThreads ended up being 2 instead of 8 (on my 8-vCPU server), the evacuation phase might take up to 4x longer to complete, so new allocations during the evacuation phase would imply even higher memory usage.
In my case each cycle is fairly fast and only takes about 5 s to complete (I'm using an old HP 360G7 server with DDR3 RAM chips, which is not comparable to newer HP G9 servers with DDR4 RAM chips). But if a cycle required 20 s to complete, it would have to be started earlier and/or more frequently:
[2574.325s][info][gc] GC(91) Pause Init Mark 2.788ms
[2577.122s][info][gc] GC(91) Concurrent marking 13846M->14542M(20480M) 2797.284ms
[2577.133s][info][gc] GC(91) Pause Final Mark 4.637ms
[2577.133s][info][gc] GC(91) Concurrent cleanup 14577M->14578M(20480M) 0.154ms
[2577.588s][info][gc] GC(91) Concurrent evacuation 14578M->16991M(20480M) 454.667ms
[2577.589s][info][gc] GC(91) Pause Init Update Refs 0.083ms
[2579.008s][info][gc] GC(91) Concurrent update references 16991M->17363M(20480M) 1419.574ms
[2579.013s][info][gc] GC(91) Pause Final Update Refs 1.776ms
[2579.019s][info][gc] GC(91) Concurrent cleanup 17363M->8060M(20480M) 5.905ms
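(A rough back-of-the-envelope check based on the GC(91) numbers above: during concurrent marking the heap grows 14542 - 13846 ≈ 700 MB in ~2.8 s, and during concurrent update references another 17363 - 16991 ≈ 370 MB in ~1.4 s, so my allocation rate is roughly 250 MB/s; the jump during evacuation is mostly copies of live objects rather than new allocations. At that rate a 20 s cycle would need on the order of 5 GB of headroom on top of the copy reserve, which is why I say the cycle would have to start earlier.)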
4.- Having said all this, I have to say that I'm really impressed with the low pauses I've observed with the Shenandoah GC.
I was expecting to see lower throughput in my application than with G1, or much higher memory usage due to the forwarding pointer added to all objects (and the fact that, with Shenandoah, pointers are not compressed even for heaps below 32 GB).
However, what I've observed is similar throughput and CPU usage (except during the cycles, when CPU spikes to 100%), and memory usage that is only around 10% higher (or even less) for static objects.
Thanks,
Pedro
PEDRO VITON
NOKIA
AAA Specialist
C/ Maria Tubau 9 (28050 Madrid - SPAIN)
M: +34 690 964740 (ext. 2411 5746)
pedro.viton at nokia.com
-----Original Message-----
From: Aleksey Shipilev [mailto:shade at redhat.com]
Sent: Thursday, April 12, 2018 9:53 PM
To: Viton, Pedro (Nokia - ES/Madrid) <pedro.viton at nokia.com>; shenandoah-dev at openjdk.java.net
Subject: Re: MXBeans to report pause times seem to indicate the complete GC cycle time
On 04/11/2018 04:31 PM, Aleksey Shipilev wrote:
> On 04/04/2018 04:56 PM, Viton, Pedro (Nokia - ES/Madrid) wrote:
>> I think that Shenandoah GC should only report in the MXBean the time
>> it has paused the Java application, in the same way it is done with
>> G1 and CMS+ParNew, and not the whole processing time (pause +
>> concurrent time)
>
> Thanks for the report! This makes sense. Zhengyu is going to look into this.
Zhengyu pushed the fix to sh/jdk:
http://hg.openjdk.java.net/shenandoah/jdk/rev/040b0d8b5f24
It would take a while for the fix to propagate to backports and RPMs. You may want to pick up the latest binary build for sh/jdk, and see if it does what you suggested. Builds b57 and later here have the fix:
https://builds.shipilev.net/openjdk-shenandoah-jdk/
Thanks,
-Aleksey