Shenandoah performance problem

Wed Oct 16 23:01:29 UTC 2019

Hi!

I'm comparing the performance of different GC implementations against a 
low-latency project. During the initialization of the project, we load a 
bunch of key-value pairs from memcached to warm up the local instance 
cache. The values in memcached are serialized and zipped, so we are 
unzipping and deserializing them after they are loaded. In order to 
utilize all cores we use a ThreadPoolExecutor to do this operation using 
multiple threads.
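
As a rough illustration of the warm-up step described above, here is a minimal sketch. It is not the project's actual code: the class and method names (`WarmupSketch`, `gzip`, `gunzip`, `decodeAll`) are hypothetical, plain strings stand in for the real serialized values, and `Executors.newFixedThreadPool` (which is backed by a ThreadPoolExecutor) stands in for the real pool configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class WarmupSketch {

    // Stand-in for what the cache writer does: gzip a value before storing it.
    static byte[] gzip(String value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
                out.write(value.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Unzip one payload; decoding to a String stands in for deserialization.
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode all payloads on a fixed-size pool, preserving input order.
    static List<String> decodeAll(List<byte[]> payloads, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (byte[] payload : payloads) {
                futures.add(pool.submit(() -> gunzip(payload)));
            }
            List<String> decoded = new ArrayList<>();
            for (Future<String> future : futures) {
                decoded.add(future.get());
            }
            return decoded;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `WarmupSketch.decodeAll(payloads, Runtime.getRuntime().availableProcessors())` would spread the unzip work across all cores, which is the pattern whose CPU time is compared below.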

One interesting thing I noticed is that the CPU time burned by the 
transcode pool during startup is significantly higher with Shenandoah 
than with the other GC algorithms:

CMS: ~29 sec
ZGC: ~36 8sec
Shenandoah: ~ 270 sec

Since the code is quite complex, I've tried to simulate roughly what is 
happening by creating a microbenchmark.
You can fetch it from here: https://github.com/axt/jmh-unzip-mt
(Please note that the JDK path and the JVM arguments are currently 
hardcoded in BechmarkRunner; you will need to edit those manually to 
get it working.)

I've executed the benchmark against a freshly built JDK from here:
$ hg clone http://hg.openjdk.java.net/shenandoah/jdk shenandoah

The benchmark yields the following results:
```
# JMH version: 1.21
# VM version: JDK 14-internal, OpenJDK 64-Bit Server VM, 14-internal+0-adhoc.axt.shenandoah
# VM invoker: /fast/shenandoah/build/linux-x86_64-server-release/images/jdk/bin/java
# Warmup: 5 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: axt.benchmark.TranscodeBenchmark.benchmark

     Benchmark                     Mode  Cnt  Score   Error  Units
     TranscodeBenchmark.benchmark  avgt   10  1.098 ± 0.018 s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
     TranscodeBenchmark.benchmark  avgt   10  1.129 ± 0.026 s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UseG1GC
     TranscodeBenchmark.benchmark  avgt   10  1.083 ± 0.031 s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
     TranscodeBenchmark.benchmark  avgt   10  5.720 ± 0.219 s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
```

I've also run it with the fastdebug build while recording with the 
`perf` profiler.
Here is a screenshot from the flamegraph: https://imgur.com/3v2RDqt

Based on this, most of the extra CPU time is spent spinning on the heap 
lock in order to pin the memory area while the gzip code calls into 
native code.
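
To make the suspected hot path a bit more concrete, here is a small sketch of where the unzip code crosses into native code. Every `Inflater.inflate(byte[])` call hands Java byte arrays to native zlib. My understanding, which is an assumption and not something confirmed by the profile alone, is that OpenJDK's zlib bindings access those arrays inside a JNI critical section, which is where a collector would have to pin the memory. The class and method names (`InflateCallsSketch`, `deflate`, `countInflateCalls`) are hypothetical; the sketch only counts how many such native calls a given output buffer size produces.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflateCallsSketch {

    // Compress raw bytes with zlib so there is something to inflate.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[512];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate the payload and return how many inflate() calls were needed.
    // Each call is one transition into native code with heap arrays as arguments.
    static int countInflateCalls(byte[] compressed, int bufferSize)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buffer = new byte[bufferSize];
        int calls = 0;
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            calls++;
            // Guard against truncated input so the loop cannot spin forever.
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break;
            }
        }
        inflater.end();
        return calls;
    }
}
```

With many threads inflating many small values concurrently, the number of these native transitions adds up quickly, so any per-call locking cost on the GC side would be multiplied accordingly.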

Do you consider this a bug?

Thanks in advance,
   axt