Shenandoah performance problem

Thu Oct 17 07:13:55 UTC 2019

Hi Attila,

Yes, this looks like a performance problem that we shall look at. :-)
Thanks for reporting!

Can you quickly try this with -XX:-CriticalJNINatives ?

Thanks,
Roman

> Hi!
> 
> I'm comparing the performance of different GC implementations against a
> low-latency project. During the initialization of the project, we load a
> bunch of key-value pairs from memcached to warm up the local instance
> cache. The values in memcached are serialized and zipped, so we are
> unzipping and deserializing them after they are loaded. In order to
> utilize all cores we use a ThreadPoolExecutor to do this operation using
> multiple threads.
> 
> One interesting thing I noticed is that the CPU time burned by the
> transcode pool during startup is significantly higher with Shenandoah
> than with the other GC algorithms:
> 
> CMS: ~29 sec
> ZGC: ~36 8sec
> Shenandoah: ~270 sec
> 
> Since the code is quite complex, I've tried to simulate roughly what is
> happening by creating a microbenchmark.
> You can fetch it from here: https://github.com/axt/jmh-unzip-mt
> (Please note that the JDK path and JDK arguments are currently hardcoded
> in BenchmarkRunner; you will need to edit those manually to get it working.)
> 
> I've executed the benchmark against a freshly built JDK from here:
> $ hg clone http://hg.openjdk.java.net/shenandoah/jdk shenandoah
> 
> 
> The benchmark yields the following results:
> ```
> # JMH version: 1.21
> # VM version: JDK 14-internal, OpenJDK 64-Bit Server VM, 14-internal+0-adhoc.axt.shenandoah
> # VM invoker: /fast/shenandoah/build/linux-x86_64-server-release/images/jdk/bin/java
> # Warmup: 5 iterations, 10 s each
> # Measurement: 10 iterations, 10 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Average time, time/op
> # Benchmark: axt.benchmark.TranscodeBenchmark.benchmark
> 
>     Benchmark                     Mode  Cnt  Score   Error  Units
>     TranscodeBenchmark.benchmark  avgt   10  1.098 ± 0.018 s/op        # VM options: -Xms1024m -Xmx1024m -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
>     TranscodeBenchmark.benchmark  avgt   10  1.129 ± 0.026 s/op        # VM options: -Xms1024m -Xmx1024m -XX:+UseG1GC
>     TranscodeBenchmark.benchmark  avgt   10  1.083 ± 0.031 s/op        # VM options: -Xms1024m -Xmx1024m -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
>     TranscodeBenchmark.benchmark  avgt   10  5.720 ± 0.219 s/op        # VM options: -Xms1024m -Xmx1024m -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
> ```
> 
> I've also executed it with the fastdebug build, while creating a
> recording with the `perf` profiler.
> Here is a screenshot from the flamegraph: https://imgur.com/3v2RDqt
> 
> Based on this, most of the extra CPU time is burned spinning on the
> heap mutex in order to pin the memory area while gzip calls into native
> code.
> 
> Do you consider this a bug?
> 
> Thanks in advance,
>   axt
>
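
For readers without access to the linked repository, the workload described in the quoted message (gzip-compressed values decompressed in parallel on a thread pool) can be sketched roughly as below. This is only an illustrative sketch, not the code from jmh-unzip-mt: the class name UnzipSketch, the helper names, the payload and the thread count are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the transcode pool; not taken from the original benchmark.
public class UnzipSketch {

    // Compress a payload so that there is something to decompress later.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Decompress one value; java.util.zip delegates the inflate work to native zlib.
    static byte[] gunzip(byte[] zipped) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(zipped))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress all values on a fixed-size pool, preserving input order.
    static List<byte[]> unzipAll(List<byte[]> zipped, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] z : zipped) {
                Callable<byte[]> task = () -> gunzip(z);
                futures.add(pool.submit(task));
            }
            List<byte[]> out = new ArrayList<>();
            for (Future<byte[]> f : futures) {
                out.add(f.get());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> zipped = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            zipped.add(gzip(("value-" + i).getBytes(StandardCharsets.UTF_8)));
        }
        // Thread count is an arbitrary choice for the sketch.
        List<byte[]> values = unzipAll(zipped, Runtime.getRuntime().availableProcessors());
        System.out.println("decompressed " + values.size() + " values");
    }
}
```

Running the same class under each collector, using the VM options listed in the benchmark output above, would be one way to check whether a workload of this shape shows a comparable gap; the sketch makes no claim to reproduce the reported numbers.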