Shenandoah performance problem

Thu Oct 17 07:27:55 UTC 2019

I tried with -XX:-CriticalJNINatives but it did not help.

We'll look into this.

Roman

> Hi Attila,
> 
> Yes, this looks like a performance problem that we should look at. :-)
> Thanks for reporting!
> 
> Can you quickly try this with -XX:-CriticalJNINatives ?
> 
> Thanks,
> Roman
> 
>> Hi!
>>
>> I'm comparing the performance of different GC implementations against a
>> low-latency project. During the initialization of the project, we load a
>> bunch of key-value pairs from memcached to warm up the local instance
>> cache. The values in memcached are serialized and zipped, so we are
>> unzipping and deserializing them after they are loaded. In order to
>> utilize all cores we use a ThreadPoolExecutor to do this operation using
>> multiple threads.
>>
>> One interesting thing I noticed is that the CPU time burned by the
>> transcode pool during startup is significantly higher with Shenandoah
>> than with the other GC algorithms:
>>
>> CMS: ~29 sec
>> ZGC: ~36 8sec
>> Shenandoah: ~270 sec
>>
>> Since the code is quite complex, I've tried to simulate roughly what is
>> happening by creating a microbenchmark.
>> You can fetch it from here: https://github.com/axt/jmh-unzip-mt
>> (Please note that the JDK and the JDK arguments are currently hardcoded
>> in BechmarkRunner; you will need to edit those manually to get it working.)
>>
>> I've executed the benchmark against a freshly built JDK from here:
>> $ hg clone http://hg.openjdk.java.net/shenandoah/jdk shenandoah
>>
>>
>> The benchmark yields the following results:
>> ```
>> # JMH version: 1.21
>> # VM version: JDK 14-internal, OpenJDK 64-Bit Server VM, 14-internal+0-adhoc.axt.shenandoah
>> # VM invoker: /fast/shenandoah/build/linux-x86_64-server-release/images/jdk/bin/java
>> # Warmup: 5 iterations, 10 s each
>> # Measurement: 10 iterations, 10 s each
>> # Timeout: 10 min per iteration
>> # Threads: 1 thread, will synchronize iterations
>> # Benchmark mode: Average time, time/op
>> # Benchmark: axt.benchmark.TranscodeBenchmark.benchmark
>>
>>     Benchmark                     Mode  Cnt  Score   Error  Units
>>     TranscodeBenchmark.benchmark  avgt   10  1.098 ± 0.018   s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
>>     TranscodeBenchmark.benchmark  avgt   10  1.129 ± 0.026   s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UseG1GC
>>     TranscodeBenchmark.benchmark  avgt   10  1.083 ± 0.031   s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
>>     TranscodeBenchmark.benchmark  avgt   10  5.720 ± 0.219   s/op  # VM options: -Xms1024m -Xmx1024m -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
>> ```
>>
>> I've also executed it with the fastdebug build, while creating a
>> recording with the `perf` profiler.
>> Here is a screenshot from the flamegraph: https://imgur.com/3v2RDqt
>>
>> Based on this, most of the extra CPU time is burned spinning on the
>> heap mutex, which is taken to pin the memory area while gzip calls into
>> native code.
>>
>> Do you consider this a bug?
>>
>> Thanks in advance,
>>   axt
>>
>
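
For readers who do not want to fetch the repository, the workload described
in the original report (gzip-compressed values inflated concurrently on a
thread pool) can be approximated with a minimal sketch like the one below.
This is not the code from jmh-unzip-mt. The class name UnzipPoolSketch, the
64 KiB all-zero payload, the 1000 values, and the pool size are assumptions
made purely for illustration, and the deserialization step is left out. The
only relevant property is that GZIPInputStream goes through
java.util.zip.Inflater, whose inflate step runs in native code, which is
where the report locates the contention.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the reported workload, not the original benchmark.
public class UnzipPoolSketch {

    // Compress a byte array with gzip (used here only to produce test input).
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Inflate a gzip-compressed byte array; the inflate step runs in native code.
    public static byte[] gunzip(byte[] packed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Inflate all values on a fixed-size pool and return the total inflated size.
    public static long unzipAll(List<byte[]> packedValues, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (byte[] packed : packedValues) {
                tasks.add(() -> gunzip(packed).length);
            }
            long total = 0;
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        byte[] packed = gzip(new byte[64 * 1024]); // assumed payload: 64 KiB of zeros
        List<byte[]> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add(packed);
        }
        long start = System.nanoTime();
        long total = unzipAll(values, Runtime.getRuntime().availableProcessors());
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("inflated " + total + " bytes in " + millis + " ms");
    }
}
```

Running the same class once per collector, with the VM options listed in the
benchmark output above, gives a rough comparison. It is no substitute for the
JMH numbers and may not reproduce the slowdown at all, since the payload here
is trivially compressible.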