Benchmark scenario with high G1 performance degradation

Stefan Johansson stefan.johansson at oracle.com
Mon Apr 24 14:16:20 UTC 2017


Hi Jens,

On 2017-04-21 13:03, Jens Wilke wrote:
> Hi Stefan,
>
> On Thursday, 20 April 2017 19:59:30 ICT Stefan Johansson wrote:
>> Thanks for reaching out and for providing such a good step-by-step guide
>> on how to run the benchmark the same way you are.
> Thanks for the quick reply!
>
>> I've tried with CMS, G1 and Parallel, both with 10g and 20g heap, but so
>> far I can't reproduce your problems. It would be great if you could
>> provide us with some more information. For example GC-logs and the
>> result files. We might be able to dig something out of them.
> The logs from the measurement on my notebook for the first mail (see below) are available at the following link (valid for 30 days only):
>
> http://ovh.to/FzKbgrb
>
> What environment are you testing on?
I only did some quick testing on my desktop, which has 12 cores and 
hyper-threading, so the default is to use 18 parallel GC threads on my 
system.
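(For reference, that default follows from HotSpot's usual heuristic of using 
all hardware threads up to 8 and then 5/8 of the rest; a rough sketch, 
assuming 12 cores with hyper-threading show up as 24 hardware threads:)

    // Rough sketch of the default ParallelGCThreads heuristic (assumption:
    // all CPUs up to 8, plus 5/8 of the CPUs beyond 8).
    int hwThreads = 24;  // 12 cores with hyper-threading
    int parallelGCThreads = hwThreads <= 8
            ? hwThreads
            : 8 + (hwThreads - 8) * 5 / 8;  // 8 + 16 * 5/8 = 18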
>
> Please mind the core count. My gut feeling is that it could have something to do with the hash table arrays. When you test on a system that reports more than 8 cores, the allocated arrays will be smaller than in my case, since the cache segments its hash table.
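(As an aside, a hypothetical sketch of the segmentation effect described 
above; illustrative only, not cache2k's actual implementation:)

    // Hypothetical illustration only, not cache2k's actual code: if the cache
    // splits its hash table into roughly one segment per reported CPU, machines
    // with more cores get smaller per-segment arrays for the same capacity.
    int cpus = Runtime.getRuntime().availableProcessors();
    int segments = Integer.highestOneBit(Math.max(cpus, 1)); // round down to a power of two
    int entryCapacity = 2_000_000;                           // assumed cache size
    int slotsPerSegment = entryCapacity / segments;          // 8 CPUs -> 250000, 32 CPUs -> 62500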
>
>>   From my runs it looks like G1 is about 5-10% behind CMS and 10-15%
>> behind Parallel for both JDK 8 and 9.
> That seems okay.
>
> Actually, I'd like to publish my next benchmark results; however, I am stuck on this issue now. Benchmarking with CMS only doesn't really make sense at this point, and I don't want to be left in doubt about whether something is wrong with the setup.
>
>> I took a quick look at the blog as well, and there the system had 32GB
>> RAM and the runs were done with -Xmx10g. The system you describe here
>> only has 20GB RAM and you are using -Xmx20G. Is that correct, or has
>> there been a typo?
> My bad, sorry for the confusion. There was enough free memory and the RSS was only about 6GB, so the system was not swapping. I did play with the heap parameters to see whether they make a difference, but forgot to set them back to a reasonable value before sending the report.
>
> The effects on the isolated benchmark system with 32GB RAM and -Xmx10g or -Xmx20G are the same (see the blog article for parameters).
>
> The hot spot seems to be the function OtherRegionsTable::add_reference.
> When I run with -prof perfasm and Java 8u121, with and without G1, on the benchmark system I get this:
>
> .../jdk1.8.0_121/bin/java -jar jmh-suite/target/benchmarks.jar \\.RandomSequenceBenchmark -jvmArgs -server\ -Xmx20G\ -XX:BiasedLockingStartupDelay=0\ -verbose:gc\ -XX:+PrintGCDetails -f 1 -wi 1 -w 20s  -i 1 -r 20s -t 4 -prof org.cache2k.benchmark.jmh.LinuxVmProfiler -prof org.cache2k.benchmark.jmh.MiscResultRecorderProfiler  -p cacheFactory=org.cache2k.benchmark.Cache2kFactory -rf json -rff result.json -prof perfasm
>
> ....[Hottest Methods (after inlining)]..............................................................
>   22.48%    6.61%         C2, level 4  org.cache2k.core.AbstractEviction::removeAllFromReplacementListOnEvict, version 897
>   21.06%    8.18%         C2, level 4  org.cache2k.core.HeapCache::insertNewEntry, version 913
>    9.38%    7.13%           libjvm.so  SpinPause
>    9.13%    9.54%         C2, level 4  org.cache2k.benchmark.jmh.suite.eviction.symmetrical.generated.RandomSequenceBenchmark_operation_jmhTest::operation_thrpt_jmhStub, version 873
>    5.00%    3.88%           libjvm.so  _ZN13InstanceKlass17oop_push_contentsEP18PSPromotionManagerP7oopDesc
>    4.54%    4.70%       perf-5104.map  [unknown]
>    3.57%    3.86%         C2, level 4  org.cache2k.core.AbstractEviction::removeFromHashWithoutListener, version 838
>    2.86%   15.40%           libjvm.so  _ZN13ObjectMonitor11NotRunnableEP6ThreadS1_
>    2.48%   12.53%           libjvm.so  _ZN13ObjectMonitor20TrySpin_VaryDurationEP6Thread
>    2.46%    1.72%         C2, level 4  org.cache2k.core.AbstractEviction::refillChunk, version 906
>    2.31%    3.33%           libjvm.so  _ZN18PSPromotionManager22copy_to_survivor_spaceILb0EEEP7oopDescS2_
>    2.24%    6.37%         C2, level 4  java.util.concurrent.locks.StampedLock::acquireRead, version 864
>    2.03%    2.89%           libjvm.so  _ZN18PSPromotionManager18drain_stacks_depthEb
>    1.44%    1.43%           libjvm.so  _ZN13ObjArrayKlass17oop_push_contentsEP18PSPromotionManagerP7oopDesc
>    1.29%    0.72%              kernel  [unknown]
>    1.03%    1.27%           libjvm.so  _ZN18CardTableExtension26scavenge_contents_parallelEP16ObjectStartArrayP12MutableSpaceP8HeapWordP18PSPromotionManagerjj
>    0.79%    1.53%         C2, level 4  java.util.concurrent.locks.StampedLock::acquireWrite, version 865
>    0.74%    4.21%        runtime stub  StubRoutines::SafeFetch32
>    0.71%    0.50%         C2, level 4  org.cache2k.core.ClockProPlusEviction::sumUpListHits, version 772
>    0.70%    0.39%        libc-2.19.so  __clock_gettime
>    3.76%    3.73%  <...other 147 warm methods...>
> ....................................................................................................
> 100.00%   99.93%  <totals>
>
> .../jdk1.8.0_121/bin/java -jar jmh-suite/target/benchmarks.jar \\.RandomSequenceBenchmark -jvmArgs -server\ -Xmx20G\ -XX:BiasedLockingStartupDelay=0\ -verbose:gc\ -XX:+PrintGCDetails\ -XX:+UseG1GC -f 1 -wi 1 -w 20s  -i 1 -r 20s -t 4 -prof org.cache2k.benchmark.jmh.LinuxVmProfiler -prof org.cache2k.benchmark.jmh.MiscResultRecorderProfiler  -p cacheFactory=org.cache2k.benchmark.Cache2kFactory -rf json -rff result.json -prof perfasm
>
> ....[Hottest Methods (after inlining)]..............................................................
>   49.11%   41.16%           libjvm.so  _ZN17OtherRegionsTable13add_referenceEPvi
>   10.25%    3.37%         C2, level 4  org.cache2k.core.ClockProPlusEviction::removeFromReplacementListOnEvict, version 883
>    4.93%    1.43%         C2, level 4  org.cache2k.core.SegmentedEviction::submitWithoutEviction, version 694
>    4.31%    5.89%           libjvm.so  _ZN29G1UpdateRSOrPushRefOopClosure6do_oopEPj
>    3.18%    4.17%           libjvm.so  _ZN13ObjArrayKlass20oop_oop_iterate_nv_mEP7oopDescP24FilterOutOfRegionClosure9MemRegion
>    3.17%    3.00%           libjvm.so  _ZN29G1BlockOffsetArrayContigSpace18block_start_unsafeEPKv
>    2.95%    3.16%       perf-5226.map  [unknown]
>    2.19%    1.00%         C2, level 4  org.cache2k.benchmark.Cache2kFactory$1::getIfPresent, version 892
>    1.58%    1.50%           libjvm.so  _ZN8G1RemSet11refine_cardEPajb
>    1.42%    5.02%           libjvm.so  _ZNK10HeapRegion12block_is_objEPK8HeapWord
>    1.41%    3.31%           libjvm.so  _ZN10HeapRegion32oops_on_card_seq_iterate_carefulE9MemRegionP24FilterOutOfRegionClosurebPa
>    1.13%    3.05%           libjvm.so  _ZN13InstanceKlass18oop_oop_iterate_nvEP7oopDescP24FilterOutOfRegionClosure
>    0.98%    0.51%           libjvm.so  _ZN14G1HotCardCache6insertEPa
>    0.89%    4.27%           libjvm.so  _ZN13ObjectMonitor11NotRunnableEP6ThreadS1_
>    0.85%    1.17%         C2, level 4  org.cache2k.core.HeapCache::insertNewEntry, version 899
>    0.74%    3.59%           libjvm.so  _ZN13ObjectMonitor20TrySpin_VaryDurationEP6Thread
>    0.74%    0.57%           libjvm.so  _ZN20G1ParScanThreadState10trim_queueEv
>    0.70%    0.70%         C2, level 4  org.cache2k.core.Hash2::remove, version 864
>    0.69%    0.81%         C2, level 4  org.cache2k.core.ClockProPlusEviction::findEvictionCandidate, version 906
>    0.65%    1.59%         C2, level 4  org.cache2k.benchmark.jmh.suite.eviction.symmetrical.generated.RandomSequenceBenchmark_operation_jmhTest::operation_thrpt_jmhStub, version 857
>    8.14%   10.65%  <...other 331 warm methods...>
> ....................................................................................................
> 100.00%   99.91%  <totals>
As I mentioned in my reply to your other mail, these calls are caused by 
region-to-region pointers in G1. Adding those references to the remembered 
sets can happen either during a safepoint or concurrently. Looking at your 
profile, most of the calls seem to come from the concurrent path, and since 
your system has few cores, having the concurrent refinement threads do a 
lot of work has a bigger impact on the overall performance.
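
To illustrate what drives those calls (a minimal sketch, assumed from the 
benchmark's description; the class and field names are made up, not the 
benchmark's actual code): every store of a reference that crosses region 
boundaries makes the G1 post-write barrier dirty a card, and the concurrent 
refinement threads then scan those cards and record the resulting 
region-to-region links, which is where OtherRegionsTable::add_reference 
shows up.

    // Minimal sketch of the access pattern, assumed from the benchmark's
    // description; names are illustrative only.
    class CrossRegionPressure {
        // Large, long-lived array: after a while it lives in old regions.
        final Object[] table = new Object[2_000_000];

        void put(int slot, Object key, Object value) {
            // The new entry is allocated in a young region, so this store is a
            // cross-region reference. The G1 post-write barrier dirties the card
            // covering table[slot]; a concurrent refinement thread later scans
            // that card and adds the link to the target region's remembered set
            // via OtherRegionsTable::add_reference.
            table[slot] = new Object[] { key, value };
        }
    }

On a machine with only a few cores those refinement threads compete directly 
with the four benchmark threads for CPU, which fits the profile above. One 
way to probe this would be to limit them, for example with 
-XX:G1ConcRefinementThreads=1, and see how the add_reference share and the 
throughput change.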

Stefan
> Best,
>
> Jens
>


