Benchmark scenario with high G1 performance degradation

Thomas Schatzl thomas.schatzl at oracle.com
Tue May 16 15:35:45 UTC 2017


Hi Jens,

  sorry for the late reply...

On Thu, 2017-04-20 at 09:49 +0700, Jens Wilke wrote:
> Hello,
> 
> I am currently benchmarking different in-process caching libraries.
> In some benchmark scenarios I found very "odd" results when using
> the G1 garbage collector. In a particular benchmark scenario the
> performance (as measured in ops/s of the particular benchmark) 
> drops to about 30% when compared to CMS, typically I expect (and
> observe) only a performance degradation to around 80% of  the
> performance with CMS.
> 
> In the blog post is a little bit more background of what I am doing:
> https://cruftex.net/2017/03/28/The-6-Memory-Metrics-You-Should-Track-
> in-Your-Java-Benchmarks.html

Some quirks about the benchmark setup:

- After every iteration the benchmark does not try to get the VM back
to some kind of initial state; it just continues running.
This means that, depending on when GCs occur, the performance results
can vary a lot (and they do): e.g. I am noticing a consistent score
error (deviation?) of +/-50% for Parallel GC...

I would recommend trying to make longer iterations to improve
repeatability.

Particularly Parallel GC, with its regular full GCs, suffers a lot
from this: depending on whether and when full GCs occur, the results
vary from better than CMS to much worse.

Depending on how many of these full GCs occur (i.e. how much the
default policies of the different collectors expand the heap and when
ergonomics triggers the collections - and the policies are quite
different), the impact of the various policies is larger than the
impact of the GC algorithm itself.

E.g. Parallel GC young GCs seem to be faster than CMS's in this case;
the difference is that in this short-running benchmark CMS does not
run into a concurrent mode failure (full GC). It may also be that CMS
would never run into one, as there is no fragmentation.

CMS happens to have more stable behavior because its (still 400-500 ms)
pauses in this case are young-only pauses.

- It is also unclear to me what you are measuring: probably the
out-of-box experience, which is a reasonable setup (I guess), but then
I wonder why -XX:BiasedLockingStartupDelay=0 is among the options?

- The benchmark as given also does not do any warmup iterations.
Looking at the raw data for the various collectors I would say that at
least the first three iterations give values that are heavily
influenced by warmup (in this case benefitting CMS).

(I am aware that in that blog entry you used two warmup iterations)
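
Concretely, applied to the JMH invocation from your mail, that would
mean something like the following (the exact -wi/-r values are just a
suggestion, not tuned):

```shell
# Same benchmark invocation, but with 3 warmup iterations (-wi 3) and
# 60 s measurement iterations (-r 60s) to smooth out GC timing noise.
java -jar jmh-suite/target/benchmarks.jar \\.RandomSequenceBenchmark \
  -jvmArgs -server\ -Xmx20G\ -verbose:gc\ -XX:+PrintGCDetails \
  -f 1 -wi 3 -i 10 -r 60s -t 4 \
  -p cacheFactory=org.cache2k.benchmark.Cache2kFactory
```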

> For the particular scenario I have made the following observations:
> 
> - Java 8 VM 25.131-b11: Average benchmark performance throughput is
> 2098524 ops/s
> - Java 9 VM 9-EA+165: Average benchmark performance throughput is 
>  566699 ops/s
> - Java 8 VM 25.131-b11: Throughput is quite steady
> - Java 9 VM 9-EA+165: Throughput has big variations and the tendency
> to decrease
> - Java 8 VM 25.131-b11: VmRSS, as reported by Linux, grows to 4.2 GB 
> - Java 9 VM 9-EA+165: VmRSS, as reported by Linux, grows to 6.3 GB

Which collectors do you compare in these results?

> - Java 9 VM 9-EA+165: Profiling shows that 44.19% of CPU cycles is
> spent in OtherRegionsTable::add_reference (for Java 8 G1 it is
> similar)
> 
> And less quantified:
> 
> - With Java 8 G1 it seems even worse
> - Scenarios with smaller heap/cache sizes don't show the high 
> performance drop when comparing CMS and G1

In this case the caches seem to be the cause for the many old->young
references, as they are humongous objects (larger than half a region).
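As a rough illustration (this only approximates G1's ergonomics; the
real region sizing code also takes the initial heap size into
account), the default region size and the "humongous" threshold can be
sketched as:

```java
// Sketch of G1's region sizing and humongous threshold; class and
// method names are made up for illustration, not HotSpot's actual code.
public class HumongousCheck {

    // Approximate default region size: aim for about 2048 regions,
    // rounded to a power of two and clamped to [1 MB, 32 MB].
    static long defaultRegionSize(long heapBytes) {
        long target = heapBytes / 2048;
        long size = 1L << 20;                       // 1 MB minimum
        while (size < target && size < (32L << 20)) {
            size <<= 1;                             // next power of two
        }
        return size;
    }

    // An object is "humongous" when it occupies at least half a region.
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = defaultRegionSize(20L << 30); // -Xmx20G
        System.out.println("region size MB: " + (region >> 20));
        System.out.println("8 MB array humongous: "
                + isHumongous(8L << 20, region));
    }
}
```

So with -Xmx20G the regions end up at 16 MB, and any backing array of
8 MB or more (roughly a million-entry Object[]) is allocated as a
humongous object.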

I did some tests with a largish young gen, trying to keep stuff in
young gen as long as possible, and at least G1 improved...
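For experimenting in that direction, something like the following
could be tried (G1NewSizePercent/G1MaxNewSizePercent are experimental
flags, defaults 5%/60%; the 40% value here is just a starting point,
not a recommendation):

```shell
# Give G1 a larger minimum young gen so objects stay in young longer.
java -XX:+UseG1GC \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1NewSizePercent=40 -XX:G1MaxNewSizePercent=60 \
     -jar jmh-suite/target/benchmarks.jar ...
```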

> - Java 9 VM 9-EA+165 with the options -XX:+UseParallelGC -XX:
> +UseParallelOldGC, seems to have 50% performance of Java 8 and higher
> memory consumption
>   (What are the correct parameters to restore the old default
> behavior?)

I cannot reproduce a regression for Parallel GC on a similar machine
(spec below). I did not try on larger machines, since my colleague
already indicated that G1 does not show this extreme behavior there.

> - The overall GC activity and time spent on GC is quite low

That may be the case (I am not completely sure; I did not measure it),
but GC times highly influence the score. The scores of e.g. Parallel
GC drop significantly every time there is a full GC during an
iteration. Other than that, Parallel GC young GC times are better than
CMS's, which is imho substantiated by Parallel GC achieving higher
scores when there is no full GC during an iteration.

> To reproduce the measurements:
> 
> Hardware:
> Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz, 1x4GB, 1x16GB @ DDR3
> 1600MHz, 2 cores with hyperthreading enabled
> OS/Kernel:
> Linux version 4.4.0-72-generic (buildd at lcy01-17) (gcc version 5.4.0
> 20160609 
> (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #93-Ubuntu SMP Fri Mar 31 14:07:41
> UTC 2017

Running on an i5-3320M at 2.60GHz with 8GB of RAM - I had to drop -Xmx
to 6GB.

Using java8-b92 vs. java9-b169. GC-wise (particularly CMS and Parallel)
there should not be much difference between these and yours.

Just checking: did you make sure that external factors that typically
make a mess of benchmarking on a laptop (but are not limited to it),
like turbo boost/THP/etc., were disabled?

Note that the four threads on these laptop machines are not full
cores; also, in the test on the website you seem to specifically
disable hyperthreading.

> git clone https://github.com/headissue/cache2k-benchmark.git
> cd cache2k-benchmark
> git checkout d68d7608f18ed6c5a10671f6dd3c48f76afdf0a8
> mvn -DskipTests package
> java -jar jmh-suite/target/benchmarks.jar \\.RandomSequenceBenchmark
> -jvmArgs 
> -server\ -Xmx20G\ -XX:BiasedLockingStartupDelay=0\ -verbose:gc\ -XX:
> +PrintGCDetails -f 1 -wi 0 -i 10 -r 20s -t 4 -prof 
> org.cache2k.benchmark.jmh.LinuxVmProfiler  -p 
> cacheFactory=org.cache2k.benchmark.Cache2kFactory -rf json -rff
> result.json
> 
> When not on Linux, strip "-prof
> org.cache2k.benchmark.jmh.LinuxVmProfiler".
> 
> I have the feeling this could be worth a closer look.
> If there are any questions or things I can help with, let me know.
> 
> I would be interested to know whether there is something that I can
> change in the code to avoid triggering this behavior.

The benchmark creates a lot of references from old to young gen, which
impacts pause times, and so the scores, a lot. So anything that
reduces these references in the application will help (for all
collectors, though G1 specifically benefits the most).

Some tests indicated that early reclamation of large object arrays
(belonging to some hash table, I guess) [3], if it were available,
would help; this is specific to G1. By doing so, it seems G1 could
remove a lot of stale references (in the old array, apparently left
behind after expanding) from the remembered sets, which otherwise need
to be updated all the time.

So maybe, if the code did fewer expansions of that hash table, the
situation would improve? It may also be worth a try to clear out (null
out) the references in the old array, if that is possible and not done
yet.
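
To illustrate the null-out suggestion (a generic sketch, not cache2k's
actual code): when a hash table migrates entries into a larger array,
clearing the slots of the old array as it goes removes the stale
references that would otherwise sit in the remembered sets until the
old array dies:

```java
import java.util.Arrays;

// Generic sketch of the "null out the old array" idea.
public class TableExpansion {

    static Object[] expand(Object[] old) {
        Object[] bigger = new Object[old.length * 2];
        for (int i = 0; i < old.length; i++) {
            bigger[i] = old[i];
            // Clearing the slot means the (possibly still reachable) old
            // array no longer holds references into other regions, so G1
            // does not have to keep remembered set entries for them alive.
            old[i] = null;
        }
        return bigger;
    }

    public static void main(String[] args) {
        Object[] old = { "a", "b" };
        Object[] grown = expand(old);
        System.out.println(Arrays.toString(old));    // [null, null]
        System.out.println(Arrays.toString(grown));  // [a, b, null, null]
    }
}
```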

Other than that, a lot of time is spent managing the remembered sets
concurrently (e.g. the time spent in add_reference() you noticed).
There are multiple options to improve this. Some you could try right
away, e.g. increasing the heap region size (even to 32M) or
potentially moving concurrent remembered set work into the pause (see
the tuning guide [0]); others need implementation work, e.g. the
rebuild-remembered-sets-on-the-fly idea [1] or the throughput
remembered set proposed earlier [2].
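
For reference, the "right away" options could look like this (flag
names as of JDK 8/9; the refinement settings follow the tuning guide
[0] and push remembered set update work from the concurrent refinement
threads into the GC pause, trading pause time for throughput):

```shell
# Larger regions raise the humongous threshold and shrink the
# remembered sets; disabling adaptive refinement with a huge green zone
# defers remembered set updates into the pause.
java -XX:+UseG1GC \
     -XX:G1HeapRegionSize=32m \
     -XX:-G1UseAdaptiveConcRefinement \
     -XX:G1ConcRefinementGreenZone=2G \
     -XX:G1ConcRefinementThreads=0 \
     -jar jmh-suite/target/benchmarks.jar ...
```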

Thanks,
  Thomas

[0] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#JSGCT-GUID-70E3F150-B68E-4787-BBF1-F91315AC9AB9
[1] https://bugs.openjdk.java.net/browse/JDK-8180415
[2] http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2016-November/019215.html
[3] https://bugs.openjdk.java.net/browse/JDK-8048180


