how reliable JMH's results are

Aleksey Shipilev shade at redhat.com
Mon Mar 25 23:54:54 UTC 2019


On 3/20/19 2:56 AM, Vicente Romero wrote:
> Today at the amber meeting there was a debate about why some JMH's results vary between executions.
> Well first of all these are experiments and in all experiments, there is a certain degree of
> variability that can't be controlled.

I said this before, and I feel it needs to be said again: what you have in your tables is *data*,
not *results*. The distinction is important: data becomes results when you analyze (explain) why the
data is the way it is. One of the ways is to use the profilers to figure out the reason for
performance difference, both positive and negative. It is usually pointless to stare at numbers and
wonder. It is much, much more effective to dissect the benchmark while it is running.

If you are fighting the variance, it is important to understand its source. You don't need to
run hundreds of iterations for minutes and look at summaries; it is more important to see how the
timings differ from iteration to iteration.

For example, take IntrinsicsBenchmark.testHash0002Integer, run it with "-w 1s -r 1s":

# JMH version: 1.21
# VM version: JDK 13-internal, OpenJDK 64-Bit Server VM, 13-internal+0-adhoc.shade.jdk-jdk
# VM invoker: /home/shade/trunks/jdk-jdk/build/linux-x86_64-server-release/images/jdk/bin/java
# VM options: <none>
# Warmup: 10 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.openjdk.IntrinsicsBenchmark.testHash0002Integer

# Run progress: 0.00% complete, ETA 00:00:30
# Fork: 1 of 1
# Warmup Iteration   1: 57204.334 ops/ms
# Warmup Iteration   2: 103416.259 ops/ms
# Warmup Iteration   3: 162178.931 ops/ms
# Warmup Iteration   4: 162476.308 ops/ms
# Warmup Iteration   5: 144305.570 ops/ms
# Warmup Iteration   6: 161836.788 ops/ms
# Warmup Iteration   7: 169351.425 ops/ms
# Warmup Iteration   8: 164579.568 ops/ms
# Warmup Iteration   9: 164465.457 ops/ms
# Warmup Iteration  10: 164625.845 ops/ms
Iteration   1: 164322.220 ops/ms
Iteration   2: 170006.896 ops/ms
Iteration   3: 164746.236 ops/ms
Iteration   4: 170108.658 ops/ms
Iteration   5: 164642.814 ops/ms
Iteration   6: 164799.820 ops/ms
Iteration   7: 170020.497 ops/ms
Iteration   8: 170213.564 ops/ms
Iteration   9: 164649.448 ops/ms
Iteration  10: 164413.162 ops/ms
Iteration  11: 160188.524 ops/ms
Iteration  12: 164399.563 ops/ms
Iteration  13: 170135.645 ops/ms
Iteration  14: 164643.840 ops/ms
Iteration  15: 170198.652 ops/ms
Iteration  16: 170071.623 ops/ms
Iteration  17: 170608.835 ops/ms
Iteration  18: 164249.869 ops/ms
Iteration  19: 164742.289 ops/ms
Iteration  20: 170289.832 ops/ms

Result "org.openjdk.IntrinsicsBenchmark.testHash0002Integer":
  166872.599 ±(99.9%) 2799.021 ops/ms [Average]
  (min, avg, max) = (160188.524, 166872.599, 170608.835), stdev = 3223.356
  CI (99.9%): [164073.579, 169671.620] (assumes normal distribution)
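As a sanity check, the ±2799.021 half-width in that result line can be reproduced from the stdev by hand. A minimal sketch, assuming the Student t quantile for a two-sided 99.9% interval at 19 degrees of freedom is ~3.883 (a hardcoded table value, not computed here):

```java
// Recomputing JMH's 99.9% CI half-width from the printed stdev:
// half-width = t * stdev / sqrt(n).
// Assumption: t(99.9% two-sided, df = 19) ~= 3.883, taken from a t-table.
public class CiCheck {
    public static void main(String[] args) {
        int n = 20;                  // measurement iterations
        double stdev = 3223.356;     // from the result block above
        double t = 3.883;            // assumed t quantile
        double margin = t * stdev / Math.sqrt(n);
        System.out.printf("+-%.1f ops/ms%n", margin); // ~2798.7, matching +-2799.021 up to t rounding
    }
}
```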

I spy with my little eye the bimodal result: either it is 170 Kops/ms, or 164 Kops/ms. The "dips" in
iteration performance are quite often a GC. Okay, -prof gc then! It shows a 1.3 GB/sec allocation rate.
Of course it is GC, for every reasonable heap size. We can go two ways from here: either make longer
iterations (which is tedious [*]), or tune down the heap to force more GC cycles in every
iteration, thus leveling out the dips.

([*] Philosophically, longer benchmarks are much worse, because they force you to stare at numbers
you've got from a day-long experiment, instead of doing tens and hundreds of much shorter
experiments, as you adjust, adjust, adjust the experimental setup.)
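For intuition on why shrinking the heap levels the dips, a back-of-envelope with the measured 1.3 GB/sec allocation rate. The eden sizes below are hypothetical round numbers for illustration, not measured values:

```java
// Back-of-envelope: at a 1.3 GB/sec allocation rate (from -prof gc),
// how often does a young GC hit, for an assumed eden size?
// The eden sizes are hypothetical examples, not measured.
public class GcPeriod {
    public static void main(String[] args) {
        double allocRate = 1.3e9;             // bytes/sec, from -prof gc
        double[] edens = { 2e9, 32e6 };       // ~2 GB (roomy default heap) vs ~32 MB (tiny heap)
        for (double eden : edens) {
            double period = eden / allocRate; // seconds between young GCs
            System.out.printf("eden = %.0f MB -> GC every %.3f s%n", eden / 1e6, period);
        }
    }
}
```

With a multi-GB eden, a collection lands only every second or two, so some 1-second iterations eat a pause and some don't — the bimodality above. With a tiny eden there are dozens of collections per iteration, and the GC cost is spread evenly across all of them.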

Okay, add --jvmArgs "-Xmx128m":

# JMH version: 1.21
# VM version: JDK 13-internal, OpenJDK 64-Bit Server VM, 13-internal+0-adhoc.shade.jdk-jdk
# VM invoker: /home/shade/trunks/jdk-jdk/build/linux-x86_64-server-release/images/jdk/bin/java
# VM options: -Xmx128m
# Warmup: 10 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.openjdk.IntrinsicsBenchmark.testHash0002Integer

# Run progress: 0.00% complete, ETA 00:00:30
# Fork: 1 of 1
# Warmup Iteration   1: 90518.942 ops/ms
# Warmup Iteration   2: 102680.621 ops/ms
# Warmup Iteration   3: 167249.769 ops/ms
# Warmup Iteration   4: 167161.628 ops/ms
# Warmup Iteration   5: 167234.602 ops/ms
# Warmup Iteration   6: 167525.645 ops/ms
# Warmup Iteration   7: 164963.326 ops/ms
# Warmup Iteration   8: 165397.347 ops/ms
# Warmup Iteration   9: 166592.091 ops/ms
# Warmup Iteration  10: 167274.903 ops/ms
Iteration   1: 167933.613 ops/ms
Iteration   2: 167545.883 ops/ms
Iteration   3: 167058.631 ops/ms
Iteration   4: 167141.522 ops/ms
Iteration   5: 166703.878 ops/ms
Iteration   6: 166629.714 ops/ms
Iteration   7: 167044.802 ops/ms
Iteration   8: 167431.836 ops/ms
Iteration   9: 167460.619 ops/ms
Iteration  10: 167366.268 ops/ms
Iteration  11: 167417.515 ops/ms
Iteration  12: 167343.442 ops/ms
Iteration  13: 167304.245 ops/ms
Iteration  14: 167340.101 ops/ms
Iteration  15: 167215.482 ops/ms
Iteration  16: 166505.524 ops/ms
Iteration  17: 165027.276 ops/ms
Iteration  18: 165409.731 ops/ms
Iteration  19: 165752.616 ops/ms
Iteration  20: 167030.566 ops/ms

Result "org.openjdk.IntrinsicsBenchmark.testHash0002Integer":
  166933.163 ±(99.9%) 649.583 ops/ms [Average]
  (min, avg, max) = (165027.276, 166933.163, 167933.613), stdev = 748.061
  CI (99.9%): [166283.580, 167582.746] (assumes normal distribution)


Voila. The mean error was 1.6% in the first run (already quite low), and now it is 0.4%.
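(The "mean error" here is the CI half-width relative to the mean; recomputing it from the two result blocks above:)

```java
// Relative error = CI half-width / mean, from the two result blocks above.
public class RelErr {
    public static void main(String[] args) {
        System.out.printf("%.2f%%%n", 100 * 2799.021 / 166872.599); // first run: ~1.68%
        System.out.printf("%.2f%%%n", 100 * 649.583 / 166933.163);  // -Xmx128m run: ~0.39%
    }
}
```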

But this is only single-invocation performance. There could be run-to-run variance lurking. So, we
need to run multiple forks and see what is up there. Adding "-f 5" gives me more iterations:

...

Result "org.openjdk.IntrinsicsBenchmark.testHash0002Integer":
  166341.783 ±(99.9%) 352.441 ops/ms [Average]
  (min, avg, max) = (161452.661, 166341.783, 168084.315), stdev = 1039.180
  CI (99.9%): [165989.342, 166694.223] (assumes normal distribution)

...and the mean error is even lower, at 0.2%.

This is, of course, on the system that is prepared for benchmarking: properly cooled, power
management quirks mitigated, process exclusivity either achieved by isolation or checked by
monitoring, etc.

The real fun part comes when we try to attach a profiler to the jdk13 build. For example, -prof
perfasm. Then, the performance on jdk13 dips to 80 Kops/ms! WTF. jdk12 is fine, though. Notice
anything strange in the jdk13 profile?
  http://cr.openjdk.java.net/~shade/amber/intr-bench-jdk12.perfasm
  http://cr.openjdk.java.net/~shade/amber/intr-bench-jdk13.perfasm

I see java.lang.Integer::hashCode is not inlined. What's worse, it looks like the method prolog
before Verified Entry Point is hot, which is very weird. (The joys of bleeding edge JVMs, huh!)
Bisecting shows it started with JDK-6996807, which curiously takes Arrays.hashCode(sigs), and maybe
it pollutes the profile enough for the VM to break bad:
  http://hg.openjdk.java.net/jdk/jdk/rev/b561ea19a7b9

Anyway, it seems you need to look at the perfasm/perfnorm output for the different tests. In
intr-bench-jdk12.perfasm above, the cost is clearly in the varargs creation of Object[]. Intrinsics have
to demonstrate that the allocation is gone and the generated code is as efficient as it reasonably gets.
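A schematic of that varargs cost; the method names here are hypothetical, for illustration only. The varargs overload allocates an Object[] on every call (and goes through boxed arguments, modulo the Integer cache), which is exactly the allocation an intrinsic would have to prove is gone:

```java
// Schematic of the cost visible in a varargs-based hashing call: the
// varargs overload allocates an Object[] per call; the specialized
// overload computes the same value with no allocation. Method names
// are hypothetical, for illustration only.
public class VarargsCost {
    static int hashVarargs(Object... parts) {   // allocates Object[parts.length] per call
        int h = 1;
        for (Object p : parts) h = 31 * h + p.hashCode();
        return h;
    }

    static int hashSpecialized(int a, int b) {  // same result, no allocation
        return 31 * (31 * 1 + Integer.hashCode(a)) + Integer.hashCode(b);
    }

    public static void main(String[] args) {
        System.out.println(hashVarargs(1, 2) == hashSpecialized(1, 2)); // true
    }
}
```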

-Aleksey


