how reliable JMH's results are

Vicente Romero vicente.romero at oracle.com
Tue Mar 26 01:21:36 UTC 2019


thanks Aleksey, very interesting,

Vicente

On 3/25/19 7:54 PM, Aleksey Shipilev wrote:
> On 3/20/19 2:56 AM, Vicente Romero wrote:
>> Today at the amber meeting there was a debate about why some JMH results vary between executions.
>> Well, first of all, these are experiments, and in all experiments there is a certain degree of
>> variability that can't be controlled.
> I said this before, and I feel it needs to be said again: what you have in your tables is *data*,
> not *results*. The distinction is important: data becomes results when you analyze (explain) why the
> data is the way it is. One of the ways is to use the profilers to figure out the reason for a
> performance difference, both positive and negative. It is usually pointless to stare at numbers and
> wonder. It is much, much more effective to dissect the benchmark while it is running.
>
> If you are fighting the variance, it is important to understand its source. You don't need to
> run hundreds of iterations for minutes and look at summaries; it is more important to see how the
> timings differ from iteration to iteration.
>
> For example, take IntrinsicsBenchmark.testHash0002Integer, run it with "-w 1s -r 1s":
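>
> (For reference, a full invocation could look like the sketch below; the jar path is an
> assumption, while -w/-r set the per-iteration warmup and measurement times:)

```shell
# Hypothetical benchmark jar location; -w 1s / -r 1s make each warmup and
# measurement iteration one second long, for fast experiment turnaround.
java -jar target/benchmarks.jar org.openjdk.IntrinsicsBenchmark.testHash0002Integer -w 1s -r 1s
```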
>
> # JMH version: 1.21
> # VM version: JDK 13-internal, OpenJDK 64-Bit Server VM, 13-internal+0-adhoc.shade.jdk-jdk
> # VM invoker: /home/shade/trunks/jdk-jdk/build/linux-x86_64-server-release/images/jdk/bin/java
> # VM options: <none>
> # Warmup: 10 iterations, 1 s each
> # Measurement: 20 iterations, 1 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Throughput, ops/time
> # Benchmark: org.openjdk.IntrinsicsBenchmark.testHash0002Integer
>
> # Run progress: 0.00% complete, ETA 00:00:30
> # Fork: 1 of 1
> # Warmup Iteration   1: 57204.334 ops/ms
> # Warmup Iteration   2: 103416.259 ops/ms
> # Warmup Iteration   3: 162178.931 ops/ms
> # Warmup Iteration   4: 162476.308 ops/ms
> # Warmup Iteration   5: 144305.570 ops/ms
> # Warmup Iteration   6: 161836.788 ops/ms
> # Warmup Iteration   7: 169351.425 ops/ms
> # Warmup Iteration   8: 164579.568 ops/ms
> # Warmup Iteration   9: 164465.457 ops/ms
> # Warmup Iteration  10: 164625.845 ops/ms
> Iteration   1: 164322.220 ops/ms
> Iteration   2: 170006.896 ops/ms
> Iteration   3: 164746.236 ops/ms
> Iteration   4: 170108.658 ops/ms
> Iteration   5: 164642.814 ops/ms
> Iteration   6: 164799.820 ops/ms
> Iteration   7: 170020.497 ops/ms
> Iteration   8: 170213.564 ops/ms
> Iteration   9: 164649.448 ops/ms
> Iteration  10: 164413.162 ops/ms
> Iteration  11: 160188.524 ops/ms
> Iteration  12: 164399.563 ops/ms
> Iteration  13: 170135.645 ops/ms
> Iteration  14: 164643.840 ops/ms
> Iteration  15: 170198.652 ops/ms
> Iteration  16: 170071.623 ops/ms
> Iteration  17: 170608.835 ops/ms
> Iteration  18: 164249.869 ops/ms
> Iteration  19: 164742.289 ops/ms
> Iteration  20: 170289.832 ops/ms
>
> Result "org.openjdk.IntrinsicsBenchmark.testHash0002Integer":
>    166872.599 ±(99.9%) 2799.021 ops/ms [Average]
>    (min, avg, max) = (160188.524, 166872.599, 170608.835), stdev = 3223.356
>    CI (99.9%): [164073.579, 169671.620] (assumes normal distribution)
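>
> (As a sanity check, the mean and stdev in that summary can be recomputed from the twenty
> iteration values; the ±(99.9%) half-interval additionally multiplies the standard error by a
> Student-t quantile, which the sketch below leaves out:)

```java
// Recompute JMH's summary line from the twenty iteration values above.
public class SummaryCheck {
    public static final double[] DATA = {
        164322.220, 170006.896, 164746.236, 170108.658, 164642.814,
        164799.820, 170020.497, 170213.564, 164649.448, 164413.162,
        160188.524, 164399.563, 170135.645, 164643.840, 170198.652,
        170071.623, 170608.835, 164249.869, 164742.289, 170289.832
    };

    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (n - 1 divisor), which is what JMH reports.
    public static double stdev(double[] xs) {
        double m = mean(xs), ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    public static void main(String[] args) {
        // matches "166872.599 ... stdev = 3223.356" up to print rounding
        System.out.printf("mean = %.3f, stdev = %.3f%n", mean(DATA), stdev(DATA));
    }
}
```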
>
> I spy with my little eye the bimodal result: either it is 170 Kops/ms, or 164 Kops/ms. The "dips" in
> iteration performance are quite often caused by GC. Okay, -prof gc then! It shows a 1.3 GB/sec
> allocation rate. Of course it is GC, for every reasonable heap size. We can go two ways from here:
> either make longer iterations (which is tedious [*]), or tune down the heap to force more GC cycles
> into every iteration, thus leveling out the dips.
>
> ([*] Philosophically, longer benchmarks are much worse, because they force you to stare at numbers
> you've got from a day-long experiment, instead of doing tens and hundreds of much shorter
> experiments, as you adjust, adjust, adjust the experimental setup.)
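>
> (The back-of-envelope arithmetic behind the heap-tuning option, using the 1.3 GB/sec figure from
> -prof gc; treating the whole 128 MB heap as allocatable space is a simplifying assumption:)

```java
// Why a small heap evens out per-iteration cost: GC cycles become frequent
// relative to the 1 s iteration length, so every iteration pays a similar
// GC tax instead of an occasional large one.
public class GcCadence {
    public static double secondsBetweenGCs(double allocGBperSec, double heapGB) {
        return heapGB / allocGBperSec;
    }

    public static void main(String[] args) {
        double gap = secondsBetweenGCs(1.3, 0.128); // ~0.1 s between collections
        double perIteration = 1.0 / gap;            // ~10 GCs in each 1 s iteration
        System.out.printf("~%.2f s between GCs, ~%.0f GCs/iteration%n", gap, perIteration);
    }
}
```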
>
> Okay, add --jvmArgs "-Xmx128m":
>
> # JMH version: 1.21
> # VM version: JDK 13-internal, OpenJDK 64-Bit Server VM, 13-internal+0-adhoc.shade.jdk-jdk
> # VM invoker: /home/shade/trunks/jdk-jdk/build/linux-x86_64-server-release/images/jdk/bin/java
> # VM options: -Xmx128m
> # Warmup: 10 iterations, 1 s each
> # Measurement: 20 iterations, 1 s each
> # Timeout: 10 min per iteration
> # Threads: 1 thread, will synchronize iterations
> # Benchmark mode: Throughput, ops/time
> # Benchmark: org.openjdk.IntrinsicsBenchmark.testHash0002Integer
>
> # Run progress: 0.00% complete, ETA 00:00:30
> # Fork: 1 of 1
> # Warmup Iteration   1: 90518.942 ops/ms
> # Warmup Iteration   2: 102680.621 ops/ms
> # Warmup Iteration   3: 167249.769 ops/ms
> # Warmup Iteration   4: 167161.628 ops/ms
> # Warmup Iteration   5: 167234.602 ops/ms
> # Warmup Iteration   6: 167525.645 ops/ms
> # Warmup Iteration   7: 164963.326 ops/ms
> # Warmup Iteration   8: 165397.347 ops/ms
> # Warmup Iteration   9: 166592.091 ops/ms
> # Warmup Iteration  10: 167274.903 ops/ms
> Iteration   1: 167933.613 ops/ms
> Iteration   2: 167545.883 ops/ms
> Iteration   3: 167058.631 ops/ms
> Iteration   4: 167141.522 ops/ms
> Iteration   5: 166703.878 ops/ms
> Iteration   6: 166629.714 ops/ms
> Iteration   7: 167044.802 ops/ms
> Iteration   8: 167431.836 ops/ms
> Iteration   9: 167460.619 ops/ms
> Iteration  10: 167366.268 ops/ms
> Iteration  11: 167417.515 ops/ms
> Iteration  12: 167343.442 ops/ms
> Iteration  13: 167304.245 ops/ms
> Iteration  14: 167340.101 ops/ms
> Iteration  15: 167215.482 ops/ms
> Iteration  16: 166505.524 ops/ms
> Iteration  17: 165027.276 ops/ms
> Iteration  18: 165409.731 ops/ms
> Iteration  19: 165752.616 ops/ms
> Iteration  20: 167030.566 ops/ms
>
> Result "org.openjdk.IntrinsicsBenchmark.testHash0002Integer":
>    166933.163 ±(99.9%) 649.583 ops/ms [Average]
>    (min, avg, max) = (165027.276, 166933.163, 167933.613), stdev = 748.061
>    CI (99.9%): [166283.580, 167582.746] (assumes normal distribution)
>
>
> Voila. The mean error was 1.6% in the first run (already quite low), and now it is 0.4%.
>
> But this is only single-invocation performance. There could be run-to-run variance lurking. So we
> need multiple forks to see what is up there. Adding "-f 5" gives me more samples:
>
> ...
>
> Result "org.openjdk.IntrinsicsBenchmark.testHash0002Integer":
>    166341.783 ±(99.9%) 352.441 ops/ms [Average]
>    (min, avg, max) = (161452.661, 166341.783, 168084.315), stdev = 1039.180
>    CI (99.9%): [165989.342, 166694.223] (assumes normal distribution)
>
> ...and the mean error is even lower, at 0.2%.
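>
> ("Mean error" here is the 99.9% half-interval as a fraction of the mean; the arithmetic for the
> three runs, with the figures rounded in the prose:)

```java
// Relative error = 99.9% half-interval / mean, for the three runs above.
public class RelError {
    public static double pct(double halfInterval, double mean) {
        return 100.0 * halfInterval / mean;
    }

    public static void main(String[] args) {
        System.out.printf("default heap: %.2f%%%n", pct(2799.021, 166872.599)); // ~1.68%
        System.out.printf("-Xmx128m:     %.2f%%%n", pct(649.583, 166933.163));  // ~0.39%
        System.out.printf("-f 5:         %.2f%%%n", pct(352.441, 166341.783));  // ~0.21%
    }
}
```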
>
> This is, of course, on the system that is prepared for benchmarking: properly cooled, power
> management quirks mitigated, process exclusivity either achieved by isolation or checked by
> monitoring, etc.
>
> The real fun part comes when we try to attach a profiler to the jdk13 build. For example, -prof
> perfasm. Then, the performance on jdk13 dips to 80 Kops/ms! WTF. jdk12 is fine, though. Notice
> anything strange in the jdk13 profile?
>    http://cr.openjdk.java.net/~shade/amber/intr-bench-jdk12.perfasm
>    http://cr.openjdk.java.net/~shade/amber/intr-bench-jdk13.perfasm
>
> I see java.lang.Integer::hashCode is not inlined. What's worse, it looks like the method prolog
> before the Verified Entry Point is hot, which is very weird. (The joys of bleeding-edge JVMs, huh!)
> Bisecting shows it started with JDK-6996807, which curiously uses Arrays.hashCode(sigs), and maybe
> that pollutes the profile enough for the VM to break bad:
>    http://hg.openjdk.java.net/jdk/jdk/rev/b561ea19a7b9
>
> Anyway, it seems you need to see what shows up in perfasm/perfnorm for the different tests. In
> intr-bench-jdk12.perfasm above, the cost is clearly in the varargs creation of Object[]. Intrinsics
> have to demonstrate that the allocation is gone and the generated code is as efficient as it
> reasonably gets.
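>
> (For illustration, since the body of testHash0002Integer is not shown here, its shape is an
> assumption: a varargs-based hash such as Objects.hash allocates an Object[] and boxes its int
> arguments on every call, which is exactly the allocation an intrinsic would need to prove away;
> a hand-rolled combination computes the same value without allocating:)

```java
import java.util.Objects;

public class HashShape {
    // Varargs path: each call allocates an Object[] (and boxes the ints),
    // then delegates to Arrays.hashCode.
    public static int varargsHash(int a, int b) {
        return Objects.hash(a, b);
    }

    // Allocation-free equivalent of Arrays.hashCode(new Object[]{a, b}).
    public static int manualHash(int a, int b) {
        int result = 1;
        result = 31 * result + Integer.hashCode(a);
        result = 31 * result + Integer.hashCode(b);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(varargsHash(1, 2) == manualHash(1, 2)); // true
    }
}
```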
>
> -Aleksey
>



More information about the amber-dev mailing list