how reliable JMH's results are
Vicente Romero
vicente.romero at oracle.com
Wed Mar 20 01:56:35 UTC 2019
Hi,
Today at the amber meeting there was a debate about why some JMH
results vary between executions. Well, first of all, these are
experiments, and in all experiments there is a certain degree of
variability that can't be controlled.
So I took the latest version of our beloved IntrinsicsBenchmark class,
see [1], and made some changes to it: basically, every field has been
initialized with a different literal, as proposed by Maurizio. Then I
took only the `int`-oriented tests, the first 9 in the benchmark, and
executed them a number of times. All executions were done with vanilla
JDK 13, so no intrinsics at play here!
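The actual benchmark is at [1]; as a rough sketch of the kind of
workload the `int`-oriented tests exercise (the class, field names, and
literal values here are illustrative, not the real IntrinsicsBenchmark
code, and the JMH annotation plumbing is omitted):

```java
import java.util.Objects;

// Illustrative sketch only: the real benchmark at [1] uses JMH
// annotations (@Benchmark, @State, ...). The point of giving every
// field a *different* literal is to keep the JIT from constant-folding
// or sharing values across tests.
public class HashSketch {
    // Distinct literals per field, as proposed by Maurizio.
    static int i1 = 101, i2 = 202, i3 = 303, i4 = 404;

    // A small-arity variant, where run-to-run variance was largest.
    static int hash2Ints() {
        return Objects.hash(i1, i2);
    }

    // A "testHash0040Int"-style workload hashing four int arguments.
    static int hash4Ints() {
        return Objects.hash(i1, i2, i3, i4);
    }

    public static void main(String[] args) {
        System.out.println(hash2Ints() + " " + hash4Ints());
    }
}
```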
First I executed the tests 3 times under the same conditions: 3 warm-up
iterations and 5 measurement iterations. See the wild differences in
the score and error columns [2], mostly for the tests with the smallest
number of arguments. I have no explanation for this. Then I started
playing with both parameters, warm-up and measurement iterations, and
executed another 4 experiments. As expected, it seems like we should
trust more those experiments executed with more iterations, both
warm-up and measurement, as the errors get reduced. But the variability
is still high, and it's possible to find non-overlapping intervals for
the results of the same experiment even when the error is small. See
for example Experiment 6, row `IntrinsicsBenchmark.testHash0040Int`,
and the same row in Experiment 7. So should we also interpret these
numbers qualitatively? Are we running the right number of experiments
with the right parameters? I have more questions than answers :)
Thanks for listening,
Vicente
PS, recommended reading:
[1]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
[2]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
More information about the amber-dev mailing list