how reliable JMH's results are
Vicente Romero
vicente.romero at oracle.com
Wed Mar 20 01:56:35 UTC 2019
Hi,
Today at the amber meeting there was a debate about why some JMH
results vary between executions. Well, first of all, these are
experiments, and in all experiments there is a certain degree of
variability that can't be controlled.
So I took the latest version of our beloved IntrinsicsBenchmark class,
see [1], and made some changes to it: basically, every field has been
initialized with a different literal, as proposed by Maurizio. Then I
took only the `int`-oriented tests, the first 9 in the benchmark, and
executed them a number of times. All executions were done with vanilla
JDK 13, so no intrinsics at play here!
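The actual benchmark is at [1]; as a rough sketch of the kind of
workload the `int`-oriented tests exercise (the class, field names, and
literal values here are illustrative, not the real IntrinsicsBenchmark
code, and the JMH annotation plumbing is omitted):

```java
import java.util.Objects;

// Illustrative sketch only: the real benchmark at [1] uses JMH
// annotations (@Benchmark, @State, ...). The point of giving every
// field a *different* literal is to keep the JIT from constant-folding
// or sharing values across tests.
public class HashSketch {
    // Distinct literals per field, as proposed by Maurizio.
    static int i1 = 101, i2 = 202, i3 = 303, i4 = 404;

    // A small-arity variant, where run-to-run variance was largest.
    static int hash2Ints() {
        return Objects.hash(i1, i2);
    }

    // A "testHash0040Int"-style workload hashing four int arguments.
    static int hash4Ints() {
        return Objects.hash(i1, i2, i3, i4);
    }

    public static void main(String[] args) {
        System.out.println(hash2Ints() + " " + hash4Ints());
    }
}
```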
First I executed the tests 3 times under the same conditions: 3 warm-up
iterations and 5 measurement iterations. See the wild differences in
the score and error columns [2], mostly for the tests with the smallest
number of arguments. I have no explanation for this. Then I started
playing with both parameters, warm-up and measurement iterations, and
executed another 4 experiments. As expected, it seems like we should
trust more those experiments executed with more iterations, both
warm-up and measurement, as the errors get reduced. But the variability
is still high, and it's possible to find non-overlapping intervals for
the results of the same experiment even when the error is small. See
for example Experiment 6, row `IntrinsicsBenchmark.testHash0040Int`,
and the same row in Experiment 7. So should we also interpret these
numbers qualitatively? Are we running the right number of experiments
with the right parameters? I have more questions than answers :)
Thanks for listening,
Vicente
PS, recommended reading:
[1]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
[2]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
More information about the amber-dev mailing list