how reliable JMH's results are

Tagir Valeev amaembo at gmail.com
Wed Mar 20 02:11:40 UTC 2019


Hello!

It's possible that the JIT compiles methods at slightly different points
in time, capturing a slightly different profile, which affects its
decisions on code generation. After a method is JIT-compiled its
performance is quite stable until the JVM restarts, but it may settle
into a different mode on each run. I would check the generated assembly
for the hot methods; it probably correlates with the observed times.
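
If it helps, here is a minimal sketch of how that assembly could be
dumped through JMH itself. It assumes a Linux machine with perf and the
hsdis disassembler installed; the benchmark selection pattern and class
name are only examples:

    import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class AsmDump {
        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder()
                    // Pick the benchmarks whose scores jump between runs.
                    .include("IntrinsicsBenchmark.testHash0040Int")
                    // Each fork is a fresh JVM, so a different compilation
                    // "mode" can show up in a different fork.
                    .forks(5)
                    // Prints the hottest compiled regions with their assembly.
                    .addProfiler(LinuxPerfAsmProfiler.class)
                    .build();
            new Runner(opt).run();
        }
    }

The same profiler is also reachable from the command line as `-prof
perfasm`.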

With best regards,
Tagir Valeev.

On Wed, Mar 20, 2019, 8:57 Vicente Romero <vicente.romero at oracle.com> wrote:

> Hi,
>
> Today at the amber meeting there was a debate about why some JMH
> results vary between executions. Well, first of all, these are
> experiments, and in all experiments there is a certain degree of
> variability that can't be controlled.
>
> So I grabbed the latest version of our beloved IntrinsicsBenchmark
> class (see [1]) and made some changes to it: basically, every field is
> now initialized with a different literal, as proposed by Maurizio.
> Then I took only the `int`-oriented tests, the first 9 in the
> benchmark, and executed them a number of times. All executions were
> done with vanilla JDK 13, so there are no intrinsics at play here!
>
> First I executed the tests 3 times under the same conditions: 3
> warm-up iterations and 5 measurement iterations. See the wild
> differences in the score and error columns [2], mostly for the tests
> with the smallest number of arguments; I have no explanation for this.
> Then I started playing with both parameters, warm-up and measurement
> iterations, and executed another 4 experiments. As expected, it seems
> we should put more trust in the experiments executed with more
> iterations, both warm-up and measurement, as the errors get reduced.
> But the variability is still high, and it's possible to find
> non-overlapping intervals for the results of the same experiment even
> when the error is small. See, for example, the
> `IntrinsicsBenchmark.testHash0040Int` row in Experiment 6 and the same
> row in Experiment 7. So should we interpret these numbers
> qualitatively too? Are we running the right number of experiments with
> the right parameters? I have more questions than answers :)
>
> Thanks for listening,
> Vicente
>
> PS, OK reading
>
> [1]
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
> [2]
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>
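
Regarding the warm-up and measurement parameters: one way to keep the
experiments comparable is to pin the whole run configuration in the
benchmark class itself. A minimal sketch using JMH's standard
annotations; the concrete numbers below are only an illustration, not a
recommendation:

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(5)                                // 5 fresh JVMs per benchmark
    @Warmup(iterations = 10, time = 1)      // longer warm-up lets the profile settle
    @Measurement(iterations = 10, time = 1) // more samples reduce the reported error
    @State(Scope.Thread)
    public class IntrinsicsBenchmark {
        // ... benchmark methods as in [1] ...
    }

The same knobs exist on the command line as `-f`, `-wi` and `-i`, if it
is preferable to keep the class untouched between experiments.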

