how reliable JMH's results are
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Mar 21 11:16:18 UTC 2019
Hi Vicente,
honestly I wouldn't call these differences 'wild'. Looking at the first
column, for the first bench you go from a minimum score of 238982 to a
max score of 270472. The difference between the two is ~31000, which is
around 13% of the measured value, and that seems to be a level of noise
comparable with the reported margin of error.
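To make the "comparable with the margin of error" argument concrete, here is a small plain-Java sketch (the class, method, and error values are mine, purely illustrative, not from the benchmark harness): treat each JMH result as the interval [score - error, score + error], and consider two results statistically indistinguishable when their intervals overlap:

```java
// Illustrative helper: two JMH-style results (score +/- error) are
// indistinguishable from noise when their intervals overlap.
public class ScoreIntervals {
    static boolean overlap(double score1, double err1, double score2, double err2) {
        double lo1 = score1 - err1, hi1 = score1 + err1;
        double lo2 = score2 - err2, hi2 = score2 + err2;
        // Intervals overlap iff each one starts before the other ends.
        return lo1 <= hi2 && lo2 <= hi1;
    }

    public static void main(String[] args) {
        // Hypothetical error bars around the min/max scores quoted above:
        // with +/-16000, [222982, 254982] and [254472, 286472] overlap...
        System.out.println(overlap(238982, 16000, 270472, 16000)); // true
        // ...with +/-10000, [228982, 248982] and [260472, 280472] do not.
        System.out.println(overlap(238982, 10000, 270472, 10000)); // false
    }
}
```

By this reading, whether the 238982-vs-270472 spread is "real" depends entirely on the size of the error column next to it.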
What I can't explain is the difference between these two tables:
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
and
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
The latter is markedly faster than the former, and not by a mere 12%:
sometimes by a 2x factor. Look at Table 2: String. In the first
table it seems like we struggle to go past 1x (sometimes we're even
lower), while in the second table we get 2x pretty much all over the
place. I think this is way beyond the level of variance you observed,
and it is something that should be investigated (perhaps with the help
of some JIT guru?). In general, all the tables you have shared are more
similar to v9 than to v7, but I'm worried because, if the numbers in v7
are to be taken seriously, the speedup shown there isn't anything to
write home about.
Another (smaller) thing I fail to understand is how the mixed String/int
case sometimes seems to be faster than the separate String or int cases
(or, in most cases, close to the better of the two speedups). An
important point here is that row #1 of the mixed StringInt benchmark has
to be compared with row #2 of the separate int/String benchmarks
(because the mixed bench takes two arguments). I would have expected the
mixed test to fall somewhere in between the int and String results
(since one is definitely faster and the other seems mostly on par), but
that doesn't seem to be the case.
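As an aside on the reliability question in the subject line: besides bumping warm-up and measurement iterations, running more forks is the usual JMH way to smooth out run-to-run variance, since each fork gets a fresh JVM. A sketch of the command line, assuming the usual JMH uber-jar setup (the counts here are arbitrary examples, not a recommendation):

```shell
# 10 warm-up iterations, 20 measurement iterations, 5 JVM forks
java -jar benchmarks.jar IntrinsicsBenchmark -wi 10 -i 20 -f 5
```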
Maurizio
On 20/03/2019 01:56, Vicente Romero wrote:
> Hi,
>
> Today at the amber meeting there was a debate about why some JMH
> results vary between executions. Well, first of all, these are
> experiments, and in all experiments there is a certain degree of
> variability that can't be controlled.
>
> So I got the latest version of our beloved IntrinsicsBenchmark class
> (see [1]) and made some changes to it: basically, every field has been
> initialized with a different literal, as proposed by Maurizio. Then I
> took only the `int`-oriented tests, the first 9 in the benchmark, and
> executed them a number of times. All executions were done with vanilla
> JDK 13, no intrinsics at play here!
>
> First I executed the tests 3 times under the same conditions: 3 warm-up
> iterations and 5 measurement iterations. See the wild differences in
> the score and error columns [2], mostly for the tests with the
> smallest number of arguments. I have no explanation for this. Then I
> started playing with both parameters, warm-up and measurement
> iterations, and ran another 4 experiments. As expected, it seems like
> we should put more trust in the experiments executed with more
> iterations, both for warm-up and measurement, as the errors get
> reduced. But the variability is still high, and it's possible to find
> non-overlapping intervals for the results of the same experiment even
> when the error is small. See, for example, the Experiment 6 row
> `IntrinsicsBenchmark.testHash0040Int` and the same row in Experiment
> 7. So should we interpret these numbers qualitatively too? Are we
> running the right number of experiments with the right parameters? I
> have more questions than answers :)
>
> Thanks for listening,
> Vicente
>
> PS, OK reading
>
> [1]
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
> [2]
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
More information about the amber-dev mailing list