how reliable JMH's results are
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Mar 21 11:16:18 UTC 2019
Hi Vicente,
honestly I wouldn't call these differences 'wild'. Looking at the first
column, for the first bench you go from a minimum score of 238982 to a
max score of 270472. The difference between the two is ~31000, which is
around 13% of the measured value, and that seems to be a level of noise
comparable with the reported margin of error.
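To make the "comparable with the margin of error" argument concrete, here is a small plain-Java sketch (the class, method, and error values are mine, purely illustrative, not from the benchmark harness): treat each JMH result as the interval [score - error, score + error], and consider two results statistically indistinguishable when their intervals overlap:

```java
// Illustrative helper: two JMH-style results (score +/- error) are
// indistinguishable from noise when their intervals overlap.
public class ScoreIntervals {
    static boolean overlap(double score1, double err1, double score2, double err2) {
        double lo1 = score1 - err1, hi1 = score1 + err1;
        double lo2 = score2 - err2, hi2 = score2 + err2;
        // Intervals overlap iff each one starts before the other ends.
        return lo1 <= hi2 && lo2 <= hi1;
    }

    public static void main(String[] args) {
        // Hypothetical error bars around the min/max scores quoted above:
        // with +/-16000, [222982, 254982] and [254472, 286472] overlap...
        System.out.println(overlap(238982, 16000, 270472, 16000)); // true
        // ...with +/-10000, [228982, 248982] and [260472, 280472] do not.
        System.out.println(overlap(238982, 10000, 270472, 10000)); // false
    }
}
```

By this reading, whether the 238982-vs-270472 spread is "real" depends entirely on the size of the error column next to it.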
What I can't explain is the difference between these two tables:
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
and
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
The latter is markedly faster than the former, and not by a mere 12%:
sometimes by a 2x factor. Look at Table 2: String. In the first
table it seems like we struggle to go past 1x (sometimes we're even
lower), while in the second table we get 2x pretty much all over the
place. I think this is way beyond the level of variance you observed,
and it is something that should be investigated (perhaps with the help
of some JIT guru?). In general, all the tables you have shared are more
similar to v9 than to v7, but I'm worried because, if the numbers in v7
are to be taken seriously, the speedup shown there isn't anything to
write home about.
Another (smaller) thing I fail to understand is how the mixed String/int
case sometimes seems to be faster than the separate String or int cases
(or, in most cases, close to the better of the two speedups). An
important point here is that row #1 of the mixed StringInt benchmark has
to be compared with row #2 of the separate int/String benchmarks
(because the mixed bench takes two arguments). I would have expected the
mixed test to fall somewhere in between the int and String results
(since one is definitely faster and the other seems mostly on par), but
that doesn't seem to be the case.
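As an aside on the reliability question in the subject line: besides bumping warm-up and measurement iterations, running more forks is the usual JMH way to smooth out run-to-run variance, since each fork gets a fresh JVM. A sketch of the command line, assuming the usual JMH uber-jar setup (the counts here are arbitrary examples, not a recommendation):

```shell
# 10 warm-up iterations, 20 measurement iterations, 5 JVM forks
java -jar benchmarks.jar IntrinsicsBenchmark -wi 10 -i 20 -f 5
```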
Maurizio
On 20/03/2019 01:56, Vicente Romero wrote:
> Hi,
>
> Today at the amber meeting there was a debate about why some JMH
> results vary between executions. Well, first of all, these are
> experiments, and in all experiments there is a certain degree of
> variability that can't be controlled.
>
> So I got the latest version of our beloved IntrinsicsBenchmark class
> (see [1]) and made some changes to it: basically, every field has been
> initialized with a different literal, as proposed by Maurizio. Then I
> took only the `int`-oriented tests, the first 9 in the benchmark, and
> executed them a number of times. All executions were done with vanilla
> JDK 13, no intrinsics at play here!
>
> First I executed the tests 3 times under the same conditions: 3 warm-up
> iterations and 5 measurement iterations. See the wild differences in
> the score and error columns [2], mostly for the tests with the
> smallest number of arguments. I have no explanation for this. Then I
> started playing with both parameters, warm-up and measurement
> iterations, and ran another 4 experiments. As expected, it seems like
> we should put more trust in the experiments executed with more
> iterations, both for warm-up and measurement, as the errors get
> reduced. But the variability is still high, and it's possible to find
> non-overlapping intervals for the results of the same experiment even
> when the error is small. See, for example, the Experiment 6 row
> `IntrinsicsBenchmark.testHash0040Int` and the same row in Experiment
> 7. So should we interpret these numbers qualitatively too? Are we
> running the right number of experiments with the right parameters? I
> have more questions than answers :)
>
> Thanks for listening,
> Vicente
>
> PS, OK reading
>
> [1]
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
> [2]
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
More information about the amber-dev mailing list