how reliable JMH's results are

Vicente Romero vicente.romero at oracle.com
Fri Mar 22 12:12:19 UTC 2019



On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
> Hi Vicente,
> honestly I wouldn't call these differences 'wild'. Looking at the 
> first column, for the first bench you go from a minimum score of 
> 238982 to a max score of 270472. The difference between the two is 
> ~31500, which is around 13% of the measured value - a level of noise 
> comparable with the margin of error.
>
> What I can't explain is the difference between these two tables:
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html 
>
>
> and
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html 
>

I've been rerunning both tests since yesterday with more measurement 
iterations, to get a bigger sample and reduce the standard deviation. 
Then we can make a better comparison and see whether the numbers are 
really different. But recall that we are dividing the averages of two 
random variables, so the error of the result can sometimes be bigger 
than the error of either variable on its own. Also, in both v7 and v9 
there seems to be more variability in the vanilla JDK 13 measurements, 
while the intrinsics measurements were more stable.
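
To make that concrete: for a ratio r = x/y of two independent 
measurements, the relative errors add in quadrature to first order, so 
the ratio can be noisier than either measurement alone. A minimal 
sketch (my own illustration; the numbers are made up, not taken from 
the tables):

public class RatioError {
    // First-order error propagation for r = x / y, assuming independent
    // measurements; sx and sy are their absolute errors (e.g. the error
    // column that JMH prints next to each score).
    static double ratioError(double x, double sx, double y, double sy) {
        double relX = sx / x;
        double relY = sy / y;
        // relative errors add in quadrature: (sr/r)^2 = (sx/x)^2 + (sy/y)^2
        return Math.abs(x / y) * Math.sqrt(relX * relX + relY * relY);
    }

    public static void main(String[] args) {
        // made-up scores: 250000 +/- 12500 divided by 125000 +/- 6250
        double r = 250000.0 / 125000.0;
        double sr = ratioError(250000.0, 12500.0, 125000.0, 6250.0);
        System.out.printf("ratio = %.2f +/- %.2f%n", r, sr); // 2.00 +/- 0.14
    }
}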

>
> The latter is markedly faster than the former - and not by a mere 12%, 
> sometimes by a 2x factor. Look at Table 2: String - in the first 
> table it seems like we struggle to go past 1x (sometimes we're even 
> lower), while in the second table we get 2x pretty much all over the 
> place. I think this is way beyond the level of variance you observed, 
> and it is something that should be investigated (perhaps with the 
> help of some JIT guru?). In general all the tables you have shared 
> are more similar to v9 than to v7 - but I'm worried because if the 
> numbers in v7 are to be taken seriously, the speedup shown there 
> doesn't look like anything to write home about.
>
> Another (smaller) thing I fail to understand is how the mixed 
> String/int case sometimes seems to be faster than the separate String 
> or int case (or, in most cases, close to the better speedup of the 
> two). Here, an important point to notice is that mixed StringInt #1 
> has to be compared with row #2 in the separate int/String 
> benchmarks (because the mixed bench takes two arguments). I would 
> have expected the mixed test to fall somewhere in between the int and 
> String results (since one is definitely faster and the other seems 
> mostly on par) - but that doesn't seem to be the case.
>
> Maurizio

Vicente

>
> On 20/03/2019 01:56, Vicente Romero wrote:
>> Hi,
>>
>> Today at the amber meeting there was a debate about why some JMH 
>> results vary between executions. First of all, these are 
>> experiments, and in all experiments there is a certain degree of 
>> variability that can't be controlled.
>>
>> So I took the latest version of our beloved IntrinsicsBenchmark 
>> class (see [1]) and made some changes to it: basically, every field 
>> has been initialized with a different literal, as proposed by 
>> Maurizio. Then I took only the `int` oriented tests, the first 9 in 
>> the benchmark, and executed them a number of times. All executions 
>> were done with vanilla JDK 13 - no intrinsics at play here!
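>>
>> As a rough sketch, the shape of the change is something like this 
>> (hypothetical names and literals; the real class is in [1]):
>>
>> import java.util.Objects;
>> import org.openjdk.jmh.annotations.*;
>>
>> @State(Scope.Benchmark)
>> public class IntrinsicsBenchmarkSketch {
>>     // each field gets its own literal, so the JIT cannot share or
>>     // constant-fold values across fields
>>     int int1 = 101;
>>     int int2 = 202;
>>
>>     @Benchmark
>>     public int testHash0002Int() {
>>         return Objects.hash(int1, int2);
>>     }
>> }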
>>
>> First I executed the tests 3 times under the same conditions: 3 
>> warm-up iterations and 5 measurement iterations. See the wild 
>> differences in the score and error columns [2], mostly for the 
>> tests with the smallest number of arguments. I have no explanation 
>> for this. Then I started playing with both parameters, warm-up and 
>> measurement iterations, and executed another 4 experiments. As 
>> expected, it seems like we should trust more those experiments 
>> executed with more iterations, both for warm-up and measurement, as 
>> the errors get reduced. But the variability is still high, and it's 
>> possible to find non-overlapping intervals for the results of the 
>> same experiment even when the error is small. See for example the 
>> `IntrinsicsBenchmark.testHash0040Int` row in experiment 6 and the 
>> same row in experiment 7. So should we interpret these numbers 
>> qualitatively too? Are we running the right number of experiments 
>> with the right parameters? I have more questions than answers :)
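>>
>> For reference, these are the two knobs I was varying, shown 
>> programmatically (the runner class name is made up; the equivalent 
>> JMH command-line flags are -wi and -i):
>>
>> import org.openjdk.jmh.runner.Runner;
>> import org.openjdk.jmh.runner.options.Options;
>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>
>> public class RunIntrinsicsBench {
>>     public static void main(String[] args) throws Exception {
>>         Options opts = new OptionsBuilder()
>>                 .include("IntrinsicsBenchmark")
>>                 .warmupIterations(3)       // -wi 3, the 3/5 setup above
>>                 .measurementIterations(5)  // -i 5
>>                 .build();
>>         new Runner(opts).run();
>>     }
>> }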
>>
>> Thanks for listening,
>> Vicente
>>
>> PS, OK reading
>>
>> [1] 
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>> [2] 
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html


