how reliable JMH's results are

Vicente Romero vicente.romero at oracle.com
Mon Mar 25 20:38:20 UTC 2019


Hi Maurizio,

I have made another run of V7 and V9 and compared them, please see [1].
This time I did 5 warm-up and 20 measurement iterations in order to
obtain more data points and reduce the standard deviation. Eyeballing
the results, it seemed to me that the difference comes from the fact
that, while the measurements for both intrinsified versions were pretty
similar across the two experiments, there was higher variability in the
measurements for vanilla JDK13. This seems to imply that vanilla JDK13
was more sensitive to the changes in the benchmark than either of the
intrinsified counterparts. To back this up I added several columns,
named Normalized_Diff, which show the difference between the two
previous columns normalized to the (0, 1] interval. The closer to 1,
the more similar the values. Reading these columns, the biggest
discrepancies are observed exactly where the normalized difference for
vanilla JDK13 is farthest from 1.
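
For concreteness, a mapping with exactly these properties is the ratio
of the smaller score to the larger one; something along the lines of
the sketch below (this is only to illustrate how to read the column,
not a verbatim copy of the script that produced it):

import static java.lang.Math.max;
import static java.lang.Math.min;

// Illustrative sketch only: one mapping into (0, 1] where 1 means the
// two scores are identical; the Normalized_Diff column in [1] may be
// produced by a slightly different computation.
class NormalizedDiff {
    static double of(double a, double b) {
        return min(a, b) / max(a, b);   // in (0, 1] for positive scores
    }

    public static void main(String[] args) {
        // e.g. the two scores Maurizio mentions below: 238982 vs 270472
        System.out.println(of(238982, 270472)); // prints ~0.88
    }
}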
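
For completeness, the 5 warm-up / 20 measurement iteration setup above
is just the usual JMH configuration (either the -wi/-i command line
flags or the @Warmup/@Measurement annotations). A minimal skeleton,
with a placeholder benchmark body rather than the actual
IntrinsicsBenchmark code from [1], would look like:

import java.util.Objects;

import org.openjdk.jmh.annotations.*;

// Placeholder skeleton: the real benchmark class is the one linked in
// [1]. This only shows the iteration settings used for this run; the
// fork count and the benchmark body are made-up placeholder values.
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5)
@Measurement(iterations = 20)
@Fork(1)
@State(Scope.Thread)
public class IterationSetupSketch {

    int i01 = 1;   // every field initialized with a different literal

    @Benchmark
    public int testHashIntPlaceholder() {
        return Objects.hash(i01);
    }
}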

Thanks,
Vicente

[1] 
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html

On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
> Hi Vicente,
> honestly I wouldn't call these differences 'wild'. Looking at the 
> first column, for the first bench you go from a minimum score of 
> 238982 to a max score of 270472. The difference between the two is 
> ~30000, which is around 12% of the measured value, a level of noise 
> comparable with the margin of error.
>
> What I can't explain is the difference between these two tables:
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>
> and
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>
> The latter is markedly faster than the former - and not by a mere 
> 12%, sometimes by a 2x factor. Look at Table 2: String - in the first 
> table it seems like we struggle to go past 1x (sometimes we're even 
> lower), in the second table we get 2x pretty much all over the place. 
> I think this is way beyond the level of variance you observed, and it 
> is something that should be investigated (perhaps with the help of 
> some JIT guru?). In general all the tables you have shared are more 
> similar to v9 than to v7 - but I'm worried because, if the numbers in 
> v7 are to be taken seriously, the speedup shown there doesn't look 
> like anything to write home about.
>
> Another (smaller) thing I fail to understand is how the mixed 
> String/int case sometimes seems to be faster than the separate String 
> or int case (or, in most cases, close to the best speedup of the 
> two). Here, an important point to notice is that mixed StringInt #1 
> has to be compared with row #2 in the separate int/String benchmarks 
> (because the mixed bench takes two arguments). I would have expected 
> the mixed test to fall somewhere in between the int and String 
> results (since one is definitely faster and the other seems mostly on 
> par) - but that doesn't seem to be the case.
>
> Maurizio
>
> On 20/03/2019 01:56, Vicente Romero wrote:
>> Hi,
>>
>> Today at the amber meeting there was a debate about why some JMH 
>> results vary between executions. Well, first of all, these are 
>> experiments, and in all experiments there is a certain degree of 
>> variability that can't be controlled.
>>
>> So I got the last version of our beloved IntrinsicsBenchmark class, 
>> see [1], and made some changes to it: basically every field has been 
>> initialized with a different literal, as proposed by Maurizio. Then I 
>> took only the `int` oriented tests, the first 9 in the benchmark, and 
>> executed them a number of times. All executions were done with 
>> vanilla JDK13, so no intrinsics at play here!
>>
>> First I executed the tests 3 times under the same conditions: 3 
>> warm-up iterations and 5 measurement iterations. See the wild 
>> differences in the score and error columns [2], mostly for the tests 
>> with the smallest number of arguments. I have no explanation for 
>> this. Then I started playing with both parameters, warm-up and 
>> measurement iterations, and executed another 4 experiments. As 
>> expected, it seems like we should trust more those experiments 
>> executed with more iterations, both for warm-up and measurement, as 
>> the errors get reduced. But the variability is still high, and it is 
>> possible to find non-overlapping intervals for the results of the 
>> same experiment even when the error is small. See for example 
>> Experiment 6, row `IntrinsicsBenchmark.testHash0040Int`, and the 
>> same row in Experiment 7. So should we interpret these numbers 
>> qualitatively too? Are we running the right number of experiments 
>> with the right parameters? I have more questions than answers :)
>>
>> Thanks for listening,
>> Vicente
>>
>> PS, OK reading
>>
>> [1] 
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>> [2] 
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html


