how reliable JMH's results are

Vicente Romero vicente.romero at oracle.com
Mon Mar 25 22:08:23 UTC 2019



On 3/25/19 5:40 PM, Maurizio Cimadamore wrote:
> Yep - they do look similar, so it is likely that between your previous 
> v7 and v9 measurements some other environmental factor (e.g. a VM 
> change?) was the culprit for the differences.

nope, I have been using the same builds all along. The only 
difference between now and before is that I now have more data, so 
the results are more reliable.

>
> Maurizio

Vicente

>
> On 25/03/2019 20:38, Vicente Romero wrote:
>> Hi Maurizio,
>>
>> I have made another run of V7 and V9 and compared them, please see 
>> [1]. This time I did 5 warm-up and 20 measurement iterations in 
>> order to obtain more data points and reduce the standard deviation. 
>> Eyeballing the results, it seemed to me that the difference comes 
>> from the fact that, while the measurements for both intrinsified 
>> versions were pretty similar across the two experiments, there was 
>> higher variability in the measurements for vanilla JDK13. This 
>> seems to imply that vanilla JDK13 was more sensitive to changes in 
>> the benchmark than either of the intrinsified counterparts. To 
>> check this I added several columns, named Normalized_Diff, showing 
>> the difference between the two previous columns normalized to the 
>> (0, 1] interval. The closer to 1, the more similar the values. 
>> These columns show that the biggest discrepancies occur exactly 
>> where the normalized difference for vanilla JDK13 is farthest 
>> from 1.
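>>
>> (As a minimal sketch, assuming the normalization is simply the 
>> ratio of the smaller to the larger score; the class and method 
>> names here are made up for illustration:)
>>
>>     // NormalizedDiff.java - sketch of a (0, 1] normalization for the
>>     // comparison tables: identical scores give 1.0, and the value
>>     // drops toward 0 as the two scores diverge.
>>     public class NormalizedDiff {
>>         static double normalizedDiff(double a, double b) {
>>             return Math.min(a, b) / Math.max(a, b);
>>         }
>>
>>         public static void main(String[] args) {
>>             System.out.println(normalizedDiff(238982, 270472)); // ~0.88
>>         }
>>     }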
>>
>> Thanks,
>> Vicente
>>
>> [1] 
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>>
>> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>>> Hi Vicente,
>>> honestly I wouldn't call these differences 'wild'. Looking at the 
>>> first column, for the first bench you go from a minimum score of 
>>> 238982 to a max score of 270472. The difference between the two is 
>>> ~31500, which is around 13% of the measured value and seems to be 
>>> a level of noise comparable with the margin of error.
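>>>
>>> (Just to spell out the arithmetic behind that estimate, using the 
>>> min/max scores quoted above:)
>>>
>>>     // Back-of-the-envelope estimate of the run-to-run spread.
>>>     public class SpreadCheck {
>>>         public static void main(String[] args) {
>>>             double min = 238_982, max = 270_472;
>>>             double spread = max - min;       // ~31,500
>>>             double relative = spread / min;  // ~0.13, i.e. ~13% noise
>>>             System.out.printf("spread=%.0f relative=%.1f%%%n",
>>>                               spread, relative * 100);
>>>         }
>>>     }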
>>>
>>> What I can't explain is the difference between these two tables:
>>>
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html 
>>>
>>>
>>> and
>>>
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html 
>>>
>>>
>>> The latter is markedly faster than the former - and not by a mere 
>>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the 
>>> first table it seems like we struggle to go past 1x (sometimes 
>>> we're even lower), while in the second table we get 2x pretty much 
>>> across the board. I think this is way beyond the level of variance 
>>> you observed, and this is something that should be investigated 
>>> (perhaps with the help of some JIT guru?). In general all the 
>>> tables you have shared are more similar to v9 than to v7 - but I'm 
>>> worried because, if the numbers in v7 are to be taken seriously, 
>>> the speedup shown there isn't anything to write home about.
>>>
>>> Another (smaller) thing I fail to understand is how the mixed 
>>> String/int case sometimes seems to be faster than the separate 
>>> String or int case (or, in most cases, close to the better speedup 
>>> of the two). Here, an important point to notice is that mixed 
>>> StringInt #1 has to be compared with row #2 of the separate 
>>> int/String benchmarks (because the mixed bench takes two 
>>> arguments); see the arity sketch below. I would have expected the 
>>> mixed test to fall somewhere in between the int and String results 
>>> (since one is definitely faster and the other seems mostly on par) 
>>> - but that doesn't seem to be the case.
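>>>
>>> (A minimal sketch of what I mean by arity here; the method names 
>>> and bodies are made up for illustration and are not the actual 
>>> IntrinsicsBenchmark code:)
>>>
>>>     import java.util.Objects;
>>>     import org.openjdk.jmh.annotations.*;
>>>
>>>     @State(Scope.Benchmark)
>>>     public class MixedArityExample {
>>>         int i1 = 17, i2 = 31;
>>>         String s1 = "foo", s2 = "bar";
>>>
>>>         // Mixed bench: two arguments (one String, one int)...
>>>         @Benchmark
>>>         public int mixedStringInt1() { return Objects.hash(s1, i1); }
>>>
>>>         // ...so its natural baseline is the two-argument row of the
>>>         // single-type benches, not the one-argument row.
>>>         @Benchmark
>>>         public int int2() { return Objects.hash(i1, i2); }
>>>
>>>         @Benchmark
>>>         public int string2() { return Objects.hash(s1, s2); }
>>>     }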
>>>
>>> Maurizio
>>>
>>> On 20/03/2019 01:56, Vicente Romero wrote:
>>>> Hi,
>>>>
>>>> Today at the amber meeting there was a debate about why some JMH 
>>>> results vary between executions. Well, first of all, these are 
>>>> experiments, and in all experiments there is a certain degree of 
>>>> variability that can't be controlled.
>>>>
>>>> So I took the latest version of our beloved IntrinsicsBenchmark 
>>>> class, see [1], and made some changes to it: basically, every 
>>>> field has been initialized with a different literal, as proposed 
>>>> by Maurizio. Then I took only the `int`-oriented tests, the first 
>>>> 9 in the benchmark, and executed them a number of times. All 
>>>> executions were done with vanilla JDK13 - no intrinsics at play 
>>>> here!
>>>>
>>>> First I executed the tests 3 times under the same conditions: 3 
>>>> warm-up iterations and 5 measurement iterations. See the wild 
>>>> differences in the score and error columns [2], mostly for the 
>>>> tests with the smallest number of arguments. I have no 
>>>> explanation for this. Then I started playing with both 
>>>> parameters, warm-up and measurement iterations, and executed 
>>>> another 4 experiments. As expected, it seems like we should put 
>>>> more trust in the experiments executed with more iterations, both 
>>>> warm-up and measurement, as the errors get reduced. But the 
>>>> variability is still high, and it's possible to find 
>>>> non-overlapping intervals for the results of the same experiment 
>>>> even when the error is small. See, for example, the row 
>>>> `IntrinsicsBenchmark.testHash0040Int` in Experiment 6 versus the 
>>>> same row in Experiment 7. So should we interpret these numbers 
>>>> qualitatively too? Are we running the right number of experiments 
>>>> with the right parameters? I have more questions than answers :)
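>>>>
>>>> (For reference, the warm-up/measurement iteration counts can be 
>>>> set with JMH annotations, or overridden on the command line with 
>>>> -wi and -i. The class below is just a standalone sketch of that 
>>>> configuration, not the actual IntrinsicsBenchmark:)
>>>>
>>>>     import java.util.Objects;
>>>>     import java.util.concurrent.TimeUnit;
>>>>     import org.openjdk.jmh.annotations.*;
>>>>
>>>>     @BenchmarkMode(Mode.Throughput)
>>>>     @OutputTimeUnit(TimeUnit.MILLISECONDS)
>>>>     @Warmup(iterations = 3)       // 3 warm-up iterations per fork
>>>>     @Measurement(iterations = 5)  // 5 measurement iterations per fork
>>>>     @Fork(1)
>>>>     @State(Scope.Benchmark)
>>>>     public class IterationConfigExample {
>>>>         int i1 = 42;
>>>>
>>>>         @Benchmark
>>>>         public int hashOneInt() {
>>>>             return Objects.hash(i1);
>>>>         }
>>>>     }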
>>>>
>>>> Thanks for listening,
>>>> Vicente
>>>>
>>>> PS, OK reading
>>>>
>>>> [1] 
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>>> [2] 
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>>