how reliable JMH's results are

Vicente Romero vicente.romero at oracle.com
Tue Mar 26 12:38:56 UTC 2019



On 3/26/19 6:52 AM, Maurizio Cimadamore wrote:
> Ok. That rules out a factor.
>
> But I guess I'm not 100% convinced that it's down to variance. Look 
> again at
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v7.html 
>
>
> vs.
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html 
>
>
> What I find disturbing, specifically, is lines like these (mixed int
> and String, row #1):
> Intrinsified score v7: 131417
> Intrinsified (folding) score v9: 216897
>
> This same result is essentially 2x in v9 what it was in v7. First of all:
>
> * am I right in comparing the 'intrinsified' column in v7 with the
> 'intrinsified folding' column in v9?
the "Intrinsified" column in V7 corresponds to the "Intrinsified 
Filtering" column in V9, sorry about the name change, it could be confusing
> * while statistical noise is always a factor, a consistent 2x factor
> across all rows doesn't look like variance
>
> The results in your latest bench are in line with v9 - that is, the
> 'intrinsified folding' seems around 2x faster than in v7. That leaves
> me with one explanation (excluding environmental factors such as CPU
> throttling and the like): has the JDK intrinsified build changed from
> v7 to v9? That is, is this an apples-to-apples comparison, or have some
> changes been made to the bootstrap methods too, which ended up
> speeding up the bench?

The only difference here is the benchmark; the builds used are exactly
the same. I think the original data you are referring to have a lot of
noise, so I would put more trust in the table I published yesterday.

>
> Maurizio

Vicente

>
>
> On 25/03/2019 22:08, Vicente Romero wrote:
>>
>>
>> On 3/25/19 5:40 PM, Maurizio Cimadamore wrote:
>>> Yep - they do look similar, so it is likely that between your
>>> previous v7 and v9 measurements some other environmental factor
>>> (e.g. a VM change?) was the culprit for the differences.
>>
>> Nope, I have been using the same builds the whole time. The only
>> difference between now and before is that now I have more data, so
>> the results are more reliable.
>>
>>>
>>> Maurizio
>>
>> Vicente
>>
>>>
>>> On 25/03/2019 20:38, Vicente Romero wrote:
>>>> Hi Maurizio,
>>>>
>>>> I have made another run of V7 and V9 and compared them, please see
>>>> [1]. This time I did 5 warm-up and 20 measurement iterations in
>>>> order to obtain more data points and reduce the standard deviation.
>>>> By eyeballing the results, it seemed to me that the difference comes
>>>> from the fact that, while the measurements for both intrinsified
>>>> versions were pretty similar across the two experiments, there was
>>>> higher variability in the measurements for vanilla JDK13. This seems
>>>> to imply that vanilla JDK13 was more sensitive to the changes in the
>>>> benchmark than either of the intrinsified counterparts. To prove
>>>> this I added several columns, named Normalized_Diff, showing the
>>>> difference between the two previous columns normalized to the (0, 1]
>>>> interval. The closer to 1, the more similar the values. These
>>>> columns show that the biggest discrepancies are observed where the
>>>> normalized difference is farthest from 1, which happens for vanilla
>>>> JDK13.
>>>>
>>>> Thanks,
>>>> Vicente
>>>>
>>>> [1] 
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>>>>
>>>> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>>>>> Hi Vicente,
>>>>> honestly I wouldn't call these differences 'wild'. Looking at the
>>>>> first column, for the first bench you go from a minimum score of
>>>>> 238982 to a max score of 270472. The difference between the two is
>>>>> ~30000, around 13% of the measured value, which seems to be a level
>>>>> of noise comparable with the margin of error.
>>>>>
>>>>> What I can't explain is the difference between these two tables:
>>>>>
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html 
>>>>>
>>>>>
>>>>> and
>>>>>
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html 
>>>>>
>>>>>
>>>>> The latter is markedly faster than the former - and not by a mere
>>>>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the
>>>>> first table it seems like we struggle to go past 1x (sometimes
>>>>> we're even lower), while in the second table we get 2x pretty much
>>>>> all over the place. I think this is way beyond the level of
>>>>> variance you observed, and it is something that should be
>>>>> investigated (perhaps with the help of some JIT guru?). In general,
>>>>> all the tables you have shared are more similar to v9 than to v7 -
>>>>> but I'm worried because, if the numbers in v7 are to be taken
>>>>> seriously, the speedup shown there doesn't look like anything to
>>>>> write home about.
>>>>>
>>>>> Another (smaller) thing I fail to understand is how the mixed
>>>>> String/int case sometimes seems to be faster than the separate
>>>>> String or int case (or, in most cases, close to the better speedup
>>>>> of the two). Here, an important point to notice is that mixed
>>>>> StringInt #1 has to be compared with row #2 in the separate
>>>>> int/String benchmarks (because the mixed bench takes two
>>>>> arguments). I would have expected the mixed test to fall somewhere
>>>>> in between the int and String results (since one is definitely
>>>>> faster and the other seems mostly on par) - but that doesn't seem
>>>>> to be the case.
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 20/03/2019 01:56, Vicente Romero wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Today at the amber meeting there was a debate about why some JMH
>>>>>> results vary between executions. Well, first of all, these are
>>>>>> experiments, and in all experiments there is a certain degree of
>>>>>> variability that can't be controlled.
>>>>>>
>>>>>> So I took the latest version of our beloved IntrinsicsBenchmark
>>>>>> class, see [1], and made some changes to it: basically, every
>>>>>> field is now initialized with a different literal, as proposed by
>>>>>> Maurizio. Then I took only the `int`-oriented tests, the first 9
>>>>>> in the benchmark, and executed them a number of times. All
>>>>>> executions were done with vanilla JDK13, no intrinsics at play here!
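>>>>>>
>>>>>> Purely for illustration (the real class is linked at [1]; the
>>>>>> field values and the method name below are made up), the shape of
>>>>>> the change is roughly a JMH state class whose fields each get a
>>>>>> distinct literal, plus `int`-oriented benchmark methods hashing
>>>>>> those fields:
>>>>>>
>>>>>>     import java.util.Objects;
>>>>>>     import org.openjdk.jmh.annotations.*;
>>>>>>
>>>>>>     @State(Scope.Benchmark)
>>>>>>     public class IntrinsicsBenchmarkSketch {
>>>>>>         // each field initialized with a different literal
>>>>>>         int i1 = 11;
>>>>>>         int i2 = 23;
>>>>>>         int i3 = 37;
>>>>>>         int i4 = 41;
>>>>>>
>>>>>>         @Benchmark
>>>>>>         public int testHash0004Int() {
>>>>>>             // plain Objects.hash call; on vanilla JDK13 no intrinsics kick in
>>>>>>             return Objects.hash(i1, i2, i3, i4);
>>>>>>         }
>>>>>>     }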
>>>>>>
>>>>>> First I executed the tests 3 times under the same conditions: 3
>>>>>> warm-up iterations and 5 measurement iterations. See the wild
>>>>>> differences in the score and error columns [2], mostly for the
>>>>>> tests with the smallest number of arguments. I have no explanation
>>>>>> for this. Then I started playing with both parameters, warm-up and
>>>>>> measurement iterations, and executed another 4 experiments. As
>>>>>> expected, it seems we should put more trust in the experiments
>>>>>> executed with more iterations, both warm-up and measurement, as
>>>>>> the errors get reduced. But the variability is still high, and it
>>>>>> is possible to find non-overlapping intervals for the results of
>>>>>> the same experiment even when the error is small. See for example
>>>>>> the Experiment 6 row `IntrinsicsBenchmark.testHash0040Int` and the
>>>>>> same row in Experiment 7. So should we interpret these numbers
>>>>>> qualitatively too? Are we running the right number of experiments
>>>>>> with the right parameters? I have more questions than answers :)
>>>>>>
>>>>>> Thanks for listening,
>>>>>> Vicente
>>>>>>
>>>>>> PS, OK reading
>>>>>>
>>>>>> [1] 
>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>>>>> [2] 
>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>>>>
>>


