how reliable JMH's results are

Vicente Romero vicente.romero at oracle.com
Tue Mar 26 13:09:26 UTC 2019



On 3/26/19 8:38 AM, Vicente Romero wrote:
>
>
> On 3/26/19 6:52 AM, Maurizio Cimadamore wrote:
>> Ok. That rules out a factor.
>>
>> But I guess I'm not 100% convinced that it's down to variance. Look 
>> again at
>>
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v7.html 
>>
>>
>> vs.
>>
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html 
>>
>>
>> What I find disturbing, specifically, are rows like these (mixed int
>> and String, row #1):
>>
>> Intrinsified score v7: 131417
>> Intrinsified (folding) score v9: 216897
>>
>> The same result in v9 is essentially 2x what it was in v7. First of
>> all:
>>
>> * am I right in comparing the 'intrinsified' column in v7 with the
>> 'intrinsified folding' column in v9?
> the "Intrinsified" column in V7 corresponds to the "Intrinsified 
> Filtering" column in V9, sorry about the name change, it could be 
> confusing
>> * while statistical noise is always a factor, a consistent 2x factor
>> across all rows doesn't look like variance
>>
>> The results in your latest bench are in line with v9 - that is, the
>> 'intrinsified folding' column seems around 2x faster than in v7.
>> Which leaves me with one explanation (excluding environmental factors
>> such as CPU throttling and the like): has the intrinsified JDK build
>> changed from v7 to v9? That is, is this an apples-to-apples
>> comparison, or have some changes been made to the bootstrap methods
>> too, which ended up speeding up the bench?
>
> the only difference here is the benchmark; the builds used are exactly
> the same. I think the original data you are referring to has a lot of
> noise, so I would trust the table I published yesterday more

I mean that the same build has been used to obtain each comparable
column in both tables

>
>>
>> Maurizio
>
> Vicente
>
>>
>>
>> On 25/03/2019 22:08, Vicente Romero wrote:
>>>
>>>
>>> On 3/25/19 5:40 PM, Maurizio Cimadamore wrote:
>>>> Yep - they do look similar, so some other environmental factor
>>>> (e.g. a VM change?) was likely the culprit for the differences
>>>> between your previous v7 and v9 measurements.
>>>
>>> nope, I have been using the same builds the whole time since then.
>>> The only difference between now and before is that now I have more
>>> data, so the results are more reliable
>>>
>>>>
>>>> Maurizio
>>>
>>> Vicente
>>>
>>>>
>>>> On 25/03/2019 20:38, Vicente Romero wrote:
>>>>> Hi Maurizio,
>>>>>
>>>>> I have made another run of v7 and v9 and compared them, please see
>>>>> [1]. This time I did 5 warm-up and 20 measurement iterations in
>>>>> order to obtain more data points and reduce the standard deviation.
>>>>> Eyeballing the results, it seemed to me that the difference came
>>>>> from the fact that, while the measurements for both intrinsified
>>>>> versions were pretty similar across the two experiments, there was
>>>>> higher variability in the measurements for vanilla JDK13. This
>>>>> seems to imply that vanilla JDK13 was more sensitive to the changes
>>>>> in the benchmark than either of the intrinsified counterparts. To
>>>>> show this I added several columns, named Normalized_Diff, giving
>>>>> the difference between the two previous columns normalized to the
>>>>> (0, 1] interval; the closer to 1, the more similar the values.
>>>>> These columns show that the biggest discrepancies occur where the
>>>>> normalized difference for vanilla JDK13 is farthest from 1.
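>>>>>
>>>>> For illustration, a min/max ratio is one way to map a pair of
>>>>> scores into (0, 1] as described (a sketch; the helper name is made
>>>>> up, and the exact formula behind the published tables may differ):
>>>>>
>>>>>     // 1.0 means the two scores are identical; values near 0 mean
>>>>>     // the two measurements diverge widely.
>>>>>     static double normalizedDiff(double scoreA, double scoreB) {
>>>>>         return Math.min(scoreA, scoreB) / Math.max(scoreA, scoreB);
>>>>>     }
>>>>>
>>>>> e.g. normalizedDiff(131417, 216897) ~= 0.61.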
>>>>>
>>>>> Thanks,
>>>>> Vicente
>>>>>
>>>>> [1] 
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>>>>>
>>>>> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>>>>>> Hi Vicente,
>>>>>> honestly I wouldn't call these differences 'wild'. Looking at the
>>>>>> first column, for the first bench you go from a minimum score of
>>>>>> 238982 to a max score of 270472. The difference between the two is
>>>>>> ~31000 (270472 - 238982 = 31490), which is around 13% of the
>>>>>> measured value; that seems to be a level of noise comparable with
>>>>>> the margin of error.
>>>>>>
>>>>>> What I can't explain is the difference between these two tables:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html 
>>>>>>
>>>>>>
>>>>>> and
>>>>>>
>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html 
>>>>>>
>>>>>>
>>>>>> The latter is markedly faster than the former - and not by a mere
>>>>>> 13%, sometimes by a 2x factor. Look at Table 2 (String): in the
>>>>>> first table it seems like we struggle to go past 1x (sometimes
>>>>>> we're even lower), while in the second table we get 2x pretty much
>>>>>> all over the place. I think this is way beyond the level of
>>>>>> variance you observed, and it is something that should be
>>>>>> investigated (perhaps with the help of some JIT guru?). In general
>>>>>> all the tables you have shared are more similar to v9 than to v7 -
>>>>>> but I'm worried because, if the numbers in v7 are to be taken
>>>>>> seriously, the speedup shown there doesn't look like anything to
>>>>>> write home about.
>>>>>>
>>>>>> Another (smaller) thing I fail to understand is how the mixed
>>>>>> String/int case sometimes seems to be faster than the separate
>>>>>> String or int case (or, in most cases, close to the better speedup
>>>>>> of the two). Here, an important point is that mixed StringInt #1
>>>>>> has to be compared with row #2 in the separate int/String
>>>>>> benchmarks (because the mixed bench takes two arguments). I would
>>>>>> have expected the mixed test to fall somewhere in between the int
>>>>>> and String results (since one is definitely faster and the other
>>>>>> seems mostly on par) - but that doesn't seem to be the case.
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> On 20/03/2019 01:56, Vicente Romero wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Today at the amber meeting there was a debate about why some JMH
>>>>>>> results vary between executions. Well, first of all, these are
>>>>>>> experiments, and in all experiments there is a certain degree of
>>>>>>> variability that can't be controlled.
>>>>>>>
>>>>>>> So I got the latest version of our beloved IntrinsicsBenchmark
>>>>>>> class, see [1], and made some changes to it: basically, every
>>>>>>> field has been initialized with a different literal, as proposed
>>>>>>> by Maurizio. Then I took only the `int`-oriented tests, the first
>>>>>>> 9 in the benchmark, and executed them a number of times. All
>>>>>>> executions were done with vanilla JDK13; no intrinsics at play
>>>>>>> here!
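>>>>>>>
>>>>>>> In code, the change amounts to something like this (a sketch with
>>>>>>> hypothetical field names and literals; the real class is in [1]):
>>>>>>>
>>>>>>>     import org.openjdk.jmh.annotations.Scope;
>>>>>>>     import org.openjdk.jmh.annotations.State;
>>>>>>>
>>>>>>>     @State(Scope.Benchmark)
>>>>>>>     public class IntrinsicsState {
>>>>>>>         // Every field gets its own literal, so the JIT cannot
>>>>>>>         // treat several fields as copies of one constant and
>>>>>>>         // fold them together.
>>>>>>>         int i1 = 11, i2 = 23, i3 = 37;
>>>>>>>         String s1 = "s1", s2 = "s2", s3 = "s3";
>>>>>>>     }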
>>>>>>>
>>>>>>> First I executed the tests 3 times under the same conditions: 3
>>>>>>> warm-up iterations and 5 measurement iterations. See the wild
>>>>>>> differences in the score and error columns [2], mostly for the
>>>>>>> tests with the smallest number of arguments. I have no
>>>>>>> explanation for this. Then I started playing with both
>>>>>>> parameters, warm-up and measurement iterations, and executed
>>>>>>> another 4 experiments. As expected, it seems like we should trust
>>>>>>> the experiments executed with more iterations, both for warm-up
>>>>>>> and measurement, as the errors get reduced. But the variability
>>>>>>> is still high, and it's possible to find non-overlapping
>>>>>>> intervals for the results of the same experiment even when the
>>>>>>> error is small. See for example the
>>>>>>> `IntrinsicsBenchmark.testHash0040Int` row in Experiment 6 and the
>>>>>>> same row in Experiment 7. So should we interpret these numbers
>>>>>>> qualitatively too? Are we running the right number of experiments
>>>>>>> with the right parameters? I have more questions than answers :)
>>>>>>>
>>>>>>> Thanks for listening,
>>>>>>> Vicente
>>>>>>>
>>>>>>> PS, OK reading
>>>>>>>
>>>>>>> [1] 
>>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>>>>>> [2] 
>>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>>>>>
>>>
>


