how reliable JMH's results are

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Mar 25 21:40:08 UTC 2019


Yep - they do look similar, so it is likely that some other environmental 
factor (e.g. a VM change?) between your previous v7 and v9 measurements 
was the culprit for the differences.

Maurizio

On 25/03/2019 20:38, Vicente Romero wrote:
> Hi Maurizio,
>
> I have made another run of V7 and V9 and compared them, please see 
> [1]. This time I did 5 warm-up and 20 measurement iterations in order 
> to obtain more data points and reduce the standard deviation. 
> Eyeballing the results, it seemed to me that the difference came from 
> the fact that, while the measurements for both intrinsified versions 
> were pretty similar across the two experiments, there was higher 
> variability in the measurements for vanilla JDK13. This seems to imply 
> that vanilla JDK13 was more sensitive to the changes in the benchmark 
> than either of the intrinsified counterparts. To show this I added 
> several columns, named Normalized_Diff, with the difference between 
> the two previous columns normalized to the (0, 1] interval: the closer 
> to 1, the more similar the values. The readings of these columns show 
> that the biggest discrepancies are precisely the vanilla JDK13 rows, 
> where the normalized difference is farthest from 1.
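>
> To make this concrete, the normalization is essentially the following 
> (a sketch; the actual spreadsheet formula may be written differently):
>
> // Normalized_Diff sketch: for two scores a and b, min(a, b) / max(a, b)
> // lies in (0, 1]; values near 1 mean the two measurements agree, values
> // far from 1 mean they diverge.
> static double normalizedDiff(double a, double b) {
>     return Math.min(a, b) / Math.max(a, b);
> }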
>
> Thanks,
> Vicente
>
> [1] 
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>
> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>> Hi Vicente,
>> honestly I wouldn't call these differences 'wild'. Looking at the 
>> first column, for the first bench you go from a minimum score of 
>> 238982 to a maximum score of 270472. The difference between the two is 
>> ~31000, which is around 13% of the measured value - a level of noise 
>> comparable with the margin of error.
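>>
>> As a back-of-the-envelope check (a sketch, using the two scores above):
>>
>> // Relative spread between the min and max score of the same benchmark
>> // across runs: (270472 - 238982) / 238982 ~= 0.13, i.e. roughly 13%.
>> double min = 238982, max = 270472;
>> double relativeSpread = (max - min) / min;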
>>
>> What I can't explain is the difference between these two tables:
>>
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>>
>> and
>>
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>>
>> The latter is markedly faster than the former - and not by a mere 
>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the 
>> first table it seems like we struggle to go past 1x (sometimes we're 
>> even lower), while in the second table we get 2x pretty much across 
>> the board. I think this is way beyond the level of variance you 
>> observed, and it is something that should be investigated (perhaps 
>> with the help of some JIT guru?). In general, all the tables you have 
>> shared are more similar to v9 than to v7 - but I'm worried because, 
>> if the numbers in v7 are to be taken seriously, the speedup shown 
>> there is nothing to write home about.
>>
>> Another (smaller) thing I fail to understand is how the mixed 
>> String/int case sometimes seems to be faster than the separate String 
>> or int case (or, in most cases, close to the better speedup of the 
>> two). An important point here is that mixed StringInt row #1 has to 
>> be compared with row #2 of the separate int/String benchmarks 
>> (because the mixed bench takes two arguments). I would have expected 
>> the mixed test to fall somewhere in between the int and String 
>> results (since one is definitely faster and the other seems mostly 
>> on par) - but that doesn't seem to be the case.
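>>
>> To be explicit about the shapes I'm comparing (hypothetical names, and 
>> Objects.hash used only as a stand-in for the intrinsified call - I 
>> don't have the benchmark source in front of me):
>>
>> import java.util.Objects;
>> import org.openjdk.jmh.annotations.Benchmark;
>> import org.openjdk.jmh.annotations.Scope;
>> import org.openjdk.jmh.annotations.State;
>>
>> @State(Scope.Benchmark)
>> public class MixedVsSeparateSketch {
>>     int i1 = 17, i2 = 42;
>>     String s1 = "foo";
>>
>>     @Benchmark public int int0001()       { return Objects.hash(i1); }     // 1 arg
>>     @Benchmark public int int0002()       { return Objects.hash(i1, i2); } // 2 args
>>     @Benchmark public int stringInt0001() { return Objects.hash(s1, i1); } // 2 args, mixed
>> }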
>>
>> Maurizio
>>
>> On 20/03/2019 01:56, Vicente Romero wrote:
>>> Hi,
>>>
>>> Today at the amber meeting there was a debate about why some JMH 
>>> results vary between executions. First of all, these are experiments, 
>>> and in any experiment there is a certain degree of variability that 
>>> can't be controlled.
>>>
>>> So I took the last version of our beloved IntrinsicsBenchmark class, 
>>> see [1], and made some changes to it: basically, every field is now 
>>> initialized with a different literal, as proposed by Maurizio. Then 
>>> I took only the `int`-oriented tests, the first 9 in the benchmark, 
>>> and executed them a number of times. All executions were done with 
>>> vanilla JDK13 - no intrinsics at play here!
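>>>
>>> The change is along these lines (a simplified sketch, not the actual 
>>> IntrinsicsBenchmark source; Objects.hash stands in for the call being 
>>> measured):
>>>
>>> import java.util.Objects;
>>> import org.openjdk.jmh.annotations.Benchmark;
>>> import org.openjdk.jmh.annotations.Scope;
>>> import org.openjdk.jmh.annotations.State;
>>>
>>> @State(Scope.Benchmark)
>>> public class DistinctLiteralsSketch {
>>>     // before: every field initialized with the same literal
>>>     // after:  a different literal per field, as proposed by Maurizio
>>>     int i1 = 11, i2 = 23, i3 = 37, i4 = 41;
>>>
>>>     @Benchmark
>>>     public int testHash0004Int() {
>>>         return Objects.hash(i1, i2, i3, i4);
>>>     }
>>> }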
>>>
>>> First I executed the tests 3 times under the same conditions: 3 
>>> warm-up iterations and 5 measurement iterations. See the wild 
>>> differences in the score and error columns [2], mostly for the tests 
>>> with the smallest number of arguments; I have no explanation for 
>>> this. Then I started playing with both parameters, warm-up and 
>>> measurement iterations, and executed another 4 experiments. As 
>>> expected, it seems we should trust the experiments executed with more 
>>> iterations, both warm-up and measurement, as the error gets reduced. 
>>> But the variability is still high, and it is possible to find 
>>> non-overlapping intervals for the results of the same experiment even 
>>> when the error is small - see, for example, the 
>>> `IntrinsicsBenchmark.testHash0040Int` row in Experiment 6 versus the 
>>> same row in Experiment 7. So should we interpret these numbers 
>>> qualitatively too? Are we running the right number of experiments 
>>> with the right parameters? I have more questions than answers :)
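>>>
>>> For reference, the knobs varied between experiments are just the JMH 
>>> warm-up/measurement iteration counts; a sketch of how one of these 
>>> runs is launched (the include pattern and fork count are illustrative):
>>>
>>> import org.openjdk.jmh.runner.Runner;
>>> import org.openjdk.jmh.runner.RunnerException;
>>> import org.openjdk.jmh.runner.options.Options;
>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>
>>> public class RunIntExperiments {
>>>     public static void main(String[] args) throws RunnerException {
>>>         Options opts = new OptionsBuilder()
>>>                 .include("IntrinsicsBenchmark.*Int")  // the int-oriented tests
>>>                 .warmupIterations(3)                  // varied per experiment
>>>                 .measurementIterations(5)             // varied per experiment
>>>                 .forks(1)                             // illustrative
>>>                 .build();
>>>         new Runner(opts).run();
>>>     }
>>> }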
>>>
>>> Thanks for listening,
>>> Vicente
>>>
>>> PS, OK reading
>>>
>>> [1] 
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>> [2] 
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>

