how reliable JMH's results are
Vicente Romero
vicente.romero at oracle.com
Mon Mar 25 22:08:23 UTC 2019
On 3/25/19 5:40 PM, Maurizio Cimadamore wrote:
> Yep - they do look similar, so it is likely that between your previous
> v7 and v9 measurements some other environmental factor (e.g. a VM
> change?) was the culprit for the differences.
Nope, I have been using the same builds all the time since then. The
only difference between now and before is that now I have more data, so
the results are more reliable.
>
> Maurizio
Vicente
>
> On 25/03/2019 20:38, Vicente Romero wrote:
>> Hi Maurizio,
>>
>> I have made another run of v7 and v9 and compared them; please see
>> [1]. This time I did 5 warm-up and 20 measurement iterations in
>> order to obtain more data points and reduce the standard deviation.
>> By eyeballing the results, it seemed to me that the difference comes
>> from the fact that, while the measurements for both intrinsified
>> versions were pretty similar across the two experiments, there was
>> higher variability in the measurements for vanilla JDK13. This seems
>> to imply that vanilla JDK13 was more sensitive to the changes in the
>> benchmark than any of the intrinsified counterparts.
>> To prove this I added several columns, named Normalized_Diff, showing
>> the difference between the two previous columns normalized to the
>> (0, 1] interval: the closer to 1, the more similar the values. Reading
>> these columns, the biggest discrepancy is observed where the
>> normalized difference for vanilla JDK13 is farthest from 1.
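>>
>> (Schematically - this is a simplified sketch, not the actual benchmark
>> code - the iteration counts can be pinned on the benchmark class with
>> JMH annotations, and one way to build a (0, 1] normalization where 1
>> means identical scores is the min/max ratio of the two scores. The
>> class, method, and helper names below are placeholders.)
>>
>>     import org.openjdk.jmh.annotations.*;
>>
>>     @Warmup(iterations = 5)        // 5 warm-up iterations
>>     @Measurement(iterations = 20)  // 20 measurement iterations
>>     @Fork(1)
>>     @State(Scope.Thread)
>>     public class IterationSketch {
>>
>>         @Benchmark
>>         public int dummyBench() {
>>             return 42;  // stand-in for a real benchmark body
>>         }
>>
>>         // one way to compute something like the Normalized_Diff
>>         // columns: min/max of two scores is always in (0, 1] and
>>         // equals 1 only when the scores are identical
>>         static double normalizedDiff(double a, double b) {
>>             return Math.min(a, b) / Math.max(a, b);
>>         }
>>     }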
>>
>> Thanks,
>> Vicente
>>
>> [1]
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>>
>> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>>> Hi Vicente,
>>> honestly I wouldn't call these differences 'wild'. Looking at the
>>> first column, for the first bench you go from a minimum score of
>>> 238982 to a max score of 270472. The difference between the two is
>>> ~30000, which is around 12% of the measured value, and seems to be a
>>> level of noise comparable with the margin of error.
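>>>
>>> (For reference, the arithmetic spelled out:
>>>     270472 - 238982 = 31490
>>>     31490 / 270472 ≈ 0.12, i.e. roughly 12% of the larger score.)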
>>>
>>> What I can't explain is the difference between these two tables:
>>>
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>>>
>>>
>>> and
>>>
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>>>
>>>
>>> The latter is markedly faster than the former - and not by a mere
>>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the first
>>> table it seems like we struggle to go past 1x (sometimes we're even
>>> lower), while in the second table we get 2x pretty much all over the
>>> place. I think this is way beyond the level of variance you
>>> observed, and this is something that should be investigated (perhaps
>>> with the help of some JIT guru?). In general all the tables you have
>>> shared are more similar to v9 than to v7 - but I'm worried because
>>> if the numbers in v7 are to be taken seriously, the speedup shown
>>> there doesn't look like anything to write home about.
>>>
>>> Another (smaller) thing I fail to understand is how the mixed
>>> String/int case seems to be sometimes faster than the separate
>>> String or int case (or, in most cases, close to the best speedup
>>> between the two). Here, an important point to notice is that mixed
>>> StringInt #1 has to be compared with row #2 in the separate
>>> int/String benchmarks (because the mixed bench takes two arguments).
>>> I would have expected the mixed test to fall somewhere in between the
>>> int and String results (since one is definitely faster and the other
>>> seems mostly on par) - but that doesn't seem to be the case.
>>>
>>> Maurizio
>>>
>>> On 20/03/2019 01:56, Vicente Romero wrote:
>>>> Hi,
>>>>
>>>> Today at the amber meeting there was a debate about why some JMH
>>>> results vary between executions. Well, first of all, these are
>>>> experiments, and in all experiments there is a certain degree of
>>>> variability that can't be controlled.
>>>>
>>>> So I took the latest version of our beloved IntrinsicsBenchmark
>>>> class, see [1], and made some changes to it: basically, every field
>>>> has been initialized with a different literal, as proposed by
>>>> Maurizio. Then I took only the `int`-oriented tests, the first 9 in
>>>> the benchmark, and executed them a number of times. All executions
>>>> were done with vanilla JDK13 - no intrinsics at play here!
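>>>>
>>>> Schematically, the change is along these lines (simplified - the
>>>> field names and values below are placeholders, not the actual
>>>> IntrinsicsBenchmark fields):
>>>>
>>>>     import org.openjdk.jmh.annotations.Scope;
>>>>     import org.openjdk.jmh.annotations.State;
>>>>
>>>>     @State(Scope.Thread)
>>>>     public class IntrinsicsState {
>>>>         // every field gets its own literal instead of all of them
>>>>         // sharing the same value
>>>>         int i1 = 1;
>>>>         int i2 = 2;
>>>>         int i3 = 3;
>>>>         int i4 = 4;
>>>>         // ... and so on for the remaining int and String fields
>>>>     }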
>>>>
>>>> First I executed the tests 3 times with the same conditions: 3
>>>> warm-up iterations and 5 iterations for measurements. See the wild
>>>> differences in the score and the error columns [2], mostly for the
>>>> tests with the smallest number of arguments. I have no explanation
>>>> for this. Then I started playing with both parameters, warm-up and
>>>> measurement iterations, and executed another 4 experiments. As
>>>> expected, it seems like we should put more trust in the experiments
>>>> executed with more iterations, both for warm-up and measurement,
>>>> as the errors get smaller. But the variability is still high, and
>>>> it's possible to find non-overlapping intervals for the same
>>>> benchmark across experiments even when the error is small. See, for
>>>> example, the `IntrinsicsBenchmark.testHash0040Int` row in Experiment
>>>> 6 versus the same row in Experiment 7. So should we interpret these
>>>> numbers qualitatively too? Are we running the right number of
>>>> experiments with the right parameters? I have more questions than
>>>> answers :)
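>>>>
>>>> For what it's worth, a sweep like that can also be driven
>>>> programmatically through the JMH Runner API; the benchmark regexp
>>>> and the iteration counts below are just examples, not the exact
>>>> configurations I ran:
>>>>
>>>>     import org.openjdk.jmh.runner.Runner;
>>>>     import org.openjdk.jmh.runner.RunnerException;
>>>>     import org.openjdk.jmh.runner.options.Options;
>>>>     import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>>
>>>>     public class IterationSweep {
>>>>         public static void main(String[] args) throws RunnerException {
>>>>             // {warm-up iterations, measurement iterations} per experiment
>>>>             int[][] configs = { {3, 5}, {5, 10}, {5, 20} };
>>>>             for (int[] c : configs) {
>>>>                 Options opts = new OptionsBuilder()
>>>>                         .include("IntrinsicsBenchmark.testHash.*Int")
>>>>                         .warmupIterations(c[0])
>>>>                         .measurementIterations(c[1])
>>>>                         .forks(1)
>>>>                         .build();
>>>>                 new Runner(opts).run();
>>>>             }
>>>>         }
>>>>     }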
>>>>
>>>> Thanks for listening,
>>>> Vicente
>>>>
>>>> PS, OK reading
>>>>
>>>> [1]
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>>> [2]
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>>