how reliable JMH's results are
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Mar 25 21:40:08 UTC 2019
Yep - they do look similar, so it is likely that some other
environmental factor (e.g. a VM change?) was the culprit for the
differences between your previous v7 and v9 measurements.
Maurizio
On 25/03/2019 20:38, Vicente Romero wrote:
> Hi Maurizio,
>
> I have made another run of V7 and V9 and compared them, please see
> [1]. This time I did 5 warm-up and 20 measurement iterations in order
> to obtain more data points and reduce the standard deviation. By
> eyeballing the results, it seemed to me that the difference came from
> the fact that, while the measurements for the two intrinsified
> versions were pretty similar across the two experiments, there was
> higher variability in the measurements for vanilla JDK13. This seems
> to imply that vanilla JDK13 was more sensitive to the changes in the
> benchmark than either of the intrinsified counterparts. To check this
> I added several columns, named Normalized_Diff, giving the difference
> between the two previous columns normalized to the (0, 1] interval:
> the closer to 1, the more similar the values. The readings of these
> columns show that the biggest discrepancies are observed where the
> normalized difference is farthest from 1, and that happens for
> vanilla JDK13.
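>
> Conceptually, a value like that can be obtained as the ratio of the
> smaller score to the larger one; a minimal sketch of the idea
> (assuming that simple min/max formulation, which may not be exactly
> what the table uses):
>
>     // Normalizes the gap between two benchmark scores into (0, 1]:
>     // identical scores give 1.0, large gaps tend towards 0.
>     final class Scores {
>         static double normalizedDiff(double scoreA, double scoreB) {
>             return Math.min(scoreA, scoreB) / Math.max(scoreA, scoreB);
>         }
>     }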
>
> Thanks,
> Vicente
>
> [1]
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>
> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>> Hi Vicente,
>> honestly I wouldn't call these differences 'wild'. Looking at the
>> first column, for the first bench you go from a minimum score of
>> 238982 to a max score of 270472. The difference between the two is
>> ~31500, which is around 13% of the measured value; that seems to be
>> a level of noise comparable with the margin of error.
>> What I can't explain is the difference between these two tables:
>>
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>>
>>
>> and
>>
>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>>
>>
>> The latter is markedly faster than the former - and not by a mere
>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the
>> first table it seems like we struggle to go past 1x (sometimes we're
>> even lower), while in the second table we get 2x pretty much all
>> over the place. I think this is way beyond the level of variance you
>> observed, and it is something that should be investigated (perhaps
>> with the help of some JIT guru?). In general, all the tables you
>> have shared are more similar to v9 than to v7 - but I'm worried
>> because, if the numbers in v7 are to be taken seriously, the speedup
>> shown there isn't anything to write home about.
>>
>> Another (smaller) thing I fail to understand is how the mixed
>> String/int case sometimes seems to be faster than the separate
>> String or int case (or, in most cases, close to the better speedup
>> of the two). Here, an important point to notice is that mixed
>> StringInt #1 has to be compared with row #2 in the separate
>> int/String benchmarks (because the mixed bench takes two arguments).
>> I would have expected the mixed test to fall somewhere in between
>> the int and String results (since one is definitely faster and the
>> other seems mostly on par) - but that doesn't seem to be the case.
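>>
>> To make that comparison concrete, here is a rough sketch of the two
>> shapes involved, assuming (as the testHash* names suggest) that the
>> benchmarks exercise Objects.hash; the field and method names below
>> are made up, the real ones are in the linked tables:
>>
>>     import java.util.Objects;
>>     import org.openjdk.jmh.annotations.Benchmark;
>>     import org.openjdk.jmh.annotations.Scope;
>>     import org.openjdk.jmh.annotations.State;
>>
>>     @State(Scope.Thread)
>>     public class MixedVsSeparateSketch {
>>         int i1 = 17;
>>         int i2 = 42;
>>         String s1 = "foo";
>>
>>         @Benchmark
>>         public int separateInt2() {       // two int arguments (row #2)
>>             return Objects.hash(i1, i2);
>>         }
>>
>>         @Benchmark
>>         public int mixedStringInt1() {    // one String + one int (mixed #1)
>>             return Objects.hash(s1, i1);
>>         }
>>     }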
>>
>> Maurizio
>>
>> On 20/03/2019 01:56, Vicente Romero wrote:
>>> Hi,
>>>
>>> Today at the amber meeting there was a debate about why some JMH
>>> results vary between executions. Well, first of all, these are
>>> experiments, and in all experiments there is a certain degree of
>>> variability that can't be controlled.
>>>
>>> So I took the last version of our beloved IntrinsicsBenchmark class
>>> (see [1]) and made some changes to it: basically, every field has
>>> been initialized with a different literal, as proposed by Maurizio.
>>> Then I took only the `int` oriented tests, the first 9 in the
>>> benchmark, and executed them a number of times. All executions were
>>> done with vanilla JDK13, so no intrinsics at play here!
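>>>
>>> A minimal sketch of the kind of test I mean, assuming (as the
>>> testHash* names suggest) that these benchmarks exercise
>>> Objects.hash; the field and method names here are made up, the real
>>> code is in [1]:
>>>
>>>     import java.util.Objects;
>>>     import org.openjdk.jmh.annotations.Benchmark;
>>>     import org.openjdk.jmh.annotations.Scope;
>>>     import org.openjdk.jmh.annotations.State;
>>>
>>>     @State(Scope.Thread)
>>>     public class IntHashSketch {
>>>         // every field initialized with a different literal
>>>         int i1 = 1, i2 = 7, i3 = 13, i4 = 42;
>>>
>>>         @Benchmark
>>>         public int testHash0002Int() {
>>>             return Objects.hash(i1, i2);
>>>         }
>>>
>>>         @Benchmark
>>>         public int testHash0004Int() {
>>>             return Objects.hash(i1, i2, i3, i4);
>>>         }
>>>     }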
>>>
>>> First I executed the tests 3 times under the same conditions: 3
>>> warm-up iterations and 5 measurement iterations. See the wild
>>> differences in the score and error columns [2], mostly for the
>>> tests with the smallest number of arguments. I have no explanation
>>> for this. Then I started playing with both parameters, warm-up and
>>> measurement iterations, and executed another 4 experiments. As
>>> expected, it seems we should put more trust in the experiments
>>> executed with more iterations, both for warm-up and measurement, as
>>> the errors get reduced. But the variability is still high, and it's
>>> possible to find non-overlapping intervals for the results of the
>>> same experiment even when the error is small. See for example the
>>> row `IntrinsicsBenchmark.testHash0040Int` in Experiment 6 and the
>>> same row in Experiment 7. So should we interpret these numbers
>>> qualitatively too? Are we running the right number of experiments
>>> with the right parameters? I have more questions than answers :)
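>>>
>>> In case anyone wants to reproduce this, the warm-up and measurement
>>> iteration counts can be changed without touching the benchmark
>>> itself; a minimal sketch using the JMH runner API (the include
>>> pattern and fork count below are just placeholders):
>>>
>>>     import org.openjdk.jmh.runner.Runner;
>>>     import org.openjdk.jmh.runner.options.Options;
>>>     import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>
>>>     public class RunIntrinsicsBench {
>>>         public static void main(String[] args) throws Exception {
>>>             Options opt = new OptionsBuilder()
>>>                     .include("IntrinsicsBenchmark.testHash.*Int") // placeholder pattern
>>>                     .warmupIterations(5)        // warm-up iterations per fork
>>>                     .measurementIterations(20)  // measurement iterations per fork
>>>                     .forks(1)                   // assumed; more forks help average out noise
>>>                     .build();
>>>             new Runner(opt).run();
>>>         }
>>>     }
>>>
>>> (Equivalently, the -wi, -i and -f flags can be passed on the JMH
>>> command line.)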
>>>
>>> Thanks for listening,
>>> Vicente
>>>
>>> PS, OK reading
>>>
>>> [1]
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>> [2]
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>