how reliable JMH's results are
Vicente Romero
vicente.romero at oracle.com
Tue Mar 26 12:38:56 UTC 2019
On 3/26/19 6:52 AM, Maurizio Cimadamore wrote:
> Ok. That rules out a factor.
>
> But I guess I'm not 100% convinced that it's down to variance. Look
> again at
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v7.html
>
>
> vs.
>
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>
>
> What I find disturbing, specifically, is lines like these (mixed int
> and String, row #1):
>
> Intrinsified score v7: 131417
> Intrinsified (folding) score v9: 216897
>
> This same result is essentially 2x in v9 what it was in v7. First of all:
>
> * am I right in comparing the 'intrinsified' column in v7 with the
> 'intrinsified folding' column in v9?
the "Intrinsified" column in V7 corresponds to the "Intrinsified
Filtering" column in V9, sorry about the name change, it could be confusing
> * while statistical noise is always a factor, a consistent 2x factor
> across all rows doesn't look like variance
>
> The results in your latest bench are in line with v9 - that is, the
> 'intrinsified folding' seems around 2x faster than in v7. Which leaves
> me with one explanation (excluding environmental factors such as CPU
> throttling and the like): has the JDK intrinsified build changed from
> v7 to v9? That is, is this an apples-to-apples comparison, or have
> some changes been made to the bootstrap methods too, which ended up
> speeding up the bench?
the only difference here is the benchmark; the builds used are exactly
the same. I also think the original data you are referring to have a
lot of noise, so I would trust the table I published yesterday more.
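
(For context, "more data" just means larger JMH iteration counts, e.g.
the 5 warm-up / 20 measurement iterations behind yesterday's table. A
minimal sketch of those knobs, not the actual IntrinsicsBenchmark class:)

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5)        // 5 warm-up iterations, as in yesterday's table
@Measurement(iterations = 20)  // 20 measured iterations -> smaller reported error
@Fork(1)                       // illustrative; one forked JVM per benchmark
@State(Scope.Thread)
public class IterationSettingsSketch {

    int i1 = 17;  // illustrative fields, each initialized with a different literal
    int i2 = 42;

    @Benchmark
    public String testTwoInts() {
        // placeholder body; the real benchmarks compare intrinsified vs.
        // non-intrinsified code paths
        return "" + i1 + i2;
    }
}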
>
> Maurizio
Vicente
>
>
> On 25/03/2019 22:08, Vicente Romero wrote:
>>
>>
>> On 3/25/19 5:40 PM, Maurizio Cimadamore wrote:
>>> Yep - they do look similar, so it is likely that some other
>>> environmental factor (e.g. a VM change?) between your previous v7
>>> and v9 measurements was the culprit for the differences.
>>
>> nope, I have been using the same builds all the time since then. The
>> only difference between now and before is that now I have more data,
>> so the results are more reliable
>>
>>>
>>> Maurizio
>>
>> Vicente
>>
>>>
>>> On 25/03/2019 20:38, Vicente Romero wrote:
>>>> Hi Maurizio,
>>>>
>>>> I have made another run of V7 and V9 and compared them, please see
>>>> [1]. This time I did 5 warm-up and 20 measurement iterations in
>>>> order to obtain more data points and reduce the standard deviation.
>>>> Eyeballing the results, it seemed to me that the difference comes
>>>> from the fact that, while the measurements for both intrinsified
>>>> versions were pretty similar across the two experiments, there was
>>>> higher variability in the measurements for vanilla JDK13. This
>>>> seems to imply that vanilla JDK13 is more sensitive to the changes
>>>> in the benchmark than either of the intrinsified counterparts. To
>>>> show this I added several columns, named Normalized_Diff, giving
>>>> the difference between the two previous columns normalized to the
>>>> (0, 1] interval. The closer to 1, the more similar the values. The
>>>> readings of these columns show that the biggest discrepancies are
>>>> observed where the normalized difference is farthest from 1 for
>>>> vanilla JDK13.
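
(In other words, Normalized_Diff is presumably just min(a, b) / max(a, b)
of the two scores being compared; identical scores give 1.0. A small
sketch of that assumed computation, using two score pairs from this thread:)

// Sketch of the assumed Normalized_Diff computation: ratio of the smaller
// score to the larger one, which always falls in the (0, 1] interval and
// equals 1.0 when the two measurements agree exactly.
public class NormalizedDiffSketch {

    static double normalizedDiff(double scoreA, double scoreB) {
        return Math.min(scoreA, scoreB) / Math.max(scoreA, scoreB);
    }

    public static void main(String[] args) {
        System.out.println(normalizedDiff(131417, 216897)); // ~0.61: large discrepancy
        System.out.println(normalizedDiff(238982, 270472)); // ~0.88: ordinary run-to-run noise
    }
}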
>>>>
>>>> Thanks,
>>>> Vicente
>>>>
>>>> [1]
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>>>>
>>>> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>>>>> Hi Vicente,
>>>>> honestly I wouldn't call these differences 'wild'. Looking at the
>>>>> first column, for the first bench you go from a minimum score of
>>>>> 238982 to a max score of 270472. The difference between the two is
>>>>> ~30000, which is around 15% of the measured value, which seems to
>>>>> be a level of noise comparable with the margin of error.
>>>>>
>>>>> What I can't explain is the difference between these two tables:
>>>>>
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>>>>>
>>>>>
>>>>> and
>>>>>
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>>>>>
>>>>>
>>>>> The latter is markedly faster than the former - and not by a mere
>>>>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the
>>>>> first table it seems like we struggle to go past 1x (sometimes
>>>>> we're even lower), while in the second table we get 2x pretty much
>>>>> all over the place. I think this is way beyond the level of
>>>>> variance you observed, and it is something that should be
>>>>> investigated (perhaps with the help of some JIT guru?). In general
>>>>> all the tables you have shared are more similar to v9 than to v7 -
>>>>> but I'm worried because, if the numbers in v7 are to be taken
>>>>> seriously, the speedup shown there doesn't look like anything to
>>>>> write home about.
>>>>>
>>>>> Another (smaller) thing I fail to understand is how the mixed
>>>>> String/int case sometimes seems to be faster than the separate
>>>>> String or int case (or, in most cases, close to the best speedup
>>>>> of the two). Here, an important point to notice is that mixed
>>>>> StringInt #1 has to be compared with row #2 in the separate
>>>>> int/String benchmarks (because the mixed bench takes two
>>>>> arguments). I would have expected the mixed test to fall somewhere
>>>>> in between the int and String results (since one is definitely
>>>>> faster and the other seems mostly on par) - but that doesn't seem
>>>>> to be the case.
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 20/03/2019 01:56, Vicente Romero wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Today at the amber meeting there was a debate about why some JMH
>>>>>> results vary between executions. Well, first of all, these are
>>>>>> experiments, and in all experiments there is a certain degree of
>>>>>> variability that can't be controlled.
>>>>>>
>>>>>> So I got the latest version of our beloved IntrinsicsBenchmark
>>>>>> class, see [1], and made some changes to it: basically every
>>>>>> field has been initialized with a different literal, as proposed
>>>>>> by Maurizio. Then I took only the `int`-oriented tests, the first
>>>>>> 9 in the benchmark, and executed them a number of times. All
>>>>>> executions were done with vanilla JDK13, so no intrinsics at play
>>>>>> here!
>>>>>>
>>>>>> First I executed the tests 3 times under the same conditions: 3
>>>>>> warm-up iterations and 5 measurement iterations. See the wild
>>>>>> differences in the score and error columns [2], mostly for the
>>>>>> tests with the smallest number of arguments. I have no
>>>>>> explanation for this. Then I started playing with both
>>>>>> parameters, warm-up and measurement iterations, and executed
>>>>>> another 4 experiments. As expected, it seems we should put more
>>>>>> trust in the experiments executed with more iterations, both
>>>>>> warm-up and measurement, as the errors get smaller. But the
>>>>>> variability is still high, and it's possible to find
>>>>>> non-overlapping intervals for the results of the same experiment
>>>>>> even when the error is small. See for example the
>>>>>> `IntrinsicsBenchmark.testHash0040Int` row in Experiment 6 and the
>>>>>> same row in Experiment 7. So should we interpret these numbers
>>>>>> qualitatively too? Are we running the right number of experiments
>>>>>> with the right parameters? I have more questions than answers :)
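
(For the record, varying those two parameters per experiment is just a
matter of passing different warm-up/measurement iteration counts to the
JMH runner, either via the -wi and -i command-line flags or
programmatically; a rough sketch, with an illustrative include pattern:)

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunIntExperiments {
    public static void main(String[] args) throws RunnerException {
        // Same idea as "java -jar benchmarks.jar <pattern> -wi 3 -i 5":
        // override the per-benchmark warm-up and measurement iteration counts.
        Options opts = new OptionsBuilder()
                .include("IntrinsicsBenchmark.testHash.*Int")  // illustrative pattern
                .warmupIterations(3)       // varied across the experiments
                .measurementIterations(5)  // varied across the experiments
                .build();
        new Runner(opts).run();
    }
}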
>>>>>>
>>>>>> Thanks for listening,
>>>>>> Vicente
>>>>>>
>>>>>> PS, OK reading
>>>>>>
>>>>>> [1]
>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>>>>> [2]
>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>>>>
>>