how reliable JMH's results are
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Mar 26 10:52:04 UTC 2019
OK. That rules out one factor.
But I guess I'm not 100% convinced that it's down to variance. Look again at
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v7.html
vs.
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
What I find disturbing, specifically, is lines like these (mixed int and
String, row #1):

Intrinsified score v7: 131417
Intrinsified (folding) score v9: 216897
This same result is essentially 2x in v9 what it was in v7. First of all:
* am I right in comparing the 'intrinsified' column in v7 with the
'intrinsified folding' column in v9?
* while statistical noise is always a factor, a consistent 2x factor
across all rows doesn't look like variance (see the quick check sketched
below)
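To make that more than a gut feeling, one quick check is whether the
score +/- error intervals reported by JMH for the two runs overlap at
all. A rough sketch (the error values below are made up, just to show
the shape of the check - plug in the ones actually reported for v7 and
v9):

    // crude check: do two JMH results (score +/- error) overlap at all?
    public class OverlapCheck {
        static boolean overlaps(double s1, double e1, double s2, double e2) {
            return s1 - e1 <= s2 + e2 && s2 - e2 <= s1 + e1;
        }

        public static void main(String[] args) {
            // mixed int/String, row #1: v7 intrinsified vs. v9 intrinsified
            // (folding); 2000/3000 are placeholder errors, not reported values
            System.out.println(overlaps(131417, 2000, 216897, 3000)
                    ? "within each other's margins"
                    : "difference exceeds the (placeholder) error margins");
        }
    }

If the two intervals don't even come close to touching once the real
errors are plugged in, variance alone is a hard sell.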
The results in your latest bench are in line with v9 - that is, the
'intrinsified folding' column seems around 2x faster than in v7. Which
leaves me with one explanation (excluding environmental factors such as
CPU throttling and the like): has the JDK intrinsified build changed
from v7 to v9? That is, is this an apples-to-apples comparison, or have
some changes been made to the bootstrap methods too, which ended up
speeding up the bench?
Maurizio
On 25/03/2019 22:08, Vicente Romero wrote:
>
>
> On 3/25/19 5:40 PM, Maurizio Cimadamore wrote:
>> Yep - they do look similar, so it is likely that some other
>> environmental factor (e.g. a VM change?) was the culprit for the
>> differences between your previous v7 and v9 measurements.
>
> nope, I have been using the same builds all the time since then. The
> only difference between now and before is that I now have more data,
> so the results are more reliable
>
>>
>> Maurizio
>
> Vicente
>
>>
>> On 25/03/2019 20:38, Vicente Romero wrote:
>>> Hi Maurizio,
>>>
>>> I have made another run of V7 and V9 and compared them, please see
>>> [1]. This time I did 5 warm-up and 20 measurement iterations in
>>> order to obtain more data points and reduce the standard deviation.
>>> By eyeballing the results, it seemed to me that the reason for the
>>> difference is that, while the measurements for both intrinsified
>>> versions were pretty similar across the two experiments, there was
>>> higher variability in the measurements for vanilla JDK13. This seems
>>> to imply that vanilla JDK13 was more sensitive to the changes in the
>>> benchmark than any of the intrinsified counterparts. To support this
>>> I added several columns, named Normalized_Diff, which show the
>>> difference between the two previous columns normalized to the (0, 1]
>>> interval. The closer to 1, the more similar the values. These
>>> columns show that the biggest discrepancies are observed where the
>>> normalized difference for vanilla JDK13 is farthest from 1.
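Just to double-check I'm reading the new columns right: I assume
Normalized_Diff is simply the min/max ratio of the two scores being
compared, i.e. something along these lines (the method name is mine,
not from the spreadsheet):

    // my reading of the Normalized_Diff columns: the ratio of the smaller
    // score to the larger one, which always falls in (0, 1] and is exactly 1
    // when the two scores are identical
    static double normalizedDiff(double scoreA, double scoreB) {
        return Math.min(scoreA, scoreB) / Math.max(scoreA, scoreB);
    }

Is that the computation?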
>>>
>>> Thanks,
>>> Vicente
>>>
>>> [1]
>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v7_vs_v9/benchmarkResults_intrinsics_experiment_comparison_v7_vs_v9.html
>>>
>>> On 3/21/19 7:16 AM, Maurizio Cimadamore wrote:
>>>> Hi Vicente,
>>>> honestly I wouldn't call these differences 'wild'. Looking at the
>>>> first column, for the first bench you go from a minimum score of
>>>> 238982 to a max score of 270472. The difference between the two is
>>>> ~30000, which is around 13% of the measured value - a level of
>>>> noise comparable with the margin of error.
>>>>
>>>> What I can't explain is the difference between these two tables:
>>>>
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>>>>
>>>>
>>>> and
>>>>
>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
>>>>
>>>>
>>>> The latter is markedly faster than the former - and not by a mere
>>>> 12%, sometimes by a 2x factor. Look at Table 2: String - in the
>>>> first table it seems like we struggle to go past 1x (sometimes
>>>> we're even lower), while in the second table we get 2x pretty much
>>>> all over the place. I think this is way beyond the level of
>>>> variance you observed, and this is something that should be
>>>> investigated (perhaps with the help of some JIT guru?). In general
>>>> all the tables you have shared are more similar to v9 than to v7 -
>>>> but I'm worried because, if the numbers in v7 are to be taken
>>>> seriously, the speedup shown there doesn't look like anything to
>>>> write home about.
>>>>
>>>> Another (smaller) thing I fail to understand is how the mixed
>>>> String/int case sometimes seems to be faster than the separate
>>>> String or int case (or, in most cases, close to the better speedup
>>>> of the two). Here, an important point to notice is that mixed
>>>> StringInt #1 has to be compared with row #2 in the separate
>>>> int/String benchmarks (because the mixed bench takes two
>>>> arguments). I would have expected the mixed test to fall somewhere
>>>> in between the int and String results (since one is definitely
>>>> faster and the other seems mostly on par) - but that doesn't seem
>>>> to be the case.
>>>>
>>>> Maurizio
>>>>
>>>> On 20/03/2019 01:56, Vicente Romero wrote:
>>>>> Hi,
>>>>>
>>>>> Today at the amber meeting there was a debate about why some JMH
>>>>> results vary between executions. Well, first of all, these are
>>>>> experiments, and in all experiments there is a certain degree of
>>>>> variability that can't be controlled.
>>>>>
>>>>> So I took the latest version of our beloved IntrinsicsBenchmark
>>>>> class (see [1]) and made some changes to it: basically, every
>>>>> field is now initialized with a different literal, as proposed by
>>>>> Maurizio. Then I took only the `int`-oriented tests, the first 9
>>>>> in the benchmark, and executed them a number of times. All
>>>>> executions were done with vanilla JDK13 - no intrinsics at play
>>>>> here!
>>>>>
>>>>> First I executed the tests 3 times under the same conditions: 3
>>>>> warm-up and 5 measurement iterations. See the wild differences in
>>>>> the score and error columns [2], mostly for the tests with the
>>>>> smallest number of arguments. I have no explanation for this.
>>>>> Then I started playing with both parameters, warm-up and
>>>>> measurement iterations, and executed another 4 experiments. As
>>>>> expected, it seems like we should put more trust in the
>>>>> experiments executed with more iterations, both warm-up and
>>>>> measurement, as the errors get reduced. But the variability is
>>>>> still high, and it's possible to find non-overlapping intervals
>>>>> for the results of the same experiment even when the error is
>>>>> small. See for example the row
>>>>> `IntrinsicsBenchmark.testHash0040Int` in Experiment 6 and the same
>>>>> row in Experiment 7. So should we interpret these numbers
>>>>> qualitatively too? Are we running the right number of experiments
>>>>> with the right parameters? I have more questions than answers :)
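For anyone following along, the two knobs being varied across these
experiments are the standard JMH warm-up/measurement iteration settings.
A minimal sketch of how they map onto a benchmark class - the iteration
counts are the "3 warm-up / 5 measurement" setting of the first run,
and everything else (class name, fork count, mode, the placeholder body)
is illustrative and not taken from [1]:

    import org.openjdk.jmh.annotations.*;

    @Warmup(iterations = 3)        // warm-up iterations per fork
    @Measurement(iterations = 5)   // measured iterations per fork
    @Fork(1)                       // placeholder fork count
    @BenchmarkMode(Mode.Throughput)
    public class IntrinsicsBenchmarkSketch {

        @Benchmark
        public int testHash0040Int() {
            return 42; // placeholder body; the real benchmarks live in [1]
        }
    }

The same settings can also be overridden per run from the JMH command
line (-wi for warm-up iterations, -i for measurement iterations), which
is handy when sweeping them across experiments like this.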
>>>>>
>>>>> Thanks for listening,
>>>>> Vicente
>>>>>
>>>>> PS, OK reading
>>>>>
>>>>> [1]
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/IntrinsicsBenchmark.java
>>>>> [2]
>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/experiments/v1/benchmarkResults_intrinsics_experiments_v1.html
>>>
>