[intrinsics] performance improvements for the intrinsified version of Objects::hash
Brian Goetz
brian.goetz at oracle.com
Tue Mar 12 12:36:21 UTC 2019
In the real world, I would think the vast majority of cases are in the 2-10 argument range. So that's where we need to beat Objects::hash, if we’re going to do it at all.
> On Mar 6, 2019, at 1:05 PM, Vicente Romero <vicente.romero at oracle.com> wrote:
>
> Hi Hannes,
>
> Thanks for the results. Yes, the change I made was oriented more toward large numbers of arguments, so it makes sense that it is not as good for smaller numbers of arguments, which I agree should be the most common case. I think we are gathering a good case to bring to the next amber meeting, so that we can come to a decision on the Objects::hash area of the project. I think that there are two options:
>
> 1. do nothing: Objects::hash is not intrinsified at all
> 2. generate the fastest callsite for a small number of arguments and
> generate the legit code when the number of arguments is above a
> given threshold
>
> Comments?
>
> Vicente
>
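Option 2 above could be sketched roughly as follows. Everything named here is an assumption made for illustration: the class name, the SMALL_ARITY_LIMIT value, and the hand-unrolled hash1..hash3 helpers are hypothetical and are not taken from ObjectsBootstraps.java. The sketch only shows the dispatch on argument count; because the helpers take Object parameters, asType still boxes primitive arguments, so it does not reproduce whatever primitive specialization the real bootstrap performs. Above the limit it simply collects the arguments into an array and calls the ordinary Objects.hash.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

final class ThresholdHashLinker {
    // Hypothetical threshold; the thread does not fix a value.
    static final int SMALL_ARITY_LIMIT = 3;

    // Hand-unrolled equivalents of Arrays.hashCode(Object[]) for 1..3 elements:
    // result starts at 1 and each element contributes 31 * result + hashCode.
    static int hash1(Object a) {
        return 31 + Objects.hashCode(a);
    }

    static int hash2(Object a, Object b) {
        return 31 * hash1(a) + Objects.hashCode(b);
    }

    static int hash3(Object a, Object b, Object c) {
        return 31 * hash2(a, b) + Objects.hashCode(c);
    }

    /** Picks a target by argument count and adapts it to the call site type. */
    static CallSite link(MethodType type) throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        int n = type.parameterCount();
        MethodHandle target;
        if (n >= 1 && n <= SMALL_ARITY_LIMIT) {
            // Small arity: call the unrolled helper, no array is allocated.
            MethodType helperType = MethodType.genericMethodType(n).changeReturnType(int.class);
            target = lookup.findStatic(ThresholdHashLinker.class, "hash" + n, helperType);
        } else {
            // Otherwise: collect the n arguments into an Object[] for Objects.hash.
            target = lookup.findStatic(Objects.class, "hash",
                            MethodType.methodType(int.class, Object[].class))
                    .asCollector(Object[].class, n);
        }
        // asType boxes primitives and widens reference types as needed.
        return new ConstantCallSite(target.asType(type));
    }

    /** Convenience wrapper for trying the linker out; not part of any real API. */
    static int linkAndInvoke(MethodType type, Object... args) {
        try {
            return (int) link(type).dynamicInvoker().invokeWithArguments(args);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        MethodType small = MethodType.methodType(int.class, int.class, String.class);
        // Both paths are expected to agree with Objects.hash on the same values.
        System.out.println(linkAndInvoke(small, 42, "amber") == Objects.hash(42, "amber"));
    }
}
```

A real bootstrap method would also receive a MethodHandles.Lookup and a name from the JVM; those parameters are left out here to keep the dispatch logic visible.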
>
> On 3/6/19 12:48 PM, Hannes Wallnöfer wrote:
>> Vicente,
>>
>> I ran a number of the Objects::hash benchmarks ranging from 1 to 100 arguments with current intrinsics-project tip (a0a3f9977a7c) as well as the older version of ObjectsBootstraps.java (pre 0f40d5752eb9, see patch attached). The results I get show that with a single argument, performance is pretty much the same, but with any number of arguments between 2 and 50 the old version is about 2x faster. It’s only with 100 arguments that the old version becomes pathetically slow, while the new version still performs decently.
>>
>> I guess the change in implementation was based on too narrow a set of benchmarks, and in light of these results we should go back to the old implementation. Since invocations with > 50 arguments should be fairly uncommon, I guess a simple solution such as invoking the original method via a vararg method handle would be sufficient?
>>
>> Hannes
>>
>>
>>
>> ** intrinsics-project tip **
>>
>> Benchmark Mode Cnt Score Error Units
>> IntrinsicsBenchmark.testHash1IntLoop10 thrpt 5 41721,730 ± 2477,413 ops/ms
>> IntrinsicsBenchmark.testHash1StringLoop10 thrpt 5 35289,443 ± 1030,636 ops/ms
>> IntrinsicsBenchmark.testHash2Ints thrpt 5 21830,140 ± 852,642 ops/ms
>> IntrinsicsBenchmark.testHash2IntsLoop10 thrpt 5 19806,343 ± 1027,029 ops/ms
>> IntrinsicsBenchmark.testHash2Strings thrpt 5 18409,232 ± 676,311 ops/ms
>> IntrinsicsBenchmark.testHash2StringsLoop10 thrpt 5 16842,651 ± 664,615 ops/ms
>> IntrinsicsBenchmark.testHash5Int5Strings thrpt 5 9183,019 ± 409,206 ops/ms
>> IntrinsicsBenchmark.testHash5Int5StringsLoop10 thrpt 5 8349,475 ± 886,015 ops/ms
>> IntrinsicsBenchmark.testHash5Integers5Strings thrpt 5 9155,629 ± 335,129 ops/ms
>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10 thrpt 5 8171,108 ± 385,375 ops/ms
>> IntrinsicsBenchmark.testHash5doubles5Strings thrpt 5 9246,458 ± 502,251 ops/ms
>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10 thrpt 5 8419,192 ± 372,626 ops/ms
>> IntrinsicsBenchmark.testHash100Ints thrpt 5 796,460 ± 27,123 ops/ms
>> IntrinsicsBenchmark.testHash100IntsLoop10 thrpt 5 792,775 ± 23,402 ops/ms
>> IntrinsicsBenchmark.testHash100Strings thrpt 5 69,753 ± 45,947 ops/ms
>> IntrinsicsBenchmark.testHash100StringsLoop10 thrpt 5 619,774 ± 14,503 ops/ms
>> IntrinsicsBenchmark.testHash20Int20Strings thrpt 5 1669,774 ± 53,991 ops/ms
>> IntrinsicsBenchmark.testHash20Int20StringsLoop10 thrpt 5 1644,342 ± 62,946 ops/ms
>> IntrinsicsBenchmark.testHash20Integers20Strings thrpt 5 1651,428 ± 170,146 ops/ms
>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10 thrpt 5 1635,878 ± 62,007 ops/ms
>> IntrinsicsBenchmark.testHash20doubles20Strings thrpt 5 1673,927 ± 51,559 ops/ms
>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10 thrpt 5 1641,524 ± 52,172 ops/ms
>> IntrinsicsBenchmark.testHash25Int25Strings thrpt 5 1286,407 ± 27,920 ops/ms
>> IntrinsicsBenchmark.testHash25Int25StringsLoop10 thrpt 5 1286,250 ± 26,125 ops/ms
>> IntrinsicsBenchmark.testHash25Integers25Strings thrpt 5 1290,251 ± 25,217 ops/ms
>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10 thrpt 5 1285,060 ± 33,297 ops/ms
>> IntrinsicsBenchmark.testHash25doubles25Strings thrpt 5 1288,374 ± 40,727 ops/ms
>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10 thrpt 5 1277,822 ± 20,674 ops/ms
>> IntrinsicsBenchmark.testHash50Int50Strings thrpt 5 185,692 ± 753,244 ops/ms
>> IntrinsicsBenchmark.testHash50Int50StringsLoop10 thrpt 5 622,333 ± 12,212 ops/ms
>> IntrinsicsBenchmark.testHash50Integers50Strings thrpt 5 266,784 ± 8,620 ops/ms
>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10 thrpt 5 623,180 ± 15,583 ops/ms
>> IntrinsicsBenchmark.testHash50doubles50Strings thrpt 5 391,497 ± 33,738 ops/ms
>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10 thrpt 5 621,560 ± 12,652 ops/ms
>>
>> ** old version of hash bootstraps **
>>
>> Benchmark Mode Cnt Score Error Units
>> IntrinsicsBenchmark.testHash1IntLoop10 thrpt 5 42161,466 ± 975,935 ops/ms
>> IntrinsicsBenchmark.testHash1StringLoop10 thrpt 5 35445,612 ± 1320,095 ops/ms
>> IntrinsicsBenchmark.testHash2Ints thrpt 5 52157,722 ± 1089,439 ops/ms
>> IntrinsicsBenchmark.testHash2IntsLoop10 thrpt 5 42223,291 ± 1107,430 ops/ms
>> IntrinsicsBenchmark.testHash2Strings thrpt 5 46702,129 ± 1360,150 ops/ms
>> IntrinsicsBenchmark.testHash2StringsLoop10 thrpt 5 35577,294 ± 939,756 ops/ms
>> IntrinsicsBenchmark.testHash5Int5Strings thrpt 5 20351,503 ± 495,039 ops/ms
>> IntrinsicsBenchmark.testHash5Int5StringsLoop10 thrpt 5 20087,351 ± 518,764 ops/ms
>> IntrinsicsBenchmark.testHash5Integers5Strings thrpt 5 20884,773 ± 648,380 ops/ms
>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10 thrpt 5 19990,317 ± 492,250 ops/ms
>> IntrinsicsBenchmark.testHash5doubles5Strings thrpt 5 20913,291 ± 615,744 ops/ms
>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10 thrpt 5 20053,959 ± 538,675 ops/ms
>> IntrinsicsBenchmark.testHash100Ints thrpt 5 6,625 ± 0,157 ops/ms
>> IntrinsicsBenchmark.testHash100IntsLoop10 thrpt 5 6,747 ± 0,201 ops/ms
>> IntrinsicsBenchmark.testHash100Strings thrpt 5 6,305 ± 1,031 ops/ms
>> IntrinsicsBenchmark.testHash100StringsLoop10 thrpt 5 6,826 ± 0,154 ops/ms
>> IntrinsicsBenchmark.testHash20Int20Strings thrpt 5 4188,399 ± 61,255 ops/ms
>> IntrinsicsBenchmark.testHash20Int20StringsLoop10 thrpt 5 4220,161 ± 55,642 ops/ms
>> IntrinsicsBenchmark.testHash20Integers20Strings thrpt 5 4188,347 ± 126,054 ops/ms
>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10 thrpt 5 4251,327 ± 86,473 ops/ms
>> IntrinsicsBenchmark.testHash20doubles20Strings thrpt 5 4206,733 ± 66,459 ops/ms
>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10 thrpt 5 4227,479 ± 78,542 ops/ms
>> IntrinsicsBenchmark.testHash25Int25Strings thrpt 5 3162,896 ± 68,946 ops/ms
>> IntrinsicsBenchmark.testHash25Int25StringsLoop10 thrpt 5 3190,599 ± 71,035 ops/ms
>> IntrinsicsBenchmark.testHash25Integers25Strings thrpt 5 3153,547 ± 59,114 ops/ms
>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10 thrpt 5 3200,687 ± 56,650 ops/ms
>> IntrinsicsBenchmark.testHash25doubles25Strings thrpt 5 3166,884 ± 46,123 ops/ms
>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10 thrpt 5 3202,177 ± 36,159 ops/ms
>> IntrinsicsBenchmark.testHash50Int50Strings thrpt 5 6,485 ± 0,054 ops/ms
>> IntrinsicsBenchmark.testHash50Int50StringsLoop10 thrpt 5 6,543 ± 0,265 ops/ms
>> IntrinsicsBenchmark.testHash50Integers50Strings thrpt 5 6,376 ± 0,150 ops/ms
>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10 thrpt 5 6,774 ± 0,159 ops/ms
>> IntrinsicsBenchmark.testHash50doubles50Strings thrpt 5 6,548 ± 0,303 ops/ms
>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10 thrpt 5 6,687 ± 0,081 ops/ms
>>
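The "about 2x faster" reading of the two tables above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are made up, and the four rows are copied from the tables (old-bootstraps score first, project-tip score second). The scores use a decimal comma, so they are normalised before parsing.

```java
final class HashSpeedup {
    /** Parses a JMH score printed with a decimal comma, e.g. "52157,722". */
    static double score(String text) {
        return Double.parseDouble(text.replace(',', '.'));
    }

    /** Throughput of the old bootstraps relative to the intrinsics-project tip. */
    static double oldOverTip(String oldScore, String tipScore) {
        return score(oldScore) / score(tipScore);
    }

    public static void main(String[] args) {
        // {benchmark, old version score, project tip score}, all in ops/ms.
        String[][] rows = {
            {"testHash2Ints", "52157,722", "21830,140"},
            {"testHash5Int5Strings", "20351,503", "9183,019"},
            {"testHash20Int20Strings", "4188,399", "1669,774"},
            {"testHash100Ints", "6,625", "796,460"},
        };
        for (String[] row : rows) {
            System.out.printf("%-24s old/tip = %.3f%n", row[0], oldOverTip(row[1], row[2]));
        }
    }
}
```

A ratio above 1 means the old bootstraps had higher throughput on that benchmark; a ratio well below 1, as for the 100-argument row, means the project tip did.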
>>> Am 05.03.2019 um 16:31 schrieb Vicente Romero <vicente.romero at oracle.com>:
>>>
>>>
>>>
>>> On 3/5/19 9:02 AM, Hannes Wallnöfer wrote:
>>>> Vicente,
>>>>
>>>> could it be that your change in Objects::hash bootstraps[1] mostly benefits invocations with very large numbers of parameters (like your original 100 parameter tests) but hurts performance with small-to-medium numbers of parameters? I don’t have your latest benchmark sources, but I did some quick tests such as a testHash5Ints5Strings that suggest that may be the case.
>>> That could be, but the intrinsified version is still faster for those cases with a small number of arguments. That's probably why I have focused on the case with a larger number of arguments, but we can change priorities or even have different callsites depending on the number of arguments.
>>>
>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>>
>>>> Hannes
>>> Vicente
>>>
>>>>
>>>>> Am 05.03.2019 um 03:52 schrieb Vicente Romero <vicente.romero at oracle.com>:
>>>>>
>>>>>
>>>>>
>>>>> On 3/4/19 8:11 PM, Alex Buckley wrote:
>>>>>> // Adopting a zero-decimal-places policy because precision to multiple decimal places is less important than accuracy and repeatability.
>>>>>>
>>>>>> On 3/4/2019 4:28 PM, Vicente Romero wrote:
>>>>>>> I have uploaded another round of experiments for Objects::hash, see [1].
>>>>>>> The main variation is that I have included a variant of most of the tests in
>>>>>>> which instead of invoking Objects::hash 10 times sequentially, the same
>>>>>>> invocation occurs inside a loop which is executed 10 times. This shows
>>>>>>> that when the call site is reused, the execution time trumps vanilla
>>>>>>> JDK13 most of the time.
>>>>>> That's not really the story though :-) Yes, the *Int*StringsLoop10 tests run faster with intrinsified invocation than with vanilla invocation, but generally, the *Int*StringsLoop10 tests enjoy less impressive speedups than the *Int*Strings tests. (Example: 25Int25Strings gets a 21x speedup, but 25Int25StringsLoop10 only gets a 2x speedup.)
>>>>>>
>>>>>> This is because the *Int*StringsLoop10 tests already run faster on vanilla JDK 13 than the *Int*Strings tests, presumably thanks to inlining ("the call site is reused").
>>>>>>
>>>>>> I guess that 1IntLoop10, 2IntsLoop10, and 2Ints2StringsLoop10 would have such high throughput on vanilla JDK 13 that their speedups with intrinsification might be significantly <1.
>>>>> Not in all cases; see [1], where the new information is highlighted in yellow.
>>>>>> Alex
>>>>> Vicente
>>>>>
>>>>> [1] http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data_v4.html
>