[intrinsics] performance improvements for the intrinsified version of Objects::hash

Tue Mar 12 12:36:21 UTC 2019

In the real world, I would think the vast majority of cases are in the 2-10 arguments range.  So that’s where we need to beat Objects::hash, if we’re going to do it at all.
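
To make the two-tier idea from the quoted thread concrete, here is a minimal sketch, not the actual ObjectsBootstraps code. It assumes a hypothetical cut-over of 10 arguments (the thread does not settle on a value) and a hypothetical helper class HashCallSiteSketch. At or below the cut-over it unrolls the 31 * h + hashCode recurrence with method handle combinators; above it, it falls back to the original Objects.hash through an array-collecting handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

final class HashCallSiteSketch {
    // Hypothetical cut-over point; an assumption, not a value from the thread.
    static final int THRESHOLD = 10;

    static final MethodHandle HASH_STEP;    // (int, Object)int
    static final MethodHandle OBJECTS_HASH; // (Object[])int

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            HASH_STEP = lookup.findStatic(HashCallSiteSketch.class, "hashStep",
                    MethodType.methodType(int.class, int.class, Object.class));
            OBJECTS_HASH = lookup.findStatic(Objects.class, "hash",
                    MethodType.methodType(int.class, Object[].class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One step of the Objects.hash recurrence: 31 * acc + hashCode(o).
    static int hashStep(int acc, Object o) {
        return 31 * acc + Objects.hashCode(o);
    }

    // Returns a handle that takes `arity` Object arguments and returns int.
    static MethodHandle linkHash(int arity) {
        if (arity == 0 || arity > THRESHOLD) {
            // Generic path: collect the arguments into an Object[] and
            // call the original Objects.hash.
            return OBJECTS_HASH.asCollector(Object[].class, arity);
        }
        // Specialized path: unroll the recurrence, starting from the seed 1.
        MethodHandle mh = MethodHandles.insertArguments(HASH_STEP, 0, 1);
        for (int i = 1; i < arity; i++) {
            // The handle built so far feeds the accumulator of the next
            // step, which adds one more Object parameter at the end.
            mh = MethodHandles.collectArguments(HASH_STEP, 0, mh);
        }
        return mh;
    }

    // Convenience wrapper that hides the checked Throwable.
    static int call(MethodHandle mh, Object... args) {
        try {
            return (Integer) mh.invokeWithArguments(args);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        Object[] small = {1, "a", 2.0};
        System.out.println(call(linkHash(small.length), small) == Objects.hash(small));
    }
}
```

A real bootstrap would wrap the returned handle in a ConstantCallSite. Whether the unrolled handle actually beats plain Objects::hash in the 2-10 range is exactly what the benchmarks would have to show.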

> On Mar 6, 2019, at 1:05 PM, Vicente Romero <vicente.romero at oracle.com> wrote:
> 
> Hi Hannes,
> 
> Thanks for the results. Yes, the change I made was oriented more toward large numbers of arguments, so it makes sense that it is not as good for a smaller number of arguments, which I agree should be the most common case. I think we are gathering a good case for the next amber meeting to come to a decision on the Objects::hash area of the project. I think that there are two options:
> 
> 1. do nothing: Objects::hash is not intrinsified at all
> 2. generate the fastest callsite for a small number of arguments and
>   generate the legit code when the number of arguments is above a
>   given threshold
> 
> Comments?
> 
> Vicente
> 
> 
> On 3/6/19 12:48 PM, Hannes Wallnöfer wrote:
>> Vicente,
>> 
>> I ran a number of the Objects::hash benchmarks ranging from 1 to 100 arguments with current intrinsics-project tip (a0a3f9977a7c) as well as the older version of ObjectsBootstraps.java (pre 0f40d5752eb9, see patch attached). The results I get show that with a single argument, performance is pretty much the same, but with any number of arguments between 2 and 50 the old version is about 2x faster. It’s only with 100 arguments that the old version becomes pathetically slow, while the new version still performs decently.
>> 
>> I guess the change in implementation was based on a too narrow set of benchmarks, and in the light of these results we should go back to the old implementation. Since invocations with > 50 arguments should be fairly uncommon I guess a simple solution such as invoking the original method via a vararg method handle would be sufficient?
>> 
>> Hannes
>> 
>> 
>> 
>> ** intrinsics-project tip **
>> 
>> Benchmark                                               Mode  Cnt      Score      Error   Units
>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  41721,730 ± 2477,413  ops/ms
>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35289,443 ± 1030,636  ops/ms
>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  21830,140 ±  852,642  ops/ms
>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  19806,343 ± 1027,029  ops/ms
>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  18409,232 ±  676,311  ops/ms
>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  16842,651 ±  664,615  ops/ms
>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5   9183,019 ±  409,206  ops/ms
>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5   8349,475 ±  886,015  ops/ms
>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5   9155,629 ±  335,129  ops/ms
>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5   8171,108 ±  385,375  ops/ms
>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5   9246,458 ±  502,251  ops/ms
>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5   8419,192 ±  372,626  ops/ms
>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5    796,460 ±   27,123  ops/ms
>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5    792,775 ±   23,402  ops/ms
>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5     69,753 ±   45,947  ops/ms
>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5    619,774 ±   14,503  ops/ms
>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   1669,774 ±   53,991  ops/ms
>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   1644,342 ±   62,946  ops/ms
>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   1651,428 ±  170,146  ops/ms
>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   1635,878 ±   62,007  ops/ms
>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   1673,927 ±   51,559  ops/ms
>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   1641,524 ±   52,172  ops/ms
>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   1286,407 ±   27,920  ops/ms
>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   1286,250 ±   26,125  ops/ms
>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   1290,251 ±   25,217  ops/ms
>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   1285,060 ±   33,297  ops/ms
>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   1288,374 ±   40,727  ops/ms
>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   1277,822 ±   20,674  ops/ms
>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5    185,692 ±  753,244  ops/ms
>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5    622,333 ±   12,212  ops/ms
>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5    266,784 ±    8,620  ops/ms
>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5    623,180 ±   15,583  ops/ms
>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5    391,497 ±   33,738  ops/ms
>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5    621,560 ±   12,652  ops/ms
>> 
>> ** old version of hash bootstraps **
>> 
>> Benchmark                                               Mode  Cnt      Score      Error   Units
>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  42161,466 ±  975,935  ops/ms
>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35445,612 ± 1320,095  ops/ms
>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  52157,722 ± 1089,439  ops/ms
>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  42223,291 ± 1107,430  ops/ms
>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  46702,129 ± 1360,150  ops/ms
>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  35577,294 ±  939,756  ops/ms
>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5  20351,503 ±  495,039  ops/ms
>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5  20087,351 ±  518,764  ops/ms
>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5  20884,773 ±  648,380  ops/ms
>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5  19990,317 ±  492,250  ops/ms
>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5  20913,291 ±  615,744  ops/ms
>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5  20053,959 ±  538,675  ops/ms
>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5      6,625 ±    0,157  ops/ms
>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5      6,747 ±    0,201  ops/ms
>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5      6,305 ±    1,031  ops/ms
>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5      6,826 ±    0,154  ops/ms
>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   4188,399 ±   61,255  ops/ms
>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   4220,161 ±   55,642  ops/ms
>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   4188,347 ±  126,054  ops/ms
>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   4251,327 ±   86,473  ops/ms
>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   4206,733 ±   66,459  ops/ms
>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   4227,479 ±   78,542  ops/ms
>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   3162,896 ±   68,946  ops/ms
>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   3190,599 ±   71,035  ops/ms
>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   3153,547 ±   59,114  ops/ms
>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   3200,687 ±   56,650  ops/ms
>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   3166,884 ±   46,123  ops/ms
>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   3202,177 ±   36,159  ops/ms
>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5      6,485 ±    0,054  ops/ms
>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5      6,543 ±    0,265  ops/ms
>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5      6,376 ±    0,150  ops/ms
>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5      6,774 ±    0,159  ops/ms
>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5      6,548 ±    0,303  ops/ms
>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5      6,687 ±    0,081  ops/ms
>> 
>>> Am 05.03.2019 um 16:31 schrieb Vicente Romero <vicente.romero at oracle.com>:
>>> 
>>> 
>>> 
>>> On 3/5/19 9:02 AM, Hannes Wallnöfer wrote:
>>>> Vicente,
>>>> 
>>>> could it be that your change in Objects::hash bootstraps[1] mostly benefits invocations with very large numbers of parameters (like your original 100 parameter tests) but hurts performance with small-to-medium numbers of parameters? I don’t have your latest benchmark sources, but I did some quick tests such as a testHash5Ints5Strings that suggest that may be the case.
>>> that could be, but the intrinsified version is still faster for those cases with a small number of arguments. That's probably why I have focused on the case with a larger number of arguments, but we can change priorities or even have different callsites depending on the number of arguments
>>> 
>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>> 
>>>> Hannes
>>> Vicente
>>> 
>>>> 
>>>>> Am 05.03.2019 um 03:52 schrieb Vicente Romero <vicente.romero at oracle.com>:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 3/4/19 8:11 PM, Alex Buckley wrote:
>>>>>> // Adopting a zero-decimal-places policy because precision to multiple decimal places is less important than accuracy and repeatability.
>>>>>> 
>>>>>> On 3/4/2019 4:28 PM, Vicente Romero wrote:
>>>>>>> I have uploaded another round of experiments for Objects::hash, see [1].
>>>>>>> The main variation is that I have included a variant of most of the tests in
>>>>>>> which instead of invoking Objects::hash 10 times sequentially, the same
>>>>>>> invocation occurs inside a loop which is executed 10 times. This shows
>>>>>>> that when the call site is reused, the execution time trumps vanilla
>>>>>>> JDK13 most of the time.
>>>>>> That's not really the story though :-) Yes, the *Int*StringsLoop10 tests run faster with intrinsified invocation than with vanilla invocation, but generally, the *Int*StringsLoop10 tests enjoy less impressive speedups than the *Int*Strings tests. (Example: 25Int25Strings gets a 21x speedup, but 25Int25StringsLoop10 only gets a 2x speedup.)
>>>>>> 
>>>>>> This is because the *Int*StringsLoop10 tests already run faster on vanilla JDK 13 than the *Int*Strings tests, presumably thanks to inlining ("the call site is reused").
>>>>>> 
>>>>>> I guess that 1IntLoop10, 2IntsLoop10, and 2Ints2StringsLoop10 would have such high throughput on vanilla JDK 13 that their speedups with intrinsification might be significantly <1.
>>>>> not in all cases; see [1], where the new information is highlighted in yellow
>>>>>> Alex
>>>>> Vicente
>>>>> 
>>>>> [1] http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data_v4.html
>