[intrinsics] performance improvements for the intrinsified version of Objects::hash

Wed Mar 13 00:18:47 UTC 2019

Hi,

I have redone the experiments for Objects::hash. The benchmarks now read 
their arguments from fields instead of local variables and use the 
@State annotation. Please see the results at [1]. I have kept both 
implementations in the table for comparison. The numbers aren't as great 
as before, but they are still better than the non-intrinsified version.

Thanks,
Vicente

[1] 
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html

On 3/12/19 8:57 AM, Vicente Romero wrote:
>
>
> On 3/12/19 8:36 AM, Brian Goetz wrote:
>> In the real world, I would think the vast majority of cases are in 
>> the 2-10 arguments range.  So that’s where we need to beat 
>> Objects::hash, if we’re going to do it at all.
>
> if that is the case we already have a solution for that
>
>>
>>> On Mar 6, 2019, at 1:05 PM, Vicente Romero 
>>> <vicente.romero at oracle.com> wrote:
>>>
>>> Hi Hannes,
>>>
>>> Thanks for the results. Yes, the change I made was oriented more 
>>> toward a large number of arguments, so it makes sense that it is not 
>>> as good for a smaller number of arguments, which I agree should be 
>>> the most common case. I think we are gathering a good case for the 
>>> next amber meeting to come to a decision on the project, on the 
>>> Objects::hash area. I think that there are two options:
>>>
>>> 1. do nothing: Objects::hash is not intrinsified at all
>>> 2. generate the fastest callsite for a small number of arguments and
>>>    generate the legit code when the number of arguments is above a
>>>    given threshold
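
A minimal sketch of what option 2 could look like, using only java.lang.invoke. Everything here is an assumption for illustration: the class name ThresholdHashSketch, the helpers hash1/hash2/link/hashVia and the cut-off value are hypothetical and are not taken from ObjectsBootstraps.java. Small arities get a hand-unrolled target; anything above the threshold falls back to the original Objects.hash through a collecting (vararg-style) method handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

final class ThresholdHashSketch {
    // Hypothetical cut-off; the thread does not settle on a value.
    static final int THRESHOLD = 2;

    // Hand-unrolled targets; same arithmetic as Arrays.hashCode(Object[]).
    static int hash1(Object a) {
        return 31 + Objects.hashCode(a);
    }

    static int hash2(Object a, Object b) {
        return 31 * (31 + Objects.hashCode(a)) + Objects.hashCode(b);
    }

    // Returns a fixed-arity handle of type (Object, ..., Object)int.
    static MethodHandle link(int arity) throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        if (arity >= 1 && arity <= THRESHOLD) {
            // Small arity: bind directly to the unrolled method.
            MethodType type = MethodType.genericMethodType(arity)
                                        .changeReturnType(int.class);
            return lookup.findStatic(ThresholdHashSketch.class, "hash" + arity, type);
        }
        // Large arity: collect the arguments into an Object[] and call
        // the original Objects.hash(Object...).
        MethodHandle generic = lookup.findStatic(Objects.class, "hash",
                MethodType.methodType(int.class, Object[].class));
        return generic.asFixedArity().asCollector(Object[].class, arity);
    }

    // Convenience wrapper for experiments; a real call site would link once
    // and reuse the handle instead of linking on every call.
    static int hashVia(Object... args) {
        try {
            return (int) link(args.length).invokeWithArguments(args);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

Calling ThresholdHashSketch.hashVia("a", 1) takes the unrolled path, while a call with five arguments goes through the collector; both should agree with Objects.hash for the same arguments.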
>>>
>>> Comments?
>>>
>>> Vicente
>>>
>>>
>>> On 3/6/19 12:48 PM, Hannes Wallnöfer wrote:
>>>> Vicente,
>>>>
>>>> I ran a number of the Objects::hash benchmarks ranging from 1 to 
>>>> 100 arguments with current intrinsics-project tip (a0a3f9977a7c) as 
>>>> well as the older version of ObjectsBootstraps.java (pre 
>>>> 0f40d5752eb9, see patch attached). The results I get show that with 
>>>> a single argument, performance is pretty much the same, but with 
>>>> any number of arguments between 2 and 50 the old version is 
>>>> about 2x faster. It’s only with 100 arguments that the old version 
>>>> becomes pathetically slow, while the new version still performs 
>>>> decently.
>>>>
>>>> I guess the change in implementation was based on a too narrow set 
>>>> of benchmarks, and in the light of these results we should go back 
>>>> to the old implementation. Since invocations with > 50 arguments 
>>>> should be fairly uncommon I guess a simple solution such as 
>>>> invoking the original method via a vararg method handle would be 
>>>> sufficient?
>>>>
>>>> Hannes
>>>>
>>>>
>>>>
>>>> ** intrinsics-project tip **
>>>>
>>>> Benchmark                                              Mode   Cnt      Score      Error  Units
>>>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  41721,730 ± 2477,413  ops/ms
>>>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35289,443 ± 1030,636  ops/ms
>>>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  21830,140 ±  852,642  ops/ms
>>>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  19806,343 ± 1027,029  ops/ms
>>>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  18409,232 ±  676,311  ops/ms
>>>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  16842,651 ±  664,615  ops/ms
>>>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5   9183,019 ±  409,206  ops/ms
>>>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5   8349,475 ±  886,015  ops/ms
>>>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5   9155,629 ±  335,129  ops/ms
>>>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5   8171,108 ±  385,375  ops/ms
>>>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5   9246,458 ±  502,251  ops/ms
>>>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5   8419,192 ±  372,626  ops/ms
>>>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5    796,460 ±   27,123  ops/ms
>>>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5    792,775 ±   23,402  ops/ms
>>>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5     69,753 ±   45,947  ops/ms
>>>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5    619,774 ±   14,503  ops/ms
>>>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   1669,774 ±   53,991  ops/ms
>>>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   1644,342 ±   62,946  ops/ms
>>>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   1651,428 ±  170,146  ops/ms
>>>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   1635,878 ±   62,007  ops/ms
>>>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   1673,927 ±   51,559  ops/ms
>>>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   1641,524 ±   52,172  ops/ms
>>>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   1286,407 ±   27,920  ops/ms
>>>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   1286,250 ±   26,125  ops/ms
>>>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   1290,251 ±   25,217  ops/ms
>>>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   1285,060 ±   33,297  ops/ms
>>>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   1288,374 ±   40,727  ops/ms
>>>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   1277,822 ±   20,674  ops/ms
>>>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5    185,692 ±  753,244  ops/ms
>>>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5    622,333 ±   12,212  ops/ms
>>>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5    266,784 ±    8,620  ops/ms
>>>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5    623,180 ±   15,583  ops/ms
>>>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5    391,497 ±   33,738  ops/ms
>>>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5    621,560 ±   12,652  ops/ms
>>>>
>>>>
>>>> ** old version of hash bootstraps **
>>>>
>>>> Benchmark                                              Mode   Cnt      Score      Error  Units
>>>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  42161,466 ±  975,935  ops/ms
>>>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35445,612 ± 1320,095  ops/ms
>>>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  52157,722 ± 1089,439  ops/ms
>>>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  42223,291 ± 1107,430  ops/ms
>>>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  46702,129 ± 1360,150  ops/ms
>>>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  35577,294 ±  939,756  ops/ms
>>>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5  20351,503 ±  495,039  ops/ms
>>>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5  20087,351 ±  518,764  ops/ms
>>>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5  20884,773 ±  648,380  ops/ms
>>>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5  19990,317 ±  492,250  ops/ms
>>>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5  20913,291 ±  615,744  ops/ms
>>>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5  20053,959 ±  538,675  ops/ms
>>>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5      6,625 ±    0,157  ops/ms
>>>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5      6,747 ±    0,201  ops/ms
>>>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5      6,305 ±    1,031  ops/ms
>>>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5      6,826 ±    0,154  ops/ms
>>>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   4188,399 ±   61,255  ops/ms
>>>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   4220,161 ±   55,642  ops/ms
>>>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   4188,347 ±  126,054  ops/ms
>>>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   4251,327 ±   86,473  ops/ms
>>>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   4206,733 ±   66,459  ops/ms
>>>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   4227,479 ±   78,542  ops/ms
>>>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   3162,896 ±   68,946  ops/ms
>>>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   3190,599 ±   71,035  ops/ms
>>>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   3153,547 ±   59,114  ops/ms
>>>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   3200,687 ±   56,650  ops/ms
>>>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   3166,884 ±   46,123  ops/ms
>>>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   3202,177 ±   36,159  ops/ms
>>>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5      6,485 ±    0,054  ops/ms
>>>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5      6,543 ±    0,265  ops/ms
>>>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5      6,376 ±    0,150  ops/ms
>>>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5      6,774 ±    0,159  ops/ms
>>>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5      6,548 ±    0,303  ops/ms
>>>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5      6,687 ±    0,081  ops/ms
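
To make the "about 2x faster" reading of the two tables concrete, here is a small sketch that divides the old-bootstrap score by the tip score for a few rows. The class and method names are hypothetical; the scores are copied from the tables above, with the decimal comma written as a decimal point. A ratio above 1 means the old version has higher throughput.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpeedupSketch {
    // Old-bootstrap throughput divided by tip throughput (both in ops/ms).
    static double speedup(double oldOpsPerMs, double tipOpsPerMs) {
        return oldOpsPerMs / tipOpsPerMs;
    }

    // A handful of rows from the two tables; not the full data set.
    static Map<String, Double> ratios() {
        Map<String, Double> r = new LinkedHashMap<>();
        r.put("testHash2Ints",          speedup(52157.722, 21830.140));
        r.put("testHash5Int5Strings",   speedup(20351.503,  9183.019));
        r.put("testHash20Int20Strings", speedup( 4188.399,  1669.774));
        r.put("testHash25Int25Strings", speedup( 3162.896,  1286.407));
        r.put("testHash100Ints",        speedup(    6.625,   796.460));
        return r;
    }

    public static void main(String[] args) {
        ratios().forEach((name, ratio) ->
                System.out.printf("%-24s %.3fx%n", name, ratio));
    }
}
```

For the rows with up to 50 arguments the ratio lands between 2 and 3, while for testHash100Ints it drops far below 1, which matches the description of the old version collapsing at 100 arguments. The error margins from the tables are ignored here, so this is only a rough comparison.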
>>>>
>>>>> Am 05.03.2019 um 16:31 schrieb Vicente Romero 
>>>>> <vicente.romero at oracle.com>:
>>>>>
>>>>>
>>>>>
>>>>> On 3/5/19 9:02 AM, Hannes Wallnöfer wrote:
>>>>>> Vicente,
>>>>>>
>>>>>> could it be that your change in Objects::hash bootstraps[1] 
>>>>>> mostly benefits invocations with very large numbers of parameters 
>>>>>> (like your original 100 parameter tests) but hurts performance 
>>>>>> with medium-to-lower number of parameters? I don’t have your 
>>>>>> latest benchmark sources, but I did some quick tests such as a 
>>>>>> testHash5Ints5Strings that suggest that may be the case.
>>>>> that could be, but the intrinsified version is still faster for 
>>>>> those cases with small number of arguments. That's probably why I 
>>>>> have focused on the larger number of argument case but we can 
>>>>> change priorities or even have different callsites depending on 
>>>>> the number of arguments
>>>>>
>>>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>>>>
>>>>>> Hannes
>>>>> Vicente
>>>>>
>>>>>>> Am 05.03.2019 um 03:52 schrieb Vicente Romero 
>>>>>>> <vicente.romero at oracle.com>:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/4/19 8:11 PM, Alex Buckley wrote:
>>>>>>>> // Adopting a zero-decimal-places policy because precision to 
>>>>>>>> multiple decimal places is less important than accuracy and 
>>>>>>>> repeatability.
>>>>>>>>
>>>>>>>> On 3/4/2019 4:28 PM, Vicente Romero wrote:
>>>>>>>>> I have uploaded another round of experiments for 
>>>>>>>>> Objects::hash, see [1].
>>>>>>>>> The main variation is that I have included a variant of most of the 
>>>>>>>>> tests in
>>>>>>>>> which instead of invoking Objects::hash 10 times sequentially, 
>>>>>>>>> the same
>>>>>>>>> invocation occurs inside a loop which is executed 10 times. 
>>>>>>>>> This shows
>>>>>>>>> that when the call site is reused, the execution time trumps 
>>>>>>>>> vanilla
>>>>>>>>> JDK13 most of the time.
>>>>>>>> That's not really the story though :-) Yes, the 
>>>>>>>> *Int*StringsLoop10 tests run faster with intrinsified 
>>>>>>>> invocation than with vanilla invocation, but generally, the 
>>>>>>>> *Int*StringsLoop10 tests enjoy less impressive speedups than 
>>>>>>>> the *Int*Strings tests. (Example: 25Int25Strings gets a 21x 
>>>>>>>> speedup, but 25Int25StringsLoop10 only gets a 2x speedup.)
>>>>>>>>
>>>>>>>> This is because the *Int*StringsLoop10 tests already run faster 
>>>>>>>> on vanilla JDK 13 than the *Int*Strings tests, presumably 
>>>>>>>> thanks to inlining ("the call site is reused").
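
The two benchmark shapes being compared can be sketched as follows. This is only an illustration under assumptions: the class, field and method names are hypothetical and the actual benchmark sources are not part of this thread. The first method has ten textually separate Objects.hash invocations, the second has a single invocation executed ten times in a loop.

```java
import java.util.Objects;

final class CallSiteShapesSketch {
    // Hypothetical inputs, kept in instance fields.
    int i1 = 1, i2 = 2;
    String s1 = "a", s2 = "b";

    // Shape 1: ten distinct call sites, each executed once per call.
    int sequential10() {
        int sum = 0;
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        sum += Objects.hash(i1, i2, s1, s2);
        return sum;
    }

    // Shape 2: one call site, executed ten times per call.
    int loop10() {
        int sum = 0;
        for (int n = 0; n < 10; n++) {
            sum += Objects.hash(i1, i2, s1, s2);
        }
        return sum;
    }
}
```

Both methods compute the same value; they differ only in how many call sites the compiler and the JIT see, which is the property the Loop10 variants are meant to isolate.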
>>>>>>>>
>>>>>>>> I guess that 1IntLoop10, 2IntsLoop10, and 2Ints2StringsLoop10 
>>>>>>>> would have such high throughput on vanilla JDK 13 that their 
>>>>>>>> speedups with intrinsification might be significantly <1.
>>>>>>> not in all cases, see [1]; the new information is highlighted in 
>>>>>>> yellow
>>>>>>>> Alex
>>>>>>> Vicente
>>>>>>>
>>>>>>> [1] 
>>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data_v4.html
>