[intrinsics] performance improvements for the intrinsified version of Objects::hash

Fri Mar 15 02:13:04 UTC 2019

Please see the performance of both implementations for the 
multi-threaded version [1].

[1] 
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html

On 3/12/19 8:18 PM, Vicente Romero wrote:
> Hi,
>
> I have redone the experiments for Objects::hash. The benchmarks now 
> read their arguments from fields instead of local variables and use 
> the @State annotation. Please see the results at [1]. I have kept both 
> implementations in the table for comparison. The numbers aren't as 
> great as before, but they are still better than the non-intrinsified 
> version.
>
> Thanks,
> Vicente
>
> [1] 
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>
> On 3/12/19 8:57 AM, Vicente Romero wrote:
>>
>>
>> On 3/12/19 8:36 AM, Brian Goetz wrote:
>>> In the real world, I would think the vast majority of cases are in 
>>> the 2-10 argument range.  So that's where we need to beat 
>>> Objects::hash, if we’re going to do it at all.
>>
>> if that is the case we already have a solution for that
>>
>>>
>>>> On Mar 6, 2019, at 1:05 PM, Vicente Romero 
>>>> <vicente.romero at oracle.com> wrote:
>>>>
>>>> Hi Hannes,
>>>>
>>>> Thanks for the results. Yes, the change I made was oriented more 
>>>> toward a large number of arguments, so it makes sense that it is 
>>>> not as good for a smaller number of arguments, which I agree should 
>>>> be the most common case. I think we are gathering a good case for 
>>>> the next amber meeting to come to a decision on the Objects::hash 
>>>> part of the project. I think that there are two options:
>>>>
>>>> 1. do nothing: Objects::hash is not intrinsified at all
>>>> 2. generate the fastest callsite for a small number of arguments and
>>>>    generate the legit code when the number of arguments is above a
>>>>    given threshold
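
Option 2 can be pictured with a small Java sketch. It is only an illustration under stated assumptions, not the ObjectsBootstraps code: the class name, the threshold value of 10, and the helper names are hypothetical. A small arity uses a hand-unrolled combination that yields the same value as Objects.hash, and anything above the threshold falls back to the generic varargs call.

```java
import java.util.Objects;

// Hypothetical illustration of "specialize below a threshold, fall back above it".
class ThresholdHashSketch {
    // Assumed cut-off; the thread does not name a concrete value.
    static final int SMALL_ARITY_THRESHOLD = 10;

    // Unrolled two-argument case: same arithmetic as Arrays.hashCode(Object[]),
    // written out so that no argument array is needed.
    static int hash2(Object a, Object b) {
        return 31 * (31 * 1 + Objects.hashCode(a)) + Objects.hashCode(b);
    }

    // Generic entry point. The loop stands in for code that would be
    // generated per arity; large arities delegate to the original method.
    static int hash(Object... values) {
        if (values.length > SMALL_ARITY_THRESHOLD) {
            return Objects.hash(values); // generic fallback
        }
        int result = 1;
        for (Object v : values) {
            result = 31 * result + Objects.hashCode(v);
        }
        return result;
    }

    public static void main(String[] args) {
        // Both comparisons check agreement with Objects.hash.
        System.out.println(hash2("a", 42) == Objects.hash("a", 42));
        System.out.println(hash("a", 42, 3.0) == Objects.hash("a", 42, 3.0));
    }
}
```

Whatever the threshold ends up being, both paths have to return the value Objects.hash would return, which is why the sketch compares against it directly.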
>>>>
>>>> Comments?
>>>>
>>>> Vicente
>>>>
>>>>
>>>> On 3/6/19 12:48 PM, Hannes Wallnöfer wrote:
>>>>> Vicente,
>>>>>
>>>>> I ran a number of the Objects::hash benchmarks ranging from 1 to 
>>>>> 100 arguments with current intrinsics-project tip (a0a3f9977a7c) 
>>>>> as well as the older version of ObjectsBootstraps.java (pre 
>>>>> 0f40d5752eb9, see patch attached). The results I get show that 
>>>>> with a single argument, performance is pretty much the same, but 
>>>>> with any number of arguments between 2 and 50 the old version 
>>>>> is about 2x faster. It’s only with 100 arguments that the old 
>>>>> version becomes pathetically slow, while the new version still 
>>>>> performs decently.
>>>>>
>>>>> I guess the change in implementation was based on a too narrow set 
>>>>> of benchmarks, and in the light of these results we should go back 
>>>>> to the old implementation. Since invocations with > 50 arguments 
>>>>> should be fairly uncommon I guess a simple solution such as 
>>>>> invoking the original method via a vararg method handle would be 
>>>>> sufficient?
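
A minimal sketch of the vararg method handle idea, using only java.lang.invoke from the JDK. The class and method names are hypothetical and this is not the project's bootstrap code; it only shows that Objects.hash(Object...) can be looked up once and adapted to a fixed number of arguments with asCollector.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

// Hypothetical helper: adapts the original Objects.hash to a fixed arity.
class VarargHashFallback {
    // Returns a handle of type (Object, ..., Object)int with `arity` parameters
    // that collects its arguments into an Object[] and calls Objects.hash.
    static MethodHandle hashHandle(int arity) throws ReflectiveOperationException {
        MethodHandle hash = MethodHandles.publicLookup().findStatic(
                Objects.class, "hash",
                MethodType.methodType(int.class, Object[].class));
        return hash.asCollector(Object[].class, arity);
    }

    // Convenience wrapper so callers do not have to handle Throwable.
    static int hashViaHandle(Object... values) {
        try {
            return (Integer) hashHandle(values.length).invokeWithArguments(values);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        // Compares the handle-based result with a direct call.
        System.out.println(hashViaHandle("a", 42, 3.0) == Objects.hash("a", 42, 3.0));
    }
}
```

In a real bootstrap the handle would presumably be created once at link time and cached in the call site, rather than looked up on every call as this wrapper does.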
>>>>>
>>>>> Hannes
>>>>>
>>>>>
>>>>>
>>>>> ** intrinsics-project tip **
>>>>>
>>>>> Benchmark                                               Mode  Cnt      Score      Error  Units
>>>>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  41721,730 ± 2477,413  ops/ms
>>>>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35289,443 ± 1030,636  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  21830,140 ±  852,642  ops/ms
>>>>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  19806,343 ± 1027,029  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  18409,232 ±  676,311  ops/ms
>>>>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  16842,651 ±  664,615  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5   9183,019 ±  409,206  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5   8349,475 ±  886,015  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5   9155,629 ±  335,129  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5   8171,108 ±  385,375  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5   9246,458 ±  502,251  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5   8419,192 ±  372,626  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5    796,460 ±   27,123  ops/ms
>>>>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5    792,775 ±   23,402  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5     69,753 ±   45,947  ops/ms
>>>>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5    619,774 ±   14,503  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   1669,774 ±   53,991  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   1644,342 ±   62,946  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   1651,428 ±  170,146  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   1635,878 ±   62,007  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   1673,927 ±   51,559  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   1641,524 ±   52,172  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   1286,407 ±   27,920  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   1286,250 ±   26,125  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   1290,251 ±   25,217  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   1285,060 ±   33,297  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   1288,374 ±   40,727  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   1277,822 ±   20,674  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5    185,692 ±  753,244  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5    622,333 ±   12,212  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5    266,784 ±    8,620  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5    623,180 ±   15,583  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5    391,497 ±   33,738  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5    621,560 ±   12,652  ops/ms
>>>>>
>>>>> ** old version of hash bootstraps **
>>>>>
>>>>> Benchmark                                               Mode  Cnt      Score      Error  Units
>>>>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  42161,466 ±  975,935  ops/ms
>>>>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35445,612 ± 1320,095  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  52157,722 ± 1089,439  ops/ms
>>>>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  42223,291 ± 1107,430  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  46702,129 ± 1360,150  ops/ms
>>>>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  35577,294 ±  939,756  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5  20351,503 ±  495,039  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5  20087,351 ±  518,764  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5  20884,773 ±  648,380  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5  19990,317 ±  492,250  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5  20913,291 ±  615,744  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5  20053,959 ±  538,675  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5      6,625 ±    0,157  ops/ms
>>>>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5      6,747 ±    0,201  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5      6,305 ±    1,031  ops/ms
>>>>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5      6,826 ±    0,154  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   4188,399 ±   61,255  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   4220,161 ±   55,642  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   4188,347 ±  126,054  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   4251,327 ±   86,473  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   4206,733 ±   66,459  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   4227,479 ±   78,542  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   3162,896 ±   68,946  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   3190,599 ±   71,035  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   3153,547 ±   59,114  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   3200,687 ±   56,650  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   3166,884 ±   46,123  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   3202,177 ±   36,159  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5      6,485 ±    0,054  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5      6,543 ±    0,265  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5      6,376 ±    0,150  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5      6,774 ±    0,159  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5      6,548 ±    0,303  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5      6,687 ±    0,081  ops/ms
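
The "about 2x faster" reading of the two tables can be checked with a few lines of Java. This is a worked-arithmetic sketch, not part of the benchmark suite: the class name is hypothetical, and the scores are copied from the tables above with the decimal comma written as a decimal point.

```java
import java.util.Locale;

// Hypothetical helper for comparing throughput scores (ops/ms) from the two runs.
class SpeedupFromScores {
    // Ratio of two throughput scores; a value above 1 means `first` is faster.
    static double ratio(double first, double second) {
        return first / second;
    }

    public static void main(String[] args) {
        // Old bootstraps versus project tip, small and medium arities.
        System.out.printf(Locale.ROOT, "testHash2Ints old/tip:          %.2f%n",
                ratio(52157.722, 21830.140));
        System.out.printf(Locale.ROOT, "testHash5Int5Strings old/tip:   %.2f%n",
                ratio(20351.503, 9183.019));
        System.out.printf(Locale.ROOT, "testHash20Int20Strings old/tip: %.2f%n",
                ratio(4188.399, 1669.774));
        // Project tip versus old bootstraps at 100 arguments.
        System.out.printf(Locale.ROOT, "testHash100Ints tip/old:        %.2f%n",
                ratio(796.460, 6.625));
    }
}
```

For the first three rows the old version comes out a little over twice as fast, while for testHash100Ints the tip is more than a hundred times faster, which matches the description in the message.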
>>>>>
>>>>>> Am 05.03.2019 um 16:31 schrieb Vicente Romero 
>>>>>> <vicente.romero at oracle.com>:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/5/19 9:02 AM, Hannes Wallnöfer wrote:
>>>>>>> Vicente,
>>>>>>>
>>>>>>> could it be that your change in the Objects::hash bootstraps[1] 
>>>>>>> mostly benefits invocations with very large numbers of 
>>>>>>> parameters (like your original 100 parameter tests) but hurts 
>>>>>>> performance with a small-to-medium number of parameters? I don’t 
>>>>>>> have your latest benchmark sources, but I did some quick tests 
>>>>>>> such as a testHash5Ints5Strings that suggest that may be the case.
>>>>>> that could be, but the intrinsified version is still faster for 
>>>>>> those cases with a small number of arguments. That's probably why 
>>>>>> I have focused on the case with a larger number of arguments, but 
>>>>>> we can change priorities or even have different callsites 
>>>>>> depending on the number of arguments
>>>>>>
>>>>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>>>>>
>>>>>>> Hannes
>>>>>> Vicente
>>>>>>
>>>>>>>> Am 05.03.2019 um 03:52 schrieb Vicente Romero 
>>>>>>>> <vicente.romero at oracle.com>:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/4/19 8:11 PM, Alex Buckley wrote:
>>>>>>>>> // Adopting a zero-decimal-places policy because precision to 
>>>>>>>>> multiple decimal places is less important than accuracy and 
>>>>>>>>> repeatability.
>>>>>>>>>
>>>>>>>>> On 3/4/2019 4:28 PM, Vicente Romero wrote:
>>>>>>>>>> I have uploaded another round of experiments for 
>>>>>>>>>> Objects::hash, see [1]. The main change is that I have 
>>>>>>>>>> included a variant of most of the tests in which, instead 
>>>>>>>>>> of invoking Objects::hash 10 times sequentially, the same 
>>>>>>>>>> invocation occurs inside a loop that is executed 10 times. 
>>>>>>>>>> This shows that when the call site is reused, the 
>>>>>>>>>> execution time trumps vanilla JDK13 most of the time.
>>>>>>>>> That's not really the story though :-) Yes, the 
>>>>>>>>> *Int*StringsLoop10 tests run faster with intrinsified 
>>>>>>>>> invocation than with vanilla invocation, but generally, the 
>>>>>>>>> *Int*StringsLoop10 tests enjoy less impressive speedups than 
>>>>>>>>> the *Int*Strings tests. (Example: 25Int25Strings gets a 21x 
>>>>>>>>> speedup, but 25Int25StringsLoop10 only gets a 2x speedup.)
>>>>>>>>>
>>>>>>>>> This is because the *Int*StringsLoop10 tests already run 
>>>>>>>>> faster on vanilla JDK 13 than the *Int*Strings tests, 
>>>>>>>>> presumably thanks to inlining ("the call site is reused").
>>>>>>>>>
>>>>>>>>> I guess that 1IntLoop10, 2IntsLoop10, and 2Ints2StringsLoop10 
>>>>>>>>> would have such high throughput on vanilla JDK 13 that their 
>>>>>>>>> speedups with intrinsification might be significantly <1.
>>>>>>>> not in all cases, see [1]; the new information is highlighted 
>>>>>>>> in yellow
>>>>>>>>> Alex
>>>>>>>> Vicente
>>>>>>>>
>>>>>>>> [1] 
>>>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data_v4.html
>>
>