[intrinsics] performance improvements for the intrinsified version of Objects::hash
Vicente Romero
vicente.romero at oracle.com
Fri Mar 15 02:13:04 UTC 2019
Please see the performance of both implementations for the
multi-threaded version [1].
[1]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v9/benchmarkResults_intrinsics_all_data_v9.html
On 3/12/19 8:18 PM, Vicente Romero wrote:
> Hi,
>
> I have redone the experiments for Objects::hash. The benchmarks now
> access fields instead of local variables and use the @State annotation.
> Please see the results at [1]. I have kept both implementations in the
> table for comparison. The numbers aren't as great as before, but they
> are still better than the non-intrinsified version.
>
> Thanks,
> Vicente
>
> [1]
> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/v7/benchmarkResults_intrinsics_all_data_v7.html
>
> On 3/12/19 8:57 AM, Vicente Romero wrote:
>>
>>
>> On 3/12/19 8:36 AM, Brian Goetz wrote:
>>> In the real world, I would think the vast majority of cases are in
>>> the 2-10 arguments range. So that's where we need to beat
>>> Objects::hash, if we’re going to do it at all.
>>
>> if that is the case we already have a solution for that
>>
>>>
>>>> On Mar 6, 2019, at 1:05 PM, Vicente Romero
>>>> <vicente.romero at oracle.com> wrote:
>>>>
>>>> Hi Hannes,
>>>>
>>>> Thanks for the results. Yes, the change I made was oriented toward a
>>>> large number of arguments, so it makes sense that it is not as good
>>>> for a smaller number of arguments, which I agree should be the most
>>>> common case. I think we are gathering a good case for the next amber
>>>> meeting to come to a decision on the project, on the Objects::hash
>>>> area. I think that there are two options:
>>>>
>>>> 1. do nothing: Objects::hash is not intrinsified at all
>>>> 2. generate the fastest callsite for a small number of arguments and
>>>> generate the legit code when the number of arguments is above a
>>>> given threshold
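A minimal sketch of what option 2 could look like, not taken from the intrinsics project sources. The class name HashCallSites, the THRESHOLD value and the unrolled 31 * h + hashCode(x) chain are all assumptions made for illustration. The chain is only a stand-in for "the fastest callsite"; it is neither the old nor the new ObjectsBootstraps implementation. Above the threshold the sketch simply delegates to Objects.hash through an array-collecting method handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

final class HashCallSites {
    // Hypothetical cutoff; the thread does not name a value.
    static final int THRESHOLD = 10;

    private static final MethodHandle STEP; // (int, Object) -> int
    private static final MethodHandle HASH; // Objects.hash taking Object[]

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            STEP = lookup.findStatic(HashCallSites.class, "step",
                    MethodType.methodType(int.class, int.class, Object.class));
            HASH = lookup.findStatic(Objects.class, "hash",
                    MethodType.methodType(int.class, Object[].class)).asFixedArity();
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One step of the same recurrence Arrays.hashCode(Object[]) uses.
    private static int step(int acc, Object o) {
        return 31 * acc + Objects.hashCode(o);
    }

    // Returns a handle of exactly the requested type, chosen by arity.
    static MethodHandle forType(MethodType type) {
        int n = type.parameterCount();
        if (n == 0 || n > THRESHOLD) {
            // Generic path: collect the arguments into an array, call Objects.hash.
            return HASH.asCollector(Object[].class, n).asType(type);
        }
        // Small arity: unroll step(step(step(1, a1), a2), a3) ...
        MethodHandle h = MethodHandles.insertArguments(STEP, 0, 1);
        for (int i = 1; i < n; i++) {
            h = MethodHandles.collectArguments(STEP, 0, h);
        }
        return h.asType(type); // boxes primitive parameters as needed
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = forType(
                MethodType.methodType(int.class, int.class, String.class));
        int viaHandle = (int) mh.invokeExact(1, "a");
        System.out.println(viaHandle == Objects.hash(1, "a"));
    }
}
```

Both branches are meant to return the same value as Objects.hash for the same arguments, so the threshold would only change performance, not results.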
>>>>
>>>> Comments?
>>>>
>>>> Vicente
>>>>
>>>>
>>>> On 3/6/19 12:48 PM, Hannes Wallnöfer wrote:
>>>>> Vicente,
>>>>>
>>>>> I ran a number of the Objects::hash benchmarks ranging from 1 to
>>>>> 100 arguments with current intrinsics-project tip (a0a3f9977a7c)
>>>>> as well as the older version of ObjectsBootstraps.java (pre
>>>>> 0f40d5752eb9, see patch attached). The results I get show that
>>>>> with a single argument, performance is pretty much the same, but
>>>>> with any number of arguments between 2 and 50 the old version
>>>>> is about 2x faster. It’s only with 100 arguments that the old
>>>>> version becomes pathetically slow, while the new version still
>>>>> performs decently.
>>>>>
>>>>> I guess the change in implementation was based on a too narrow set
>>>>> of benchmarks, and in the light of these results we should go back
>>>>> to the old implementation. Since invocations with > 50 arguments
>>>>> should be fairly uncommon I guess a simple solution such as
>>>>> invoking the original method via a vararg method handle would be
>>>>> sufficient?
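As a rough illustration of the "vararg method handle" idea, the sketch below adapts Objects.hash to an arbitrary call-site type using only standard java.lang.invoke calls. The class and method names (VarargsHashFallback, fallback) are made up for this example and do not come from ObjectsBootstraps.java.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

final class VarargsHashFallback {
    // Returns a handle of the given type that delegates to Objects.hash.
    static MethodHandle fallback(MethodType type) throws ReflectiveOperationException {
        MethodHandle hash = MethodHandles.publicLookup().findStatic(
                Objects.class, "hash",
                MethodType.methodType(int.class, Object[].class));
        return hash.asFixedArity()                                  // plain (Object[])int
                .asCollector(Object[].class, type.parameterCount()) // (Object, ..., Object)int
                .asType(type);                                      // box primitives, adapt types
    }

    public static void main(String[] args) throws Throwable {
        MethodType type = MethodType.methodType(int.class, int.class, String.class);
        MethodHandle mh = fallback(type);
        int viaHandle = (int) mh.invokeExact(1, "a");
        System.out.println(viaHandle == Objects.hash(1, "a"));
    }
}
```

A bootstrap method could wrap such a handle in a ConstantCallSite for calls with many arguments; whether that is fast enough in practice is something only the benchmarks can tell.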
>>>>>
>>>>> Hannes
>>>>>
>>>>>
>>>>>
>>>>> ** intrinsics-project tip **
>>>>>
>>>>> Benchmark                                              Mode   Cnt      Score      Error  Units
>>>>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  41721,730 ± 2477,413  ops/ms
>>>>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35289,443 ± 1030,636  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  21830,140 ±  852,642  ops/ms
>>>>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  19806,343 ± 1027,029  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  18409,232 ±  676,311  ops/ms
>>>>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  16842,651 ±  664,615  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5   9183,019 ±  409,206  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5   8349,475 ±  886,015  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5   9155,629 ±  335,129  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5   8171,108 ±  385,375  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5   9246,458 ±  502,251  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5   8419,192 ±  372,626  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5    796,460 ±   27,123  ops/ms
>>>>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5    792,775 ±   23,402  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5     69,753 ±   45,947  ops/ms
>>>>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5    619,774 ±   14,503  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   1669,774 ±   53,991  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   1644,342 ±   62,946  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   1651,428 ±  170,146  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   1635,878 ±   62,007  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   1673,927 ±   51,559  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   1641,524 ±   52,172  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   1286,407 ±   27,920  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   1286,250 ±   26,125  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   1290,251 ±   25,217  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   1285,060 ±   33,297  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   1288,374 ±   40,727  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   1277,822 ±   20,674  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5    185,692 ±  753,244  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5    622,333 ±   12,212  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5    266,784 ±    8,620  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5    623,180 ±   15,583  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5    391,497 ±   33,738  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5    621,560 ±   12,652  ops/ms
>>>>>
>>>>> ** old version of hash bootstraps **
>>>>>
>>>>> Benchmark                                              Mode   Cnt      Score      Error  Units
>>>>> IntrinsicsBenchmark.testHash1IntLoop10                 thrpt    5  42161,466 ±  975,935  ops/ms
>>>>> IntrinsicsBenchmark.testHash1StringLoop10              thrpt    5  35445,612 ± 1320,095  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Ints                      thrpt    5  52157,722 ± 1089,439  ops/ms
>>>>> IntrinsicsBenchmark.testHash2IntsLoop10                thrpt    5  42223,291 ± 1107,430  ops/ms
>>>>> IntrinsicsBenchmark.testHash2Strings                   thrpt    5  46702,129 ± 1360,150  ops/ms
>>>>> IntrinsicsBenchmark.testHash2StringsLoop10             thrpt    5  35577,294 ±  939,756  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5Strings               thrpt    5  20351,503 ±  495,039  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Int5StringsLoop10         thrpt    5  20087,351 ±  518,764  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5Strings          thrpt    5  20884,773 ±  648,380  ops/ms
>>>>> IntrinsicsBenchmark.testHash5Integers5StringsLoop10    thrpt    5  19990,317 ±  492,250  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5Strings           thrpt    5  20913,291 ±  615,744  ops/ms
>>>>> IntrinsicsBenchmark.testHash5doubles5StringsLoop10     thrpt    5  20053,959 ±  538,675  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Ints                    thrpt    5      6,625 ±    0,157  ops/ms
>>>>> IntrinsicsBenchmark.testHash100IntsLoop10              thrpt    5      6,747 ±    0,201  ops/ms
>>>>> IntrinsicsBenchmark.testHash100Strings                 thrpt    5      6,305 ±    1,031  ops/ms
>>>>> IntrinsicsBenchmark.testHash100StringsLoop10           thrpt    5      6,826 ±    0,154  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20Strings             thrpt    5   4188,399 ±   61,255  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Int20StringsLoop10       thrpt    5   4220,161 ±   55,642  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20Strings        thrpt    5   4188,347 ±  126,054  ops/ms
>>>>> IntrinsicsBenchmark.testHash20Integers20StringsLoop10  thrpt    5   4251,327 ±   86,473  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20Strings         thrpt    5   4206,733 ±   66,459  ops/ms
>>>>> IntrinsicsBenchmark.testHash20doubles20StringsLoop10   thrpt    5   4227,479 ±   78,542  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25Strings             thrpt    5   3162,896 ±   68,946  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Int25StringsLoop10       thrpt    5   3190,599 ±   71,035  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25Strings        thrpt    5   3153,547 ±   59,114  ops/ms
>>>>> IntrinsicsBenchmark.testHash25Integers25StringsLoop10  thrpt    5   3200,687 ±   56,650  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25Strings         thrpt    5   3166,884 ±   46,123  ops/ms
>>>>> IntrinsicsBenchmark.testHash25doubles25StringsLoop10   thrpt    5   3202,177 ±   36,159  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50Strings             thrpt    5      6,485 ±    0,054  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Int50StringsLoop10       thrpt    5      6,543 ±    0,265  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50Strings        thrpt    5      6,376 ±    0,150  ops/ms
>>>>> IntrinsicsBenchmark.testHash50Integers50StringsLoop10  thrpt    5      6,774 ±    0,159  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50Strings         thrpt    5      6,548 ±    0,303  ops/ms
>>>>> IntrinsicsBenchmark.testHash50doubles50StringsLoop10   thrpt    5      6,687 ±    0,081  ops/ms
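To make the "about 2x faster" reading of the two tables easy to check, here is a small sketch that divides the old-version score by the intrinsics-project tip score for a few rows. The scores are copied from the tables above; the class and method names (HashSpeedup, parse, ratio) are invented for this example. A ratio above 1 means the old version has the higher throughput.

```java
final class HashSpeedup {
    // The JMH output above uses a decimal comma, so normalize before parsing.
    static double parse(String score) {
        return Double.parseDouble(score.replace(',', '.'));
    }

    // Throughput of the old bootstraps relative to the intrinsics-project tip.
    static double ratio(String oldScore, String tipScore) {
        return parse(oldScore) / parse(tipScore);
    }

    public static void main(String[] args) {
        String[][] rows = {
            // benchmark,               tip score,   old score
            {"testHash2Ints",          "21830,140", "52157,722"},
            {"testHash5Int5Strings",   "9183,019",  "20351,503"},
            {"testHash25Int25Strings", "1286,407",  "3162,896"},
            {"testHash100Ints",        "796,460",   "6,625"},
        };
        for (String[] row : rows) {
            System.out.printf("%-24s old/tip = %.3f%n", row[0], ratio(row[2], row[1]));
        }
    }
}
```

The ratio only compares mean scores; the ± columns still need to be read alongside it, in particular for rows such as testHash50Int50Strings on the tip, where the error is larger than the score.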
>>>>>
>>>>>> Am 05.03.2019 um 16:31 schrieb Vicente Romero
>>>>>> <vicente.romero at oracle.com>:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/5/19 9:02 AM, Hannes Wallnöfer wrote:
>>>>>>> Vicente,
>>>>>>>
>>>>>>> could it be that your change in Objects::hash bootstraps[1]
>>>>>>> mostly benefits invocations with very large numbers of
>>>>>>> parameters (like your original 100 parameter tests) but hurts
>>>>>>> performance with medium-to-lower number of parameters? I don’t
>>>>>>> have your latest benchmark sources, but I did some quick tests
>>>>>>> such as a testHash5Ints5Strings that suggest that may be the case.
>>>>>> that could be, but the intrinsified version is still faster for
>>>>>> those cases with small number of arguments. That's probably why I
>>>>>> have focused on the larger number of argument case but we can
>>>>>> change priorities or even have different callsites depending on
>>>>>> the number of arguments
>>>>>>
>>>>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>>>>>
>>>>>>> Hannes
>>>>>> Vicente
>>>>>>
>>>>>>>> Am 05.03.2019 um 03:52 schrieb Vicente Romero
>>>>>>>> <vicente.romero at oracle.com>:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/4/19 8:11 PM, Alex Buckley wrote:
>>>>>>>>> // Adopting a zero-decimal-places policy because precision to
>>>>>>>>> multiple decimal places is less important than accuracy and
>>>>>>>>> repeatability.
>>>>>>>>>
>>>>>>>>> On 3/4/2019 4:28 PM, Vicente Romero wrote:
>>>>>>>>>> I have uploaded another round of experiments for Objects::hash,
>>>>>>>>>> see [1]. The main variation is that I have included a variant of
>>>>>>>>>> most of the tests in which, instead of invoking Objects::hash 10
>>>>>>>>>> times sequentially, the same invocation occurs inside a loop which
>>>>>>>>>> is executed 10 times. This shows that when the call site is
>>>>>>>>>> reused, the execution time trumps vanilla JDK13 most of the time.
>>>>>>>>> That's not really the story though :-) Yes, the
>>>>>>>>> *Int*StringsLoop10 tests run faster with intrinsified
>>>>>>>>> invocation than with vanilla invocation, but generally, the
>>>>>>>>> *Int*StringsLoop10 tests enjoy less impressive speedups than
>>>>>>>>> the *Int*Strings tests. (Example: 25Int25Strings gets a 21x
>>>>>>>>> speedup, but 25Int25StringsLoop10 only gets a 2x speedup.)
>>>>>>>>>
>>>>>>>>> This is because the *Int*StringsLoop10 tests already run
>>>>>>>>> faster on vanilla JDK 13 than the *Int*Strings tests,
>>>>>>>>> presumably thanks to inlining ("the call site is reused").
>>>>>>>>>
>>>>>>>>> I guess that 1IntLoop10, 2IntsLoop10, and 2Ints2StringsLoop10
>>>>>>>>> would have such high throughput on vanilla JDK 13 that their
>>>>>>>>> speedups with intrinsification might be significantly <1.
>>>>>>>> not in all cases, see [1]; the new information is highlighted in
>>>>>>>> yellow
>>>>>>>>> Alex
>>>>>>>> Vicente
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data_v4.html
>>
>