[intrinsics] performance improvements for the intrinsified version of Objects::hash
Vicente Romero
vicente.romero at oracle.com
Tue Mar 5 00:28:22 UTC 2019
Hi,
I have uploaded another round of experiments for Objects::hash, see [1].
The main variation: I have included a variant of most of the tests in
which, instead of invoking Objects::hash 10 times sequentially, the same
invocation occurs inside a loop that is executed 10 times. This shows
that when the call site is reused, the intrinsified version outperforms
vanilla JDK13 most of the time. This adds another aspect to the intrinsification
of Objects::hash but there is still an open question: should the
intrinsics project include Objects::hash or should we go only with
String::format?
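To make the comparison concrete, the two call-site shapes being measured can be sketched roughly as follows (a minimal illustration with made-up names, not the actual JMH benchmark; the real tests use 10 invocations):

```java
import java.util.Objects;

public class CallSiteShapes {
    // Shape 1: several textually distinct invocations, i.e. separate
    // call sites (the real benchmark repeats the line 10 times).
    static int sequential(Object a, Object b) {
        int h = 0;
        h += Objects.hash(a, b);
        h += Objects.hash(a, b);
        h += Objects.hash(a, b);
        return h;
    }

    // Shape 2: one invocation inside a loop, i.e. a single, reused call site.
    static int looped(Object a, Object b) {
        int h = 0;
        for (int i = 0; i < 3; i++) {
            h += Objects.hash(a, b);
        }
        return h;
    }
}
```

Both shapes compute the same result; they differ only in how many call sites the JIT and the intrinsified bootstrap get to see.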
Please share your feedback on this,
Vicente
[1]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data_v3.html
On 2/28/19 6:08 PM, Brian Goetz wrote:
> So, the realistic cases for Objects::hash are where there are a handful (3-5) of mixed strings and ints. The data we have omits these cases — could we run a few of them?
>
> Barring some better numbers on these, it seems the sweet spot is to do String::format but not bother with hash?
>
>> On Feb 28, 2019, at 5:59 PM, Vicente Romero <vicente.romero at oracle.com> wrote:
>>
>> Hi all,
>>
>> I have done some additional experiments, please see the results at [1]. This document contains a summary of all the experiments done so far. I re-ran the experiments, so some numbers may vary compared to previous data. The first two tables compare 3 different implementations:
>>
>> 1. JDK13 vanilla,
>> 2. Intrinsics using loop combinators to produce the call site for
>> Objects::hash (yesterday I sent a patch that implements the loop
>> combinators),
>> 3. and the current intrinsics tip at [2]
>>
>> The last table has the results for String::format; as the loop combinators implementation doesn't affect that call site, I didn't include any results in its corresponding column. This table is provided for completeness only.
>>
>> The interesting thing is that each intrinsics implementation beats the other for different types:
>>
>> * the loop combinators implementation is faster at hashing `int`
>> variables, see the row starting with
>> `IntrinsicsBenchmark.testHash100IntVariables` in the first table,
>> * while the implementation at tip is faster when hashing strings, see
>> the next row in the same table and the second table, which is
>> dedicated entirely to string hashing.
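As a reminder of what these rows measure: Objects::hash is a varargs method, so hashing int variables boxes them into an Object[] before delegating to Arrays.hashCode, overhead that an intrinsified call site can presumably avoid. A minimal sketch of the equivalence:

```java
import java.util.Arrays;
import java.util.Objects;

public class HashEquivalence {
    public static void main(String[] args) {
        int i1 = 1, i2 = 2, i3 = 3;
        // Objects.hash boxes the ints into an Object[] and delegates to
        // Arrays.hashCode.
        System.out.println(Objects.hash(i1, i2, i3)
                == Arrays.hashCode(new Object[] { i1, i2, i3 }));

        String s1 = "a", s2 = "b";
        // For reference types only the array allocation remains; no boxing.
        System.out.println(Objects.hash(s1, s2)
                == Arrays.hashCode(new Object[] { s1, s2 }));
    }
}
```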
>>
>> For both implementations there is a cliff somewhere between 60 and 70 strings (see the second table), but it is steeper for the loop combinators implementation. In any case, with some massaging one of the implementations could prove to be significantly better than the other for all types, so I will keep them both for a while to experiment a bit more with them,
>>
>> Thanks,
>> Vicente
>>
>> [1] http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data.html
>> [2] http://hg.openjdk.java.net/amber/amber branch intrinsics-project
>>
>> On 2/28/19 8:12 AM, Vicente Romero wrote:
>>>
>>> On 2/27/19 8:18 PM, Alex Buckley wrote:
>>>> Believing that the second column is intended to be "Intrinsics_02_26", not "Intrinsics_02_22":
>>> that's correct, sorry for the mistake in the column naming
>>>
>>>> The speedups for reference variables get worse with more arguments (though they may still be faster than vanilla invocation for a good while), and the speedups for primitive variables get better with more arguments.
>>>>
>>>> One metric is how many variables can be passed and still have intrinsification offer a speedup relative to vanilla invocation. (The cliff between 60 and 70.) Another metric is how many variables can be passed before the speedup stops growing, even if intrinsification is always faster than vanilla invocation. (The global maximum of performance, between 10 and 40.) Presumably, each metric is governed by a different factor.
>>> right, good analysis; I will do some more research to try to see where the execution time is going
>>>
>>>> Alex
>>> Vicente
>>>
>>>> On 2/26/2019 8:28 PM, Vicente Romero wrote:
>>>>> Hi all,
>>>>>
>>>>> I have investigated further the degradation of the intrinsified
>>>>> version of Objects::hash for reference types. I have made performance
>>>>> measurements for different numbers of arguments. Please see the results
>>>>> attached. At least on my PC it seems like there is a cliff from 60 to 70
>>>>> arguments. Up to 60 the intrinsified version is faster than vanilla
>>>>> JDK13, but from 70 on the intrinsified version starts being slower.
>>>>> Interestingly, even if the current implementation starts being worse
>>>>> at 70 non-primitive arguments, that seems like a very good
>>>>> compromise.
>>>>>
>>>>> Thanks,
>>>>> Vicente
>>>>>
>>>>> On 2/26/19 8:49 PM, Vicente Romero wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have just pushed [1] which improves the performance of the
>>>>>> intrinsified version of Objects::hash in almost all of our performance
>>>>>> test cases. This is a big improvement compared to the previous state
>>>>>> but there is still work to be done. Please find attached a file with
>>>>>> the benchmark results. It includes the performance numbers obtained
>>>>>> with the intrinsics repo as of 02/22 plus the ones obtained, almost
>>>>>> now :), after pushing [1]. As can be seen, there is a noticeable
>>>>>> improvement in the performance. In the last performance measurement we
>>>>>> found a noticeable degradation in performance for a large number of
>>>>>> arguments (~100), even for primitive types. Patch [1] improves the
>>>>>> performance for both primitive and reference types, with the difference
>>>>>> that now the performance is much better than vanilla JDK13 for
>>>>>> primitive types but still worse than vanilla for reference
>>>>>> types, although we are in better shape now compared to the state as of
>>>>>> 02/22. Stay tuned :)
>>>>>>
>>>>>> Thanks,
>>>>>> Vicente
>>>>>>
>>>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>>>>
>>>>>> On 2/22/19 4:46 PM, Vicente Romero wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> To complete the picture please find attached the performance results
>>>>>>> for Objects.hash for a number of experiments. In general they don't
>>>>>>> look as good as the ones for String::format: it seems like
>>>>>>> there is not much gain unless the number of parameters is large and
>>>>>>> all the parameters are constants. This is understandable because in
>>>>>>> that case the compiler generates an LDC of the result. In all other
>>>>>>> cases the performance is just a bit better or a lot worse.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vicente
>>>>>>>
>>>>>>> On 2/22/19 12:33 PM, Vicente Romero wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have executed some performance tests on the intrinsics code to
>>>>>>>> compare the before and after. Please find the benchmark results and
>>>>>>>> the JMH based benchmark attached. This benchmark is based on a
>>>>>>>> previous one written by Hannes. The benchmark compares the execution
>>>>>>>> between the JDK built from [1], referred here as JDK13, and [2]
>>>>>>>> which is the amber repo, branch `intrinsics-project`.
>>>>>>>>
>>>>>>>> Some conclusions from the benchmark results:
>>>>>>>>
>>>>>>>> * the intrinsified code is faster in all cases for which
>>>>>>>> intrinsified code is produced, compared to the legit (vanilla
>>>>>>>> JDK13) code
>>>>>>>> * there are wide variations though
>>>>>>>>
>>>>>>>> For example for the test: `testStringFormatBoxedArray` which is
>>>>>>>> basically benchmarking the performance of: `String.format("%s: %d ",
>>>>>>>> args);` where args is: `static final Object[] args = { "Bob", i23
>>>>>>>> };`, there is basically no visible gain, as in this case the
>>>>>>>> intrinsification is bailing out and producing the same code as vanilla
>>>>>>>> JDK13. This result is expected. The next test with not so much gain
>>>>>>>> is: `testStringFormat1ConstantFloat` which is testing:
>>>>>>>>
>>>>>>>> `String.format("%g", 1.0)`
>>>>>>>>
>>>>>>>> the execution is ~2.5 times faster in the intrinsified version but
>>>>>>>> nothing compared to: `testStringFormat1ConstantStr` which is ~40
>>>>>>>> times faster. Another interesting conclusion is that the improvement
>>>>>>>> fades out with the number of parameters in some cases but stays
>>>>>>>> constant in others. For example it is as fast to concatenate 1 or
>>>>>>>> 100 strings, but formatting one primitive int is ~45 times faster vs a
>>>>>>>> 3.5x improvement when formatting a hundred.
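For reference, the shapes of the call sites these tests exercise, as I understand them from the test names (a reconstruction; the actual benchmark is attached to the original message):

```java
public class FormatShapes {
    public static void main(String[] args) {
        // testStringFormat1ConstantStr: a constant %s argument, which the
        // intrinsification can effectively reduce to a constant (~40x faster).
        String s = String.format("%s", "Bob");

        // testStringFormat1ConstantFloat: %g still requires the
        // locale-sensitive floating-point conversion at runtime,
        // hence the smaller ~2.5x gain.
        String g = String.format("%g", 1.0);

        System.out.println(s + " / " + g);
    }
}
```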
>>>>>>>>
>>>>>>>> I have also attached the table I used to play with the numbers.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Vicente
>>>>>>>>
>>>>>>>> [1] http://hg.openjdk.java.net/jdk/jdk
>>>>>>>>
>>>>>>>> [2] http://hg.openjdk.java.net/amber/amber
>>>>>>>>
More information about the amber-dev
mailing list