[intrinsics] performance improvements for the intrinsified version of Objects::hash
Vicente Romero
vicente.romero at oracle.com
Thu Feb 28 22:59:41 UTC 2019
Hi all,
I have done some additional experiments, please see the results at [1].
This document contains a summary of all the experiments done so far. I
re-ran the experiments, so some numbers may vary compared to previous
data. The first two tables compare three different implementations:
1. JDK13 vanilla,
2. intrinsics using loop combinators to produce the callsite for
Objects::hash (yesterday I sent a patch that implements the loop
combinators), and
3. the current intrinsics tip at [2]
The last table has the results for String::format; as the loop
combinators implementation doesn't affect that callsite, I didn't
include any results in its corresponding column. This table is provided
for completeness only.
The interesting thing is that each intrinsics implementation beats the
other for different types:
* the loop combinators implementation is faster hashing `int`
variables, see the row starting with
`IntrinsicsBenchmark.testHash100IntVariables` in the first table
* while the implementation at the tip is faster when hashing strings,
see the next row in the same table and the second table, which is
entirely dedicated to string hashing.
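As a point of reference for these numbers, vanilla Objects::hash is specified to behave as if the arguments were placed into an array and hashed with Arrays.hashCode(Object[]), so every primitive argument is boxed and an Object[] is allocated per call, which is part of what intrinsification tries to avoid. A minimal sketch of that shape (class and variable names are just for illustration):

```java
import java.util.Objects;

public class HashShape {
    public static void main(String[] args) {
        int x = 1, y = 2, z = 3;
        // The varargs call boxes each int and allocates an Object[]
        // before looping; Objects.hash is specified via Arrays.hashCode.
        int vanilla = Objects.hash(x, y, z);

        // The equivalent open-coded loop over the boxed values:
        Object[] boxed = { x, y, z };
        int result = 1;
        for (Object o : boxed) {
            result = 31 * result + (o == null ? 0 : o.hashCode());
        }
        System.out.println(vanilla == result); // prints true
    }
}
```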
For both implementations there is a performance cliff somewhere between
60 and 70 strings (see the second table), but it is steeper for the loop
combinators implementation. Still, it could be that with some massaging
one of the implementations will prove to be significantly better than
the other for all types. So I will keep them both for a while to
experiment a bit more with them.
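For readers unfamiliar with the loop combinators, the kind of callsite being built can be sketched with the public java.lang.invoke API. This is only an illustration of the technique using MethodHandles.countedLoop, not the actual patch: it folds an `acc = 31 * acc + hashCode` step over the argument array.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Objects;

public class LoopCombinatorHash {
    // Loop body: acc = 31 * acc + hashCode(args[i]); null hashes to 0.
    static int step(int acc, int i, Object[] args) {
        return 31 * acc + Objects.hashCode(args[i]);
    }

    static final MethodHandle HASH;
    static {
        try {
            // iterations: (Object[]) -> int, the array length
            MethodHandle iterations = MethodHandles.arrayLength(Object[].class);
            // init: (Object[]) -> int, the constant seed 1 (array arg ignored)
            MethodHandle init = MethodHandles.dropArguments(
                    MethodHandles.constant(int.class, 1), 0, Object[].class);
            // body: (int acc, int i, Object[] a) -> int
            MethodHandle body = MethodHandles.lookup().findStatic(
                    LoopCombinatorHash.class, "step",
                    MethodType.methodType(int.class, int.class, int.class, Object[].class));
            HASH = MethodHandles.countedLoop(iterations, init, body);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws Throwable {
        Object[] data = { "Bob", 23 };
        int viaCombinator = (int) HASH.invoke(data);
        System.out.println(viaCombinator == Objects.hash("Bob", 23)); // prints true
    }
}
```

The real implementation would of course spin a specialized handle per callsite shape; this sketch only shows how countedLoop composes the length, seed, and step handles into one hashing method handle.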
Thanks,
Vicente
[1]
http://cr.openjdk.java.net/~vromero/intrinsics_benchmark_results/benchmarkResults_intrinsics_all_data.html
[2] http://hg.openjdk.java.net/amber/amber branch intrinsics-project
On 2/28/19 8:12 AM, Vicente Romero wrote:
>
>
> On 2/27/19 8:18 PM, Alex Buckley wrote:
>> Believing that the second column is intended to be
>> "Intrinsics_02_26", not "Intrinsics_02_22":
>
> That's correct, sorry for the mistake in the column naming.
>
>>
>> The speedups for reference variables get worse with more arguments
>> (though they may still be faster than vanilla invocation for a good
>> while), and the speedups for primitive variables get better with more
>> arguments.
>>
>> One metric is how many variables can be passed and still have
>> intrinsification offer a speedup relative to vanilla invocation. (The
>> cliff between 60 and 70.) Another metric is how many variables can be
>> passed before the speedup stops growing, even if intrinsification is
>> always faster than vanilla invocation. (The global maximum of
>> performance, between 10 and 40.) Presumably, each metric is governed
>> by a different factor.
>
> Right, good analysis. I will do some more research to see where the
> execution time is going.
>
>>
>> Alex
>
> Vicente
>
>>
>> On 2/26/2019 8:28 PM, Vicente Romero wrote:
>>> Hi all,
>>>
>>> I have investigated further the degradation of the intrinsified
>>> version of Objects::hash for reference types. I have made performance
>>> measurements for different numbers of arguments. Please see the
>>> results attached. At least on my PC it seems like there is a cliff
>>> from 60 to 70 arguments. Up to 60 the intrinsified version is faster
>>> than vanilla JDK13, but at 70 and beyond the intrinsified version
>>> starts being slower. Interestingly, even though the current
>>> implementation starts being worse at 70 non-primitive arguments,
>>> that seems like a very good compromise.
>>>
>>> Thanks,
>>> Vicente
>>>
>>> On 2/26/19 8:49 PM, Vicente Romero wrote:
>>>> Hi all,
>>>>
>>>> I have just pushed [1] which improves the performance of the
>>>> intrinsified version of Objects::hash in almost all of our performance
>>>> test cases. This is a big improvement compared to the previous state
>>>> but there is still work to be done. Please find attached a file with
>>>> the benchmark results. It includes the performance numbers obtained
>>>> with the intrinsics repo as of 02/22 plus the ones obtained, almost
>>>> now :), after pushing [1]. As can be seen, there is a noticeable
>>>> improvement in performance. In the last performance measurement we
>>>> found a significant degradation in performance for a large number of
>>>> arguments (~100), even for primitive types. Patch [1] improves the
>>>> performance for both primitive and reference types, with the
>>>> difference that the performance is now much better than vanilla
>>>> JDK13 for primitive types but still worse than vanilla for reference
>>>> types. Still, we are in better shape now than as of 02/22. Stay
>>>> tuned :)
>>>>
>>>> Thanks,
>>>> Vicente
>>>>
>>>> [1] http://hg.openjdk.java.net/amber/amber/rev/0f40d5752eb9
>>>>
>>>> On 2/22/19 4:46 PM, Vicente Romero wrote:
>>>>> Hi,
>>>>>
>>>>> To complete the picture, please find attached the performance
>>>>> results for Objects.hash for a number of experiments. In general
>>>>> they don't look as good as the ones for String::format: it seems
>>>>> there is not much gain unless the number of parameters is large and
>>>>> all the parameters are constants. This is understandable because in
>>>>> that case the compiler generates an LDC of the precomputed result.
>>>>> In all other cases the performance is just a bit better or a lot
>>>>> worse.
>>>>>
>>>>> Thanks,
>>>>> Vicente
>>>>>
>>>>> On 2/22/19 12:33 PM, Vicente Romero wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have executed some performance tests on the intrinsics code to
>>>>>> compare the before and after. Please find the benchmark results and
>>>>>> the JMH based benchmark attached. This benchmark is based on a
>>>>>> previous one written by Hannes. The benchmark compares the execution
>>>>>> between the JDK built from [1], referred to here as JDK13, and [2],
>>>>>> which is the amber repo, branch `intrinsics-project`.
>>>>>>
>>>>>> Some conclusions from the benchmark results:
>>>>>>
>>>>>> * in all cases for which intrinsified code is produced, the
>>>>>> intrinsified code is faster than the legit (JDK13 vanilla) code
>>>>>> * there are wide variations though
>>>>>>
>>>>>> For example, for the test `testStringFormatBoxedArray`, which is
>>>>>> basically benchmarking the performance of `String.format("%s: %d ",
>>>>>> args);` where args is `static final Object[] args = { "Bob", i23
>>>>>> };`, there is basically no visible gain, as in this case the
>>>>>> intrinsification bails out and produces the same code as vanilla
>>>>>> JDK13. This result is expected. The next test with not so much gain
>>>>>> is `testStringFormat1ConstantFloat`, which is testing:
>>>>>>
>>>>>> `String.format("%g", 1.0)`
>>>>>>
>>>>>> The execution is ~2.5 times faster in the intrinsified version, but
>>>>>> nothing compared to `testStringFormat1ConstantStr`, which is ~40
>>>>>> times faster. Another interesting conclusion is that the
>>>>>> improvement fades out with the number of parameters in some cases
>>>>>> but stays constant in others. For example, it is as fast to
>>>>>> concatenate 1 or 100 strings, but formatting one primitive int is
>>>>>> ~45 times faster, vs. a ~3.5x improvement when formatting a
>>>>>> hundred.
>>>>>>
>>>>>> I have also attached the table I used to play with the numbers.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Vicente
>>>>>>
>>>>>> [1] http://hg.openjdk.java.net/jdk/jdk
>>>>>>
>>>>>> [2] http://hg.openjdk.java.net/amber/amber
>>>>>>
>>>>>
>>>>
>>>
>