[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
Dmitrij
dmitrij.pochepko at bell-sw.com
Wed Sep 6 17:39:13 UTC 2017
On 06.09.2017 15:43, Andrew Haley wrote:
> On 06/09/17 12:50, Dmitrij wrote:
>>
>> On 06.09.2017 12:53, Andrew Haley wrote:
>>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>>
>>>> The reasons for this are:
>>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and
>>>> that can’t be changed within the function signature. Thus we can’t fully
>>>> utilize the potential of 64-bit multiplication.
>>>> 2. The umulh instruction is more expensive than the mul instruction.
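(To illustrate point 1: a rough Java sketch of a mulAdd-style loop, not the
exact BigInteger code. Because the multiplier k is a 32-bit int, every
partial product is only a 32x32 -> 64-bit multiply, so a 64-bit-wide multiply
can't be fully used through this signature:)

    // Sketch of a mulAdd-style loop: out[offset..] += in[0..len) * k.
    // The indexing (least significant word first) is an assumption made to
    // keep the sketch short; only the shape of the inner loop matters.
    static int mulAddSketch(int[] out, int[] in, int offset, int len, int k) {
        final long kLong = k & 0xFFFFFFFFL;      // zero-extend the 32-bit multiplier
        long carry = 0;
        for (int i = 0, j = offset; i < len; i++, j++) {
            long product = (in[i] & 0xFFFFFFFFL) * kLong
                         + (out[j] & 0xFFFFFFFFL) + carry;
            out[j] = (int) product;              // low 32 bits back into out
            carry = product >>> 32;              // high 32 bits carry forward
        }
        return (int) carry;
    }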
>>> Ah, my apologies. I wasn't thinking about mulAdd, but about
>>> squareToLen(). But did you look at the way x86 uses 64-bit
>>> multiplications?
>>>
>> Yes. It uses a single x86 mulq instruction, which performs a 64x64
>> multiplication and places the 128-bit result in two registers. There is no
>> such single instruction on AArch64, and the most effective AArch64
>> instruction sequence I've found doesn't seem to be as fast as mulq.
> I think there is effectively a 64x64 -> 128-bit instruction: it's just
> that you have to represent it as a mul and a umulh. But I take your
> point.
>
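(For reference, a portable Java sketch of what the mul/umulh pair computes.
On x86 a single mulq yields both words; on AArch64 the low word comes from
mul and the high word from umulh. Math.multiplyHigh in JDK 9+ returns the
signed high word, so the unsigned variant is written out by hand here:)

    // Full 128-bit product of two unsigned 64-bit values a and b,
    // returned as two separate longs: the low word and the high word.
    static long mulLow(long a, long b) {
        return a * b;                            // what mul computes
    }

    static long mulHighUnsigned(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        long lo   = aLo * bLo;
        long mid1 = aHi * bLo + (lo >>> 32);
        long mid2 = aLo * bHi + (mid1 & 0xFFFFFFFFL);
        return aHi * bHi + (mid1 >>> 32) + (mid2 >>> 32);   // what umulh computes
    }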
>>> One other thing I
>>> haven't checked: is the multiplyToLen() intrinsic called when
>>> squareToLen() is absent?
>>>
>> That would have been a good alternative, but multiplyToLen is not used in
>> place of squareToLen when squareToLen is not implemented. The Java
>> implementation of squareToLen will eventually be compiled and used instead:
>> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039
> Please compare your squareToLen with the
> MacroAssembler::multiply_to_len we already have.
>
I've compared them by calling the square and multiply methods and got the
following results (ThunderX):
Benchmark                                 (size, ints)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±  14.933  ns/op  (26% slower)
BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±  11.857  ns/op  (12% slower)
BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±   4.229  ns/op  (24% slower)
BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±  20.383  ns/op  (22% slower)
BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±   6.232  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±  76.934  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ± 224.290  ns/op  (3% slower)
BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ± 181.053  ns/op  (9% slower)
BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±  12.760  ns/op
BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±   4.850  ns/op
BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±  34.027  ns/op
BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±  19.711  ns/op
BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±   1.040  ns/op
BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±  33.809  ns/op
BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±  88.398  ns/op
BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ± 180.095  ns/op
As we can see, squareToLen is faster than multiplyToLen.
(I've updated the benchmark code at
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
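For reference, the comparison works roughly like the sketch below: a JMH
benchmark that invokes BigInteger's private implSquareToLen/implMultiplyToLen
through reflection and squares the same operand with each. The method names
and signatures below are assumptions (the real harness is at the link above),
and on JDK 9+ deep reflection into java.math may require
--add-opens java.base/java.math=ALL-UNNAMED:

import java.lang.reflect.Method;
import java.math.BigInteger;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class BigIntegerBenchSketch {
    @Param({"1", "2", "3", "5", "10", "50", "90", "127"})
    int size;                                // operand length in 32-bit ints

    int[] x, z;
    Method squareToLen, multiplyToLen;

    @Setup
    public void setup() throws Exception {
        Random r = new Random(42);
        x = new int[size];
        for (int i = 0; i < size; i++) x[i] = r.nextInt();
        z = new int[2 * size];               // result buffer, twice the operand length
        // Assumed private method names/signatures in java.math.BigInteger:
        squareToLen = BigInteger.class.getDeclaredMethod(
                "implSquareToLen", int[].class, int.class, int[].class, int.class);
        multiplyToLen = BigInteger.class.getDeclaredMethod(
                "implMultiplyToLen", int[].class, int.class, int[].class, int.class, int[].class);
        squareToLen.setAccessible(true);
        multiplyToLen.setAccessible(true);
    }

    @Benchmark
    public Object implSquareToLenReflect() throws Exception {
        return squareToLen.invoke(null, x, size, z, 2 * size);
    }

    @Benchmark
    public Object implMutliplyToLenReflect() throws Exception {
        return multiplyToLen.invoke(null, x, size, x, size, z);   // x * x via the general path
    }
}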
Thanks,
Dmitrij