[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Wed Sep 6 12:43:23 UTC 2017

On 06/09/17 12:50, Dmitrij wrote:
> 
> 
> On 06.09.2017 12:53, Andrew Haley wrote:
>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>
>>> The reasons for this are:
>>> 1. mulAdd uses 32-bit multiplier (unlike multiplyToLen intrinsic) and it
>>> can’t be changed within the function signature. Thus we can’t fully
>>> utilize the potential of 64-bit multiplication.
>>> 2. umulh instruction is more expensive than mul instruction.
>> Ah, my apologies.  I wasn't thinking about mulAdd, but about
>> squareToLen().  But did you look at the way x86 uses 64-bit
>> multiplications?
>>
> Yes. It uses a single x86 mulq instruction, which performs a 64x64
> multiplication and places the 128-bit result in 2 registers. There is no
> such single instruction on aarch64, and the most effective aarch64
> instruction sequence I've found doesn't seem to be as fast as mulq.

I think there is effectively a 64x64 -> 128-bit instruction: it's just
that you have to represent it as a mul and a umulh.  But I take your
point.
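
To illustrate the mul/umulh pairing, here is a minimal Java sketch of a
64x64 -> 128-bit unsigned multiply split into a low half and a high half.
The class and method names (Mul128, mulLo, umulh) are made up for this
example and are not HotSpot code. It assumes Java 9 or later for
Math.multiplyHigh, which is a signed high multiply, so the sketch adds a
correction to get the unsigned result that umulh would give.

```java
public final class Mul128 {
    private Mul128() {}

    // Low 64 bits of a * b. This is the value the AArch64 "mul"
    // instruction produces; it is the same for signed and unsigned inputs.
    public static long mulLo(long a, long b) {
        return a * b;
    }

    // High 64 bits of the unsigned product a * b, i.e. the value the
    // AArch64 "umulh" instruction produces. Math.multiplyHigh is signed,
    // so add b when a has its top bit set, and add a when b has its top
    // bit set, to convert the signed high half into the unsigned one.
    public static long umulh(long a, long b) {
        return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFFFFFFFFFFL; // 2^64 - 1 as an unsigned value
        long b = 0xFFFFFFFFFFFFFFFFL;
        // (2^64 - 1)^2 = 2^128 - 2^65 + 1
        System.out.printf("hi=%016x lo=%016x%n", umulh(a, b), mulLo(a, b));
    }
}
```

The two halves are independent of each other, which is why they map to two
separate instructions on aarch64 while x86 mulq delivers both at once.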

>>    One other thing I
>> haven't checked: is the multiplyToLen() intrinsic called when
>> squareToLen() is absent?
>>
> It could have been a good alternative, but it's not used in place of
> squareToLen when squareToLen is not implemented. The Java implementation
> of squareToLen will eventually be compiled and used instead:
> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039

Please compare your squareToLen with the
MacroAssembler::multiply_to_len we already have.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671