[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Wed Sep 6 09:53:05 UTC 2017

On 05/09/17 18:34, Dmitrij Pochepko wrote:
> As you can see, it's up to 26% worse throughput with wider multiplication.
> 
> The reasons for this are:
> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and it 
> can’t be changed within the function signature. Thus we can’t fully 
> utilize the potential of 64-bit multiplication.
> 2. The umulh instruction is more expensive than the mul instruction.
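
To make point 1 concrete, here is a rough Java sketch of a mulAdd-style
loop, loosely modelled on the pure-Java fallback in BigInteger. The class
name MulAddSketch and the exact parameter layout are assumptions made for
illustration, not the intrinsic itself. Because the multiplier k is a
32-bit int, each step is a 32x32 -> 64-bit product that fits in one long,
so the high half of a 64x64 product is never needed:

```java
final class MulAddSketch {
    private static final long LONG_MASK = 0xFFFFFFFFL;

    /**
     * Computes out += in * k over len big-endian 32-bit words and returns
     * the final carry word. The offset is counted from the low-order end
     * of out (an assumption mirroring the Java fallback).
     */
    static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & LONG_MASK;          // multiplier is only 32 bits wide
        long carry = 0;
        int o = out.length - offset - 1;     // index of the lowest word touched
        for (int j = len - 1; j >= 0; j--) {
            // 32x32 -> 64-bit product plus two 32-bit addends cannot
            // exceed 2^64 - 1, so a single long holds the result.
            long product = (in[j] & LONG_MASK) * kLong
                         + (out[o] & LONG_MASK) + carry;
            out[o--] = (int) product;        // low 32 bits stay in place
            carry = product >>> 32;          // high 32 bits propagate
        }
        return (int) carry;
    }
}
```

With this signature, widening the loop to 64-bit words would still leave
k at 32 bits, which is the limitation described in point 1.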

Ah, my apologies.  I wasn't thinking about mulAdd, but about
squareToLen().  But did you look at the way x86 uses 64-bit
multiplications?
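
For reference, a full 64x64 -> 128-bit product needs both halves: the low
half (a plain mul on AArch64) and the high half (umulh). A minimal Java
sketch of the two halves, using Math.multiplyHigh (signed, JDK 9+) plus
the usual correction terms to obtain the unsigned high half. The class and
method names are made up for illustration, and this says nothing about
what code the JIT actually emits:

```java
final class WideMulSketch {
    /** Low 64 bits of a * b (identical for signed and unsigned operands). */
    static long lowHalf(long a, long b) {
        return a * b;
    }

    /** High 64 bits of a * b with both operands treated as unsigned. */
    static long unsignedHighHalf(long a, long b) {
        // Signed high half, then add b where a has its top bit set and
        // a where b has its top bit set, to undo the sign interpretation.
        return Math.multiplyHigh(a, b)
             + ((a >> 63) & b)
             + ((b >> 63) & a);
    }
}
```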

> I haven't implemented wider multiplication for the squareToLen intrinsic, 
> since it would require much more code due to more corner cases. Also, the 
> squaring algorithm in BigInteger doesn't handle more than 127 integers 
> in one squareToLen call (large integer arrays are divided into smaller 
> parts for squaring, so 1..127 integers are squared at once), which 
> makes all additional off-loop penalties expensive in comparison to loop 
> execution time.

Should we intrinsify squareToLen() at all?  It's only used AFAICS by
C1 and the interpreter when doing integer crypto.  One other thing I
haven't checked: is the multiplyToLen() intrinsic called when
squareToLen() is absent?

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671