[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
Dmitrij
dmitrij.pochepko at bell-sw.com
Wed Sep 6 17:39:13 UTC 2017
On 06.09.2017 15:43, Andrew Haley wrote:
> On 06/09/17 12:50, Dmitrij wrote:
>>
>> On 06.09.2017 12:53, Andrew Haley wrote:
>>> On 05/09/17 18:34, Dmitrij Pochepko wrote:
>>>> As you can see, it's up to 26% worse throughput with wider multiplication.
>>>>
>>>> The reasons for this are:
>>>> 1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic) and
>>>> that can’t be changed within the function signature. Thus we can’t fully
>>>> utilize the potential of 64-bit multiplication.
>>>> 2. The umulh instruction is more expensive than the mul instruction.
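(To illustrate point 1: a rough Java sketch of a mulAdd-style loop, not the
exact BigInteger code. Because the multiplier k is a 32-bit int, every
partial product is only a 32x32 -> 64-bit multiply, so a 64-bit-wide multiply
can't be fully used through this signature:)

    // Sketch of a mulAdd-style loop: out[offset..] += in[0..len) * k.
    // The indexing (least significant word first) is an assumption made to
    // keep the sketch short; only the shape of the inner loop matters.
    static int mulAddSketch(int[] out, int[] in, int offset, int len, int k) {
        final long kLong = k & 0xFFFFFFFFL;      // zero-extend the 32-bit multiplier
        long carry = 0;
        for (int i = 0, j = offset; i < len; i++, j++) {
            long product = (in[i] & 0xFFFFFFFFL) * kLong
                         + (out[j] & 0xFFFFFFFFL) + carry;
            out[j] = (int) product;              // low 32 bits back into out
            carry = product >>> 32;              // high 32 bits carry forward
        }
        return (int) carry;
    }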
>>> Ah, my apologies. I wasn't thinking about mulAdd, but about
>>> squareToLen(). But did you look at the way x86 uses 64-bit
>>> multiplications?
>>>
>> Yes. It uses a single x86 mulq instruction, which performs a 64x64
>> multiplication and places the 128-bit result in two registers. There is no
>> such single instruction on AArch64, and the most effective AArch64
>> instruction sequence I've found doesn't seem to be as fast as mulq.
> I think there is effectively a 64x64 -> 128-bit instruction: it's just
> that you have to represent it as a mul and a umulh. But I take your
> point.
>
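(For reference, a portable Java sketch of what the mul/umulh pair computes.
On x86 a single mulq yields both words; on AArch64 the low word comes from
mul and the high word from umulh. Math.multiplyHigh in JDK 9+ returns the
signed high word, so the unsigned variant is written out by hand here:)

    // Full 128-bit product of two unsigned 64-bit values a and b,
    // returned as two separate longs: the low word and the high word.
    static long mulLow(long a, long b) {
        return a * b;                            // what mul computes
    }

    static long mulHighUnsigned(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        long lo   = aLo * bLo;
        long mid1 = aHi * bLo + (lo >>> 32);
        long mid2 = aLo * bHi + (mid1 & 0xFFFFFFFFL);
        return aHi * bHi + (mid1 >>> 32) + (mid2 >>> 32);   // what umulh computes
    }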
>>> One other thing I
>>> haven't checked: is the multiplyToLen() intrinsic called when
>>> squareToLen() is absent?
>>>
>> That would have been a good alternative, but multiplyToLen is not used in
>> place of squareToLen when squareToLen is not implemented. The Java
>> implementation of squareToLen will eventually be compiled and used instead:
>> http://hg.openjdk.java.net/jdk10/hs/jdk/file/tip/src/java.base/share/classes/java/math/BigInteger.java#l2039
> Please compare your squareToLen with the
> MacroAssembler::multiply_to_len we already have.
>
I've compared them by calling the square and multiply methods and got the
following results (ThunderX):
Benchmark                                 (size, ints)  Mode  Cnt      Score     Error  Units
BigIntegerBench.implMutliplyToLenReflect             1  avgt    5    186.930 ±  14.933  ns/op  (26% slower)
BigIntegerBench.implMutliplyToLenReflect             2  avgt    5    194.095 ±  11.857  ns/op  (12% slower)
BigIntegerBench.implMutliplyToLenReflect             3  avgt    5    233.912 ±   4.229  ns/op  (24% slower)
BigIntegerBench.implMutliplyToLenReflect             5  avgt    5    308.349 ±  20.383  ns/op  (22% slower)
BigIntegerBench.implMutliplyToLenReflect            10  avgt    5    475.839 ±   6.232  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            50  avgt    5   6514.691 ±  76.934  ns/op  (same)
BigIntegerBench.implMutliplyToLenReflect            90  avgt    5  20347.040 ± 224.290  ns/op  (3% slower)
BigIntegerBench.implMutliplyToLenReflect           127  avgt    5  41929.302 ± 181.053  ns/op  (9% slower)
BigIntegerBench.implSquareToLenReflect               1  avgt    5    147.751 ±  12.760  ns/op
BigIntegerBench.implSquareToLenReflect               2  avgt    5    173.804 ±   4.850  ns/op
BigIntegerBench.implSquareToLenReflect               3  avgt    5    187.822 ±  34.027  ns/op
BigIntegerBench.implSquareToLenReflect               5  avgt    5    251.995 ±  19.711  ns/op
BigIntegerBench.implSquareToLenReflect              10  avgt    5    474.489 ±   1.040  ns/op
BigIntegerBench.implSquareToLenReflect              50  avgt    5   6493.768 ±  33.809  ns/op
BigIntegerBench.implSquareToLenReflect              90  avgt    5  19766.524 ±  88.398  ns/op
BigIntegerBench.implSquareToLenReflect             127  avgt    5  38448.202 ± 180.095  ns/op
As we can see, squareToLen is faster than multiplyToLen.
(I've updated the benchmark code at
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench.java)
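For reference, the comparison works roughly like the sketch below: a JMH
benchmark that invokes BigInteger's private implSquareToLen/implMultiplyToLen
through reflection and squares the same operand with each. The method names
and signatures below are assumptions (the real harness is at the link above),
and on JDK 9+ deep reflection into java.math may require
--add-opens java.base/java.math=ALL-UNNAMED:

import java.lang.reflect.Method;
import java.math.BigInteger;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class BigIntegerBenchSketch {
    @Param({"1", "2", "3", "5", "10", "50", "90", "127"})
    int size;                                // operand length in 32-bit ints

    int[] x, z;
    Method squareToLen, multiplyToLen;

    @Setup
    public void setup() throws Exception {
        Random r = new Random(42);
        x = new int[size];
        for (int i = 0; i < size; i++) x[i] = r.nextInt();
        z = new int[2 * size];               // result buffer, twice the operand length
        // Assumed private method names/signatures in java.math.BigInteger:
        squareToLen = BigInteger.class.getDeclaredMethod(
                "implSquareToLen", int[].class, int.class, int[].class, int.class);
        multiplyToLen = BigInteger.class.getDeclaredMethod(
                "implMultiplyToLen", int[].class, int.class, int[].class, int.class, int[].class);
        squareToLen.setAccessible(true);
        multiplyToLen.setAccessible(true);
    }

    @Benchmark
    public Object implSquareToLenReflect() throws Exception {
        return squareToLen.invoke(null, x, size, z, 2 * size);
    }

    @Benchmark
    public Object implMutliplyToLenReflect() throws Exception {
        return multiplyToLen.invoke(null, x, size, x, size, z);   // x * x via the general path
    }
}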
Thanks,
Dmitrij