[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Dmitrij Pochepko dmitrij.pochepko at bell-sw.com
Tue Sep 5 17:34:11 UTC 2017


Hi Andrew,

you can find my attempt at implementing the mulAdd intrinsic using wider 
multiplication here: 
http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/ 
but, as expected, I got slower results on the same benchmark compared 
to the original webrev.01 with 32-bit multiplication.

I've measured results on ThunderX:

webrev.01 version:

Benchmark                      (size)  Mode  Cnt    Score    Error Units
BigIntegerBench.mulAddReflect       1  avgt    5  194.809 ±  1.341 ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  198.242 ±  1.348 ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  201.213 ±  0.670 ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  213.426 ±  7.441 ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  236.396 ±  1.663 ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  432.255 ± 24.718 ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  653.961 ± 10.140 ns/op

webrev.02 version with wider multiplication:

Benchmark                      (size)  Mode  Cnt    Score     Error Units
BigIntegerBench.mulAddReflect       1  avgt    5  196.109 ±   0.663 ns/op
BigIntegerBench.mulAddReflect       2  avgt    5  213.438 ± 124.206 ns/op
BigIntegerBench.mulAddReflect       3  avgt    5  211.683 ±   3.206 ns/op
BigIntegerBench.mulAddReflect       5  avgt    5  217.324 ±   5.827 ns/op
BigIntegerBench.mulAddReflect      10  avgt    5  233.272 ±  21.560 ns/op
BigIntegerBench.mulAddReflect      50  avgt    5  455.337 ± 237.168 ns/op
BigIntegerBench.mulAddReflect     100  avgt    5  826.844 ±   4.972 ns/op

As you can see, throughput is up to 26% worse with the wider 
multiplication.

The reasons for this are:
1. mulAdd uses a 32-bit multiplier (unlike the multiplyToLen intrinsic), 
and that can't be changed without altering the function signature, so we 
can't fully utilize the potential of 64-bit multiplication (see the 
sketch after this list).
2. The umulh instruction is more expensive than the mul instruction.
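
For reference, here is the loop the intrinsic has to reproduce, 
paraphrased from java.math.BigInteger.implMulAdd (not verbatim; 
LONG_MASK written out as 0xffffffffL). Note that the multiplier k 
arrives as a 32-bit int, so each step is a 32x32->64 multiply plus 
64-bit adds:

    // Paraphrased from OpenJDK's java.math.BigInteger; not verbatim.
    static int implMulAdd(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & 0xffffffffL;   // multiplier is only 32 bits wide
        long carry = 0;
        offset = out.length - offset - 1;
        for (int j = len - 1; j >= 0; j--) {
            long product = (in[j] & 0xffffffffL) * kLong
                         + (out[offset] & 0xffffffffL) + carry;
            out[offset--] = (int) product;   // store low 32 bits
            carry = product >>> 32;          // high 32 bits become the carry
        }
        return (int) carry;
    }

To use 64-bit multiplies here, the stub has to pack pairs of 32-bit 
limbs into 64-bit registers and unpack the results again, which is 
exactly the repacking overhead webrev.02 pays for.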

I haven't implemented wider multiplication for the squareToLen 
intrinsic, since it would require much more code due to additional 
corner cases. Also, the squaring algorithm in BigInteger never passes 
more than 127 integers to a single squareToLen call (large integer 
arrays are divided into smaller parts for squaring, so 1..127 integers 
are squared at once), which makes any additional off-loop cost expensive 
in comparison to the loop execution time; see the dispatch sketch below.
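
For context, the 127-int cap comes from the squaring dispatch in 
java.math.BigInteger, which looks roughly like this (paraphrased, with 
the stock threshold values; anything at or above 
KARATSUBA_SQUARE_THRESHOLD is split recursively before squareToLen 
ever runs):

    // Paraphrased from OpenJDK's java.math.BigInteger.square(); not verbatim.
    private BigInteger square() {
        if (signum == 0)
            return ZERO;
        int len = mag.length;
        if (len < KARATSUBA_SQUARE_THRESHOLD) {         // 128
            int[] z = squareToLen(mag, len, null);      // intrinsified path
            return new BigInteger(trustedStripLeadingZeroInts(z), 1);
        } else if (len < TOOM_COOK_SQUARE_THRESHOLD) {  // 216
            return squareKaratsuba();
        } else {
            return squareToomCook3();
        }
    }

So squareToLen only ever sees 1..127 ints, and any fixed setup cost 
outside the main loop is amortized over at most that much work.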

At this point I've run out of ideas for how we could improve the 
performance 3x for these intrinsics. I understand one can do better with 
64-bit arithmetic for the intrinsics you implemented, but squareToLen 
and mulAdd look different. Do you have other suggestions, or can we 
proceed with the initial version (webrev.01)?

Thanks,

Dmitrij

On 01.09.2017 10:51, Andrew Haley wrote:
> On 31/08/17 23:46, Dmitrij Pochepko wrote:
>> I tried a number of initial versions first. I also tried to use wider
>> multiplication via umulh (and larger load instructions like ldp/ldr),
>> but after measuring all the versions I found that the version I
>> initially sent was the fastest (I was measuring on ThunderX, which is
>> what I have on hand). It might be because of the many additional
>> ror(..., 32) operations the other versions need to convert values from
>> the initial layout to registers and back. Another reason might be the
>> more complex overall logic and larger code, which causes more icache
>> lines to be loaded. Or even some umulh specifics on some CPUs. So,
>> after measuring, I abandoned those versions in the middle of
>> development and polished the fastest one.
>> I still have some raw, unpolished development versions of those
>> approaches (I'm not sure I have debugged versions saved, but they at
>> least convey the overall idea). I attached squares_v2.3.1.diff: an
>> early version which uses mul/umulh for just one case. It was
>> surprisingly slower for that case than the version I sent for review,
>> so I abandoned the approach. I also tried a version with large load
>> instructions (ldp/ldr): squares_v1.diff, and it was also slower (it
>> has a different, slower mul_add loop implementation, but I was
>> comparing against the same version, which uses ldrw only).
>>
>> I'm not sure whether I should use 64-bit multiplications and/or
>> 64/128-bit loads. I can try going back to one of those versions and
>> polishing it, but I'll probably get slower results again on the h/w I
>> have, and it's not clear whether it would be faster on any other h/w
>> (which one? It takes a lot of time to iteratively improve and measure
>> every version on the respective h/w).
> I'm using Applied Micro hardware for my testing at the moment.
>
> I did the speed testing for Montgomery multiply on ThunderX.  I
> appreciate that it's difficult to get the 64-bit version right and
> fast, but you should see about a 3-3.5x speedup over the pure Java
> version if you get it right.  That's what I saw when I did the
> Montgomery multiply.  You do have to pipeline the loads and the
> multiplies to avoid stalls.
>
> Be aware that squareToLen is not used at all when running the
> RSA benchmark with C2.
>


