[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Thu Aug 31 22:46:39 UTC 2017

On 31.08.2017 20:27, Andrew Haley wrote:
> On 31/08/17 14:39, Dmitrij Pochepko wrote:
>> please review a patch for "8186915 - AARCH64: Intrinsify squareToLen and
>> mulAdd" which adds respective intrinsics.
>>
>> webrev: http://cr.openjdk.java.net/~dpochepk/8186915/webrev.01/
>> CR: https://bugs.openjdk.java.net/browse/JDK-8186915
>>
>> With these intrinsics implemented I see 8% improvement in specjvm2008
>> crypto.rsa: 2333.13 ops/m vs 2520.11 ops/m.
> I don't see anything like that.  I see an improvement of 1.6%, which is
> what I'd expect, given this profile:
>
> samples  cum. samples  %        cum. %  symbol name
> 31866443 31866443      59.7443  59.7443 montgomerySquare
> 6125600  37992043      11.4845  71.2287 montgomeryMultiply
> 4036511  42028554       7.5678  78.7965 java.math.MutableBigInteger java.math.MutableBigInteger.divideMagnitude(java.math.MutableBigInteger, java.math.MutableBigInteger, boolean)~1
> 2056787  44085341       3.8561  82.6527 java.math.BigInteger java.math.BigInteger.oddModPow(java.math.BigInteger, java.math.BigInteger)~2
> 1145996  45231337       2.1486  84.8012 Ljava/math/MutableBigInteger;divideMagnitude(Ljava/math/MutableBigInteger;Ljava/math/MutableBigInteger;Z)Ljava/math/MutableBigInteger;%32
> 1140132  46371469       2.1376  86.9388 int[] java.math.BigInteger.montReduce(int[], int[], int, int)~2
> 558960   46930429       1.0480  87.9867 java.security.Provider$Service java.security.Provider.getService(java.lang.String, java.lang.String)
>
> after your patch, I get:
>
> samples  cum. samples  %        cum. %  symbol name
> 32574982 32574982      60.3583  60.3583 montgomerySquare
> 6196936  38771918      11.4823  71.8407 montgomeryMultiply
> 5103970  43875888       9.4572  81.2978 java.math.MutableBigInteger java.math.MutableBigInteger.divideMagnitude(java.math.MutableBigInteger, java.math.MutableBigInteger, boolean)
> 1991144  45867032       3.6894  84.9872 java.math.BigInteger java.math.BigInteger.oddModPow(java.math.BigInteger, java.math.BigInteger)~1
> 792336   46659368       1.4681  86.4554 mulAdd
> 586130   47245498       1.0860  87.5414 java.security.Provider$Service java.security.Provider.getService(java.lang.String, java.lang.String)
>
> So we're seeing a boost to the performance of BigInteger.montReduce,
> which is dominated by mulAdd, which makes sense, but it's not a very
> large part of the total.
>
> Your mul_add routine is less efficient than it should be.  It uses
> 32-bit multiply operations when it could use 64-bit ones, just as the
> multiply_to_len does.  Your square_to_len routine has the same
> problem.
>
> There is an x86 example of how square_to_len should be done.
>
Hi,

I tried a number of early versions first. I also tried wider
multiplication via umulh (and wider load instructions such as ldp/ldr),
but after measuring all versions I found that the version I initially
sent was the fastest (I measured on a ThunderX, which is what I have at
hand). One possible reason is the many additional ror(..., 32)
operations the other versions need to convert values from the in-memory
layout to the register layout and back. Another might be the more
complex overall logic and larger code, which causes more icache lines to
be loaded. Or even some umulh specifics on some CPUs. So, after
measuring, I abandoned these versions in the middle of development and
polished the fastest one.
I still have some raw, unpolished development versions of these
approaches (I'm not sure I saved debugged versions, but they at least
show the overall idea).
I attached squares_v2.3.1.diff: an early version which uses mul/umulh
for just one case. It was surprisingly slower for this case than the
version I sent for review, so I abandoned this approach.
I also tried a version with wider load instructions (ldp/ldr):
squares_v1.diff. It was also slower (it has another, slower mul_add
loop implementation, but I compared it against the same version using
ldrw only).
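
To make the 32-bit versus 64-bit trade-off concrete, here is a minimal Java sketch. It is not the AArch64 stub code and not either attached diff; the class and method names are hypothetical, and `offset` is assumed to be the index of the least significant limb in `out`. `mulAdd32` follows the one-limb-per-step shape of the plain-Java loop, while `mulAdd64` packs two big-endian limbs into one 64-bit word and uses a 64x64->128 multiply (`Math.multiplyHigh`, Java 9+), which is roughly what mul/umulh would do. The packing step is where the extra ror(..., 32) work comes from in assembly.

```java
// Illustrative sketch only; names are hypothetical, not JDK or stub code.
final class MulAddSketch {
    private static final long LONG_MASK = 0xffffffffL;

    // One 32-bit limb per iteration: out[..] += in[..] * k, returns the carry.
    // 'offset' is assumed to index the least significant limb of 'out'.
    static int mulAdd32(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & LONG_MASK;
        long carry = 0;
        for (int j = len - 1; j >= 0; j--) {
            long product = (in[j] & LONG_MASK) * kLong
                         + (out[offset] & LONG_MASK) + carry;
            out[offset--] = (int) product;
            carry = product >>> 32;
        }
        return (int) carry;
    }

    // Two limbs per iteration using a 64x64->128 multiply.
    static int mulAdd64(int[] out, int[] in, int offset, int len, int k) {
        long kLong = k & LONG_MASK;
        long carry = 0;
        int j = len - 1;
        for (; j >= 1; j -= 2, offset -= 2) {
            // Limbs are big-endian: in[j-1] is the more significant one.
            // A little-endian 64-bit load of this pair would place the limbs
            // the other way round, hence the extra 32-bit rotate in assembly.
            long w = ((in[j - 1] & LONG_MASK) << 32) | (in[j] & LONG_MASK);
            long o = ((out[offset - 1] & LONG_MASK) << 32)
                   | (out[offset] & LONG_MASK);

            long lo = w * kLong;
            // Unsigned high half: kLong is non-negative, so only a
            // "negative" w needs the signed-to-unsigned correction.
            long hi = Math.multiplyHigh(w, kLong) + ((w >> 63) & kLong);

            long sum = lo + o;
            if (Long.compareUnsigned(sum, lo) < 0) hi++;    // carry out of lo + o
            long total = sum + carry;
            if (Long.compareUnsigned(total, sum) < 0) hi++; // carry out of + carry

            out[offset] = (int) total;
            out[offset - 1] = (int) (total >>> 32);
            carry = hi; // always fits in 32 bits
        }
        if (j == 0) { // odd length: finish with one 32-bit step
            long product = (in[0] & LONG_MASK) * kLong
                         + (out[offset] & LONG_MASK) + carry;
            out[offset] = (int) product;
            carry = product >>> 32;
        }
        return (int) carry;
    }
}
```

The 64-bit variant halves the number of multiplies but adds packing, unpacking and explicit carry handling per step, so whether it is faster depends on the hardware, which is consistent with the measurements described above.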

I'm not sure whether I should use 64-bit multiplications and/or
64/128-bit loads. I can return to one of these versions and try to
polish it, but I'll probably get slower results again on the h/w I
have, and it's not clear whether it would be faster on any other h/w
(which one? It takes a lot of time to iteratively improve and measure
every version on the respective h/w).

Thanks,
Dmitrij
-------------- next part --------------
A non-text attachment was scrubbed...
Name: squares_v1.diff
Type: text/x-patch
Size: 15640 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20170901/717f77cf/squares_v1-0001.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: squares_v2.3.1.diff
Type: text/x-patch
Size: 8339 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20170901/717f77cf/squares_v2.3.1-0001.diff>