[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Thu Aug 31 22:46:39 UTC 2017
On 31.08.2017 20:27, Andrew Haley wrote:
> On 31/08/17 14:39, Dmitrij Pochepko wrote:
>> please review a patch for "8186915 - AARCH64: Intrinsify squareToLen and
>> mulAdd" which adds respective intrinsics.
>>
>> webrev: http://cr.openjdk.java.net/~dpochepk/8186915/webrev.01/
>> CR: https://bugs.openjdk.java.net/browse/JDK-8186915
>>
>> With these intrinsics implemented I see 8% improvement in specjvm2008
>> crypto.rsa: 2333.13 ops/m vs 2520.11 ops/m.
> I don't see anything like that. I see an improvement of 1.6%, which is
> what I'd expect, given this profile:
>
> samples cum. samples % cum. % symbol name
> 31866443 31866443 59.7443 59.7443 montgomerySquare
> 6125600 37992043 11.4845 71.2287 montgomeryMultiply
> 4036511 42028554 7.5678 78.7965 java.math.MutableBigInteger java.math.MutableBigInteger.divideMagnitude(java.math.MutableBigInteger, java.math.MutableBigInteger, boolean)~1
> 2056787 44085341 3.8561 82.6527 java.math.BigInteger java.math.BigInteger.oddModPow(java.math.BigInteger, java.math.BigInteger)~2
> 1145996 45231337 2.1486 84.8012 Ljava/math/MutableBigInteger;divideMagnitude(Ljava/math/MutableBigInteger;Ljava/math/MutableBigInteger;Z)Ljava/math/MutableBigInteger;%32
> 1140132 46371469 2.1376 86.9388 int[] java.math.BigInteger.montReduce(int[], int[], int, int)~2
> 558960 46930429 1.0480 87.9867 java.security.Provider$Service java.security.Provider.getService(java.lang.String, java.lang.String)
>
> after your patch, I get:
>
> samples cum. samples % cum. % symbol name
> 32574982 32574982 60.3583 60.3583 montgomerySquare
> 6196936 38771918 11.4823 71.8407 montgomeryMultiply
> 5103970 43875888 9.4572 81.2978 java.math.MutableBigInteger java.math.MutableBigInteger.divideMagnitude(java.math.MutableBigInteger, java.math.MutableBigInteger, boolean)
> 1991144 45867032 3.6894 84.9872 java.math.BigInteger java.math.BigInteger.oddModPow(java.math.BigInteger, java.math.BigInteger)~1
> 792336 46659368 1.4681 86.4554 mulAdd
> 586130 47245498 1.0860 87.5414 java.security.Provider$Service java.security.Provider.getService(java.lang.String, java.lang.String)
>
> So we're seeing a boost to the performance of BigInteger.montReduce,
> which is dominated by mulAdd, which makes sense, but it's not a very
> large part of the total.
>
> Your mul_add routine is less efficient than it should be. It uses
> 32-bit multiply operations when it could use 64-bit ones, just as the
> multiply_to_len does. Your square_to_len routine has the same
> problem.
>
> There is an x86 example of how square_to_len should be done.
>
Hi,
I tried a number of initial versions first, including wider
multiplication via umulh (and larger load instructions such as ldp/ldr),
but after measuring all of them I found that the version I initially
sent was the fastest (measured on the ThunderX I have at hand). One
likely reason is the large number of additional ror(..., 32) operations
the other versions need to convert values between the initial layout and
the register layout and back. Another is the more complex overall logic
and larger code, which pulls in more icache lines. It could also be some
umulh peculiarity on certain CPUs. So, after measuring, I abandoned
those versions mid-development and polished the fastest one. I still
have some raw, unpolished development versions of those approaches
(I'm not sure I kept debugged versions, but they at least convey the
overall idea).
I attached squares_v2.3.1.diff, an early version that uses mul/umulh
for just one case. Even for that case it was surprisingly slower than
the version I sent for review, so I abandoned this approach.
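
(An illustrative aside, not from either patch: the appeal of mul/umulh
is that one 64-bit multiply pair yields a full 64x64->128-bit product,
where a 32-bit loop needs four partial products plus carry handling per
word pair. The sketch below models the AArch64 umulh instruction in
plain Java using 32-bit halves, which makes that cost visible; the class
and method names are mine.)

```java
// Hypothetical sketch: portable model of AArch64's `umulh`, which returns
// the high 64 bits of an unsigned 64x64 -> 128-bit product. Doing the same
// with 32-bit halves takes four multiplies plus carry propagation -- the
// per-word work a 32-bit (ldrw-based) multiply loop performs instead.
public class UmulhSketch {
    static long umulh(long a, long b) {
        long aLo = a & 0xffffffffL, aHi = a >>> 32;
        long bLo = b & 0xffffffffL, bHi = b >>> 32;
        long mid = aHi * bLo + ((aLo * bLo) >>> 32);   // two partial products
        long t   = aLo * bHi + (mid & 0xffffffffL);    // third partial product
        return aHi * bHi + (mid >>> 32) + (t >>> 32);  // fourth, plus carries
    }

    public static void main(String[] args) {
        // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the high half is 2^64 - 2.
        long hi = umulh(-1L, -1L);  // -1L reinterpreted as unsigned 2^64 - 1
        System.out.println(Long.toUnsignedString(hi));
    }
}
```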
I also tried a version with larger load instructions (ldp/ldr):
squares_v1.diff. It was slower too (it contains another, slower mul_add
loop implementation, but I compared it against the same baseline, which
uses ldrw only).
I'm not sure whether I should use 64-bit multiplications and/or
64/128-bit loads. I can go back to one of those versions and try to
polish it, but I'll probably get slower results again on the hardware I
have, and it's not clear whether it would be faster on any other
hardware (which one? It takes a lot of time to iteratively improve and
measure every version on the respective hardware).
Thanks,
Dmitrij
-------------- next part --------------
A non-text attachment was scrubbed...
Name: squares_v1.diff
Type: text/x-patch
Size: 15640 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20170901/717f77cf/squares_v1-0001.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: squares_v2.3.1.diff
Type: text/x-patch
Size: 8339 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20170901/717f77cf/squares_v2.3.1-0001.diff>