[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Mon Sep 25 15:46:43 UTC 2017

Hi,

please take a look at v2. I've modified the code so that squareToLen 
calls multiplyToLen. An additional benefit: no more changes to the 
common (shared) code. I've left mulAdd unchanged.
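
To illustrate the idea only (this is a plain Java sketch, not the AArch64 
stub code in the webrev): squaring can be expressed as a general multiply 
with both operands pointing at the same array. The class name, the 
simplified signatures and the fresh result allocation are assumptions made 
for this example; they do not match the real BigInteger methods exactly.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; names and signatures are illustrative only.
final class SquareViaMultiply {
    private static final long MASK = 0xffffffffL;

    // Schoolbook multiply of two big-endian magnitudes (most significant
    // 32-bit word first). Always allocates a fresh, zeroed result array.
    static int[] multiplyToLen(int[] x, int xlen, int[] y, int ylen) {
        int[] z = new int[xlen + ylen];
        for (int i = xlen - 1; i >= 0; i--) {
            long carry = 0;
            for (int j = ylen - 1, k = i + j + 1; j >= 0; j--, k--) {
                // 32x32 -> 64 bit product plus previous partial sum and carry;
                // the sum cannot exceed 2^64 - 1, so it fits in a long's bits.
                long product = (x[i] & MASK) * (y[j] & MASK)
                             + (z[k] & MASK) + carry;
                z[k] = (int) product;
                carry = product >>> 32;
            }
            z[i] = (int) carry;
        }
        return z;
    }

    // Squaring delegates to the general multiply with x as both operands.
    static int[] squareToLen(int[] x, int len) {
        return multiplyToLen(x, len, x, len);
    }

    public static void main(String[] args) {
        int[] x = {0xFFFFFFFF};
        System.out.println(Arrays.toString(squareToLen(x, 1)));
    }
}
```

The trade-off is that the general multiply does not exploit the symmetry 
of squaring, but only one routine has to be maintained.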

http://cr.openjdk.java.net/~dpochepk/8186915/webrev.02/

I've also rerun the benchmark on ThunderX and got these results: 
http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt

Thanks,
Dmitrij

On 22.09.2017 11:12, Andrew Haley wrote:
> On 21/09/17 19:19, Dmitrij Pochepko wrote:
>
>> thank you for looking into this and trying it on APM (I have no access
>> to this h/w).
>>
>>
>> I've used the modified benchmark you sent and run it on ThunderX, and
>> implSquareToLen still shows better results than implMultiplyToLen in
>> most cases on ThunderX (up to 10% at size=127; results:
>> http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).
> For 10%, it's not worth doing, given the risks and that it's not used
> by crypto operations when C2-compiled.
>
>> However, since the performance difference on APM is larger than on
>> ThunderX, I think it is more logical to go back to your idea and
>> call the multiplyToLen intrinsic inside squareToLen. An alternative
>> solution is to generate different code for APM and ThunderX, but I
>> prefer to have a single version given such a relatively small
>> difference in performance, and it's still much faster than having no
>> intrinsic at all.  What do you think?
> Yes.  Calling multiplyToLen would be fine.
>
>> fyi: regarding sizes 200 and 1000 - it's incorrect to measure these
>> sizes for squareToLen, because squareToLen is never called for sizes
>> above 127 (I've mentioned this before).
> It's not incorrect: it's a test for asymptotic behaviour.