RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]

Thu Jun 4 18:41:53 UTC 2015

On 04/06/15 19:32, Anthony Scarpino wrote:
> On 06/04/2015 10:08 AM, Andrew Haley wrote:
>> I'm sorry this is a rather long email, and I pray for your patience.
>>
>> I'm getting close to something I can put forward for review.  The
>> performance is encouraging.
>>
>> [ Some background: The kernel of RSA and Diffie-Hellman key exchange
>> is Montgomery multiplication.  Optimizing RSA basically comes down to
>> optimizing Montgomery multiplication.  The core of OpenJDK's RSA is
>> BigInteger.oddModPow().  ]
>>
>> My Montgomery multiply routine (mostly portable C, with a small
>> assembly-language insert) executes a 1024-bit multiply/reduce in about
>> 2000ns; the hand-coded assembly language equivalent in OpenSSL is
>> faster (as you'd expect) but not very much faster: about 1700ns.  In
>> other words, compiled C is only about 20% slower.
>>
>> Firstly, this is pretty remarkable performance by GCC (Yay!  Go GCC!)
>> and it shows I'm on the right track.  It also shows that there isn't a
>> huge amount to be gained by hand-coding Montgomery multiplication, but
>> anybody who fancies their hand can try to improve on GCC.  This is
>> also very nice because porting it to non-x86 hardware is fairly
>> straightforward -- certainly far easier than writing a large assembly-
>> language routine.
> 
> I'm sure when I see the code it will become clearer, but I'm guessing 
> you are taking a C version of Montgomery multiply, using GCC to turn it 
> into assembly with some cpu acceleration instructions and putting that 
> into an intrinsic?

I have written a Montgomery multiply routine: it is mostly C, with a
tiny bit (really, just a few instructions) of inline assembly
language.  It's called from a HotSpot intrinsic.

>> It would be nice to keep all of the data in an array of jlongs for the
>> duration of oddModPow().  Here's one idea: I could write a version of
>> oddModPow() which is guaranteed never to use the Java version of the
>> Montgomery multiply.  This would use a JNI method which calls the
>> native Montgomery multiply routine, guaranteeing that that we always
>> use that native routine, even from the interpreter and C1.  Then I
>> could keep all the internal state in native word order, and all this
>> word-swapping would just go away.  This would have the additional
>> benefit that it would be faster when using the interpreter and C1.
>> So, we'd have two versions of oddModPow() in BigInteger, and switch
>> between them depending on whether the platform had support for a
>> native Montgomery multiplication.
> 
> I'm assuming the JNI method is closer to OpenSSL numbers?

It's the same Montgomery multiplication as the intrinsic, but called
from a JNI method instead of a HotSpot intrinsic.

>> The downside of having two versions of oddModPow() would, of course,
>> be some code duplication.
>>
>> Or am I just making too much fuss about this?  Maybe I should be happy
>> with what I've got.
> 
> Looking at the 2k RSA sign numbers 2.3x better after your patch vs 
> hs-comp.  I'd be happy with that improvement.

OK.  I kinda guessed that would be the response, really.

Andrew.