RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]

Thu Jun 4 18:32:32 UTC 2015

On 06/04/2015 10:08 AM, Andrew Haley wrote:
> I'm sorry this is a rather long email, and I pray for your patience.
>
> I'm getting close to something I can put forward for review.  The
> performance is encouraging.
>
> [ Some background: The kernel of RSA and Diffie-Hellman key exchange
> is Montgomery multiplication.  Optimizing RSA basically comes down to
> optimizing Montgomery multiplication.  The core of OpenJDK's RSA is
> BigInteger.oddModPow().  ]
>
> My Montgomery multiply routine (mostly portable C, with a small
> assembly-language insert) executes a 1024-bit multiply/reduce in about
> 2000ns; the hand-coded assembly language equivalent in OpenSSL is
> faster (as you'd expect) but not very much faster: about 1700ns.  In
> other words, compiled C is only about 20% slower.
>
> Firstly, this is pretty remarkable performance by GCC (Yay!  Go GCC!)
> and it shows I'm on the right track.  It also shows that there isn't a
> huge amount to be gained by hand-coding Montgomery multiplication, but
> anybody who fancies their hand can try to improve on GCC.  This is
> also very nice because porting it to non-x86 hardware is fairly
> straightforward -- certainly far easier than writing a large assembly-
> language routine.

I'm sure when I see the code it will become clearer, but I'm guessing 
you are taking a C version of Montgomery multiply, using GCC to turn it 
into assembly with some cpu acceleration instructions and putting that 
into an intrinsic?

>
> Here are some numbers for comparison.
>
> Today's hs-comp:
>
>                    sign    verify    sign/s verify/s
> rsa  512 bits 0.000133s 0.000009s   7508.5 112819.1
> rsa 1024 bits 0.000667s 0.000028s   1498.6  35497.2
> rsa 2048 bits 0.003867s 0.000097s    258.6  10342.7
> rsa 4096 bits 0.026383s 0.000357s     37.9   2799.8
>
> After my patch:
>
>                    sign    verify    sign/s verify/s
> rsa  512 bits 0.000071s 0.000005s  14127.3 204112.4
> rsa 1024 bits 0.000292s 0.000013s   3424.5  74204.1
> rsa 2048 bits 0.001628s 0.000045s    614.4  22399.7
> rsa 4096 bits 0.010966s 0.000163s     91.2   6117.8
>
> So, it's about twice as fast we have at present.
>
> [ Note that this comparison includes the latest "8081778: Use Intel
> x64 CPU instructions for RSA acceleration" patch. ]
>
> However, even after my patch OpenJDK is still somewhat slower than
> OpenSSL, which is:
>
>                    sign    verify    sign/s verify/s
> rsa  512 bits 0.000048s 0.000004s  20687.1 257399.4
> rsa 1024 bits 0.000189s 0.000011s   5288.3  91711.5
> rsa 2048 bits 0.001174s 0.000037s    851.7  27367.2
> rsa 4096 bits 0.008475s 0.000137s    118.0   7305.4
>
> [ I am assuming that OpenSSL represents something like the "speed of
> light" for RSA on x86: this is carefully hand-coded assembly language
> and C, hand tuned.  Getting anywhere near OpenSSL is a major win. ]
>
> Here's my problem:
>
> Some of this slowdown is due to the overhead of using the JCE, but not
> very much.  Quite a lot of it, however, is due to the fact that the
> scratch memory used in oddModPow() is a big-endian array of jints.  I
> have to convert the big-endian jints into native jlongs to do the
> multiply on little-endian x86-64.

Given you have taken the effort to see the overheads caused by JCE, I'd 
be interested in see that data to see if there is anything we can do 
about it.

> If I do the word reversal during the multiply (i.e. keep all data in
> memory in little-endian form, swap words when reading and writing to
> memory) the performance is horrible: a 1024-bit multiply takes 3000ns,
> 50% longer.  This perhaps isn't very surprising: if you do the
> word-reversal before the multiplication you have O(N) swaps, if you do
> it during the multiplication you have O(N^2).
>
> I have found that the best thing to do is to word reverse all the data
> in memory into temporary little-endian arrays and do the work on them.
> It's much faster, but still really is very annoying: for 1024-bit RSA
> the word reversal is 14% of the total runtime.
>
> It would be nice to keep all of the data in an array of jlongs for the
> duration of oddModPow().  Here's one idea: I could write a version of
> oddModPow() which is guaranteed never to use the Java version of the
> Montgomery multiply.  This would use a JNI method which calls the
> native Montgomery multiply routine, guaranteeing that that we always
> use that native routine, even from the interpreter and C1.  Then I
> could keep all the internal state in native word order, and all this
> word-swapping would just go away.  This would have the additional
> benefit that it would be faster when using the interpreter and C1.
> So, we'd have two versions of oddModPow() in BigInteger, and switch
> between them depending on whether the platform had support for a
> native Montgomery multiplication.

I'm assuming the JNI method is closer to OpenSSL numbers?

>
> The downside of having two versions of oddModPow() would, of course,
> be some code duplication.
>
> Or am I just making too much fuss about this?  Maybe I should be happy
> with what I've got.

Looking at the 2k RSA sign numbers 2.3x better after your patch vs 
hs-comp.  I'd be happy with that improvement.

The JNI version maybe be better as a separate JCE provider you can make 
available for situations where appropriate.  Unless there turns out to 
be something very compelling I wouldn't think we'd integrate it as part 
security providers given our focus is to use intrinsics rather than JNI 
at this time.

>
> Thank you for reading this far,
>
> Andrew.
>