RFR: 8046943: RSA Acceleration

Mon Jun 15 16:50:35 UTC 2015

[Resent:  I dropped the security list by mistake]

On 06/12/2015 04:06 AM, Andrew Haley wrote:
> http://cr.openjdk.java.net/~aph/8046943-hs-1/
> http://cr.openjdk.java.net/~aph/8046943-jdk-1/

Please don't use the JEP 246 in the comments when you push. There are a 
number of changesets related to 246 and I'd rather not have one 
associated with it.  We can link the a separate bug id to the JEP.

>
> Before:
>
> rsa  512 bits 0.000134s 0.000009s   7444.2 111853.3
> rsa 1024 bits 0.000674s 0.000029s   1483.9  34456.9
> rsa 2048 bits 0.003945s 0.000100s    253.5   9994.7
> rsa 4096 bits 0.027015s 0.000373s     37.0   2681.6
>
> After:
>
> rsa  512 bits 0.000059s 0.000004s  17022.3 224141.1
> rsa 1024 bits 0.000258s 0.000013s   3871.5  78851.0
> rsa 2048 bits 0.001506s 0.000042s    664.1  23844.3
> rsa 4096 bits 0.010214s 0.000153s     97.9   6516.0
>
> There are some issues we need to discuss.
>
> The runtime code is in sharedRuntime_x86_64.cpp even though it's
> mostly written in C.  My thinking here is that porters will want to
> tweak the code for their back ends, or maybe write it in assembler.
> It's little-endian and would need some reworking for big-endian
> machines.  But maybe there should be a common version of the code in
> share/ ?
>
> Should it be in optoRuntime instead?  It could be called from C1 or
> even the interpreter, but it's C2-only right now.
>
> I've done nothing about 32-bit targets, but I think they would
> benefit.
>
> I had to make some small changes to the Java library.
>
> 1.  Montgomery multiplication works on whole words.  Given that this
> is a 64-bit implementation I had to force the size of the arrays in
> oddModPow to be a multiple of 64 bits, i.e. the length of the arrays
> must be even.  Given that RSA and Diffie-Hellman keys are usually a
> multiple of 64 bits in length I don't think this will be a real
> performance issue in practice.
>
> 2.  I needed a 64-bit inverse rather than a 32-bit inverse.  This is a
> single computation, done once for each modular exponentiation, so it
> makes an immeasurably small difference to the total runtime.
>
> 3.  I fused squaring and multiplication into a single
> montgomeryMultiply method.  This has two advantages.  Firstly, we only
> need a single intrinsic, and secondly the decision whether to use
> multiply or squaring can be made in the runtime library.  If the
> target does not support the montgomeryMultiply intrinsic there is no
> slowdown when using C2 because it removes the if (a == b) test in
>
>          if (a == b) {
>              product = squareToLen(a, len, product);
>          } else {
>              product = multiplyToLen(a, len, b, len, product);
>          }

I don't agree with fusing them together.  I think there should two 
separate intrinsics.  For one, SPARC has a montsqr and montmul 
instructions.  Additionally if someone wants to call montgomerySquare, 
they should be able to call it directly with it's needed number of 
arguments and not pass 'a' twice to satisfy an internal if().

Tony