RFR: 8046943: RSA Acceleration
Anthony Scarpino
anthony.scarpino at oracle.com
Mon Jun 15 16:38:44 UTC 2015
On 06/12/2015 04:06 AM, Andrew Haley wrote:
> http://cr.openjdk.java.net/~aph/8046943-hs-1/
> http://cr.openjdk.java.net/~aph/8046943-jdk-1/
Please don't use the JEP 246 in the comments when you push. There are a
number of changesets related to 246 and I'd rather not have one
associated with it. We can link the a separate bug id to the JEP.
>
> Before:
>
> rsa 512 bits 0.000134s 0.000009s 7444.2 111853.3
> rsa 1024 bits 0.000674s 0.000029s 1483.9 34456.9
> rsa 2048 bits 0.003945s 0.000100s 253.5 9994.7
> rsa 4096 bits 0.027015s 0.000373s 37.0 2681.6
>
> After:
>
> rsa 512 bits 0.000059s 0.000004s 17022.3 224141.1
> rsa 1024 bits 0.000258s 0.000013s 3871.5 78851.0
> rsa 2048 bits 0.001506s 0.000042s 664.1 23844.3
> rsa 4096 bits 0.010214s 0.000153s 97.9 6516.0
>
> There are some issues we need to discuss.
>
> The runtime code is in sharedRuntime_x86_64.cpp even though it's
> mostly written in C. My thinking here is that porters will want to
> tweak the code for their back ends, or maybe write it in assembler.
> It's little-endian and would need some reworking for big-endian
> machines. But maybe there should be a common version of the code in
> share/ ?
>
> Should it be in optoRuntime instead? It could be called from C1 or
> even the interpreter, but it's C2-only right now.
>
> I've done nothing about 32-bit targets, but I think they would
> benefit.
>
> I had to make some small changes to the Java library.
>
> 1. Montgomery multiplication works on whole words. Given that this
> is a 64-bit implementation I had to force the size of the arrays in
> oddModPow to be a multiple of 64 bits, i.e. the length of the arrays
> must be even. Given that RSA and Diffie-Hellman keys are usually a
> multiple of 64 bits in length I don't think this will be a real
> performance issue in practice.
>
> 2. I needed a 64-bit inverse rather than a 32-bit inverse. This is a
> single computation, done once for each modular exponentiation, so it
> makes an immeasurably small difference to the total runtime.
>
> 3. I fused squaring and multiplication into a single
> montgomeryMultiply method. This has two advantages. Firstly, we only
> need a single intrinsic, and secondly the decision whether to use
> multiply or squaring can be made in the runtime library. If the
> target does not support the montgomeryMultiply intrinsic there is no
> slowdown when using C2 because it removes the if (a == b) test in
>
> if (a == b) {
> product = squareToLen(a, len, product);
> } else {
> product = multiplyToLen(a, len, b, len, product);
> }
I don't agree with fusing them together. I think there should two
separate intrinsics. For one, SPARC has a montsqr and montmul
instructions. Additionally if someone wants to call montgomerySquare,
they should be able to call it directly with it's needed number of
arguments and not pass 'a' twice to satisfy an internal if().
Tony
More information about the hotspot-compiler-dev
mailing list