RFR: 8046943: RSA Acceleration

Fri Jun 12 11:06:22 UTC 2015

http://cr.openjdk.java.net/~aph/8046943-hs-1/
http://cr.openjdk.java.net/~aph/8046943-jdk-1/

Before:

rsa  512 bits 0.000134s 0.000009s   7444.2 111853.3
rsa 1024 bits 0.000674s 0.000029s   1483.9  34456.9
rsa 2048 bits 0.003945s 0.000100s    253.5   9994.7
rsa 4096 bits 0.027015s 0.000373s     37.0   2681.6

After:

rsa  512 bits 0.000059s 0.000004s  17022.3 224141.1
rsa 1024 bits 0.000258s 0.000013s   3871.5  78851.0
rsa 2048 bits 0.001506s 0.000042s    664.1  23844.3
rsa 4096 bits 0.010214s 0.000153s     97.9   6516.0

There are some issues we need to discuss.

The runtime code is in sharedRuntime_x86_64.cpp even though it's
mostly written in C.  My thinking here is that porters will want to
tweak the code for their back ends, or maybe write it in assembler.
It's little-endian and would need some reworking for big-endian
machines.  But maybe there should be a common version of the code in
share/ ?

Should it be in optoRuntime instead?  It could be called from C1 or
even the interpreter, but it's C2-only right now.

I've done nothing about 32-bit targets, but I think they would
benefit.

I had to make some small changes to the Java library.

1.  Montgomery multiplication works on whole words.  Given that this
is a 64-bit implementation I had to force the size of the arrays in
oddModPow to be a multiple of 64 bits, i.e. the length of the arrays
must be even.  Given that RSA and Diffie-Hellman keys are usually a
multiple of 64 bits in length I don't think this will be a real
performance issue in practice.

2.  I needed a 64-bit inverse rather than a 32-bit inverse.  This is a
single computation, done once for each modular exponentiation, so it
makes an immeasurably small difference to the total runtime.

3.  I fused squaring and multiplication into a single
montgomeryMultiply method.  This has two advantages.  Firstly, we only
need a single intrinsic, and secondly the decision whether to use
multiply or squaring can be made in the runtime library.  If the
target does not support the montgomeryMultiply intrinsic there is no
slowdown when using C2 because it removes the if (a == b) test in

        if (a == b) {
            product = squareToLen(a, len, product);
        } else {
            product = multiplyToLen(a, len, b, len, product);
        }

Andrew.