RFR: 8294248: Use less limbs for P256 in EC implementation
Xue-Lei Andrew Fan
xuelei at openjdk.org
Fri Sep 23 07:43:12 UTC 2022
On Thu, 22 Sep 2022 20:40:08 GMT, Xue-Lei Andrew Fan <xuelei at openjdk.org> wrote:
> Hi,
>
> Please review this performance improvement for Secp256R1 implementation in OpenJDK. With this update, there is an about 20% performance improvement for Secp256R1 key generation and signature.
>
> Basically, 256 bits EC curves could use 9 integer limbs for the computation. The current implementation use 10 limbs instead. By reducing the number of limbs, the implementation could benefit from less integer computation (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the performance.
>
> Here are the benchmark numbers without the patch:
>
> Benchmark (messageLength) Mode Cnt Score Error Units
> Signatures.sign 64 thrpt 15 1.414 ± 0.022 ops/ms
> Signatures.sign 512 thrpt 15 1.418 ± 0.004 ops/ms
> Signatures.sign 2048 thrpt 15 1.419 ± 0.005 ops/ms
> Signatures.sign 16384 thrpt 15 1.395 ± 0.003 ops/ms
>
> KeyGenerators.keyPairGen thrpt 15 1.475 ± 0.043 ops/ms
>
>
> And here are the numbers with the patch applied:
>
> Benchmark (messageLength) Mode Cnt Score Error Units
> ECSignature.sign 64 thrpt 15 1.719 ± 0.010 ops/ms
> ECSignature.sign 512 thrpt 15 1.704 ± 0.012 ops/ms
> ECSignature.sign 2048 thrpt 15 1.699 ± 0.018 ops/ms
> ECSignature.sign 16384 thrpt 15 1.681 ± 0.006 ops/ms
>
> KeyGenerators.keyPairGen thrpt 15 1.881 ± 0.008 ops/ms
>
>
> Thanks,
> Xuelei
Thank you for the feedback, Daniel.
> Can you prove that this is correct?
That would depend on the implementation details. The JDK testing passed, although it does not means there is no problem. I will try if I could describe the implementation logic more clear tomorrow. The following is just a quick reply for what I can see now for the case you described.
> My back-of-the-envelope calculations suggest that this may overflow:
>
> * starting with 29 bit limbs
> * after 2 additions (maxAdds) we have up to 31 bits
> * then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
> * and we didn't start reducing yet.
For the multiply operation, I guess the result should have been carry reduced and fix in the limbs and sizes. In the generated IntegerPolynomialP256.java, the code looks like:
protected void mult(long[] a, long[] b, long[] r) {
... snipped ...
long c16 = (a[8] * b[8]);
carryReduce(r, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16);
}
carryReduce() is called, and the result 'r' is a set of limbs fit on the limbs size, I guess.
private void carryReduce(...) {
... snipped ...
//carry from position 7
t0 = (c7 + CARRY_ADD) >> 29;
c7 -= (t0 << 29);
c8 += t0;
... snipped ...
}
In the description of IntegerPolynomial.java, it is said that "Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException."
Did you see any code that use more than one addition before the multiplication? Or did you refer to something else with "multiply limbs" other than the description in the IntegerPolynomial class?
-------------
PR: https://git.openjdk.org/jdk/pull/10398
More information about the build-dev
mailing list