RFR: 8294248: Use less limbs for P256 in EC implementation
Xue-Lei Andrew Fan
xuelei at openjdk.org
Sun Sep 25 05:55:10 UTC 2022
On Sat, 24 Sep 2022 08:17:56 GMT, Xue-Lei Andrew Fan <xuelei at openjdk.org> wrote:
>> Hi,
>>
>> Please review this performance improvement for the Secp256R1 implementation in OpenJDK. With this update, there is roughly a 20% performance improvement for Secp256R1 key generation and signing.
>>
>> Basically, 256-bit EC curves can use 9 integer limbs for the computation. The current implementation uses 10 limbs instead. By reducing the number of limbs, the implementation benefits from fewer integer operations (add/sub/multiply/square/inverse/mod/pow, etc.), and thus the performance improves.
>>
>> Here are the benchmark numbers without the patch:
>>
>> Benchmark                 (messageLength)   Mode  Cnt  Score   Error  Units
>> Signatures.sign                        64  thrpt   15  1.414 ± 0.022  ops/ms
>> Signatures.sign                       512  thrpt   15  1.418 ± 0.004  ops/ms
>> Signatures.sign                      2048  thrpt   15  1.419 ± 0.005  ops/ms
>> Signatures.sign                     16384  thrpt   15  1.395 ± 0.003  ops/ms
>>
>> KeyGenerators.keyPairGen                   thrpt   15  1.475 ± 0.043  ops/ms
>>
>>
>> And here are the numbers with the patch applied:
>>
>> Benchmark                 (messageLength)   Mode  Cnt  Score   Error  Units
>> ECSignature.sign                       64  thrpt   15  1.719 ± 0.010  ops/ms
>> ECSignature.sign                      512  thrpt   15  1.704 ± 0.012  ops/ms
>> ECSignature.sign                     2048  thrpt   15  1.699 ± 0.018  ops/ms
>> ECSignature.sign                    16384  thrpt   15  1.681 ± 0.006  ops/ms
>>
>> KeyGenerators.keyPairGen                   thrpt   15  1.881 ± 0.008  ops/ms
>>
>>
>> Thanks,
>> Xuelei
>
> I checked with the actual byte values for the following computation. For Secp256R1, p is defined as:
>
> p = FFFFFFFF 00000001 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
>   = 2^224 (2^32 − 1) + 2^192 + 2^96 − 1
>   = 2^256 − 2^224 + 2^192 + 2^96 − 1
>
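> As a quick sanity check (my own sketch, not part of the patch; runnable in jshell), the identity can be verified with BigInteger:
>
> import java.math.BigInteger;
>
> BigInteger p = new BigInteger(
>         "FFFFFFFF" + "00000001" + "00000000" + "00000000"
>       + "00000000" + "FFFFFFFF" + "FFFFFFFF" + "FFFFFFFF", 16);
> BigInteger expected = BigInteger.ONE.shiftLeft(256)
>         .subtract(BigInteger.ONE.shiftLeft(224))
>         .add(BigInteger.ONE.shiftLeft(192))
>         .add(BigInteger.ONE.shiftLeft(96))
>         .subtract(BigInteger.ONE);
> System.out.println(p.equals(expected));   // prints: true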
>
> Let's use a full 256-bit integer, which is bigger than p, for the following computation:
>
>
> FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
>
>> starting with 29 bit limbs
>
> With 29-bit limbs, it can be expressed as nine limbs in little-endian order, as follows:
>
>
> a[] = 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 00FFFFFF
>
>
> From a[0] to a[7], each limb holds 29 bits, and a[8] holds 24 bits: 29 * 8 + 24 = 256.
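>
> As an illustration only (my own sketch, not the FieldGen-generated code), splitting a java.math.BigInteger value into nine 29-bit little-endian limbs could look like:
>
> static long[] toLimbs(BigInteger v) {
>     long[] limbs = new long[9];
>     BigInteger mask = BigInteger.ONE.shiftLeft(29).subtract(BigInteger.ONE);
>     for (int i = 0; i < 9; i++) {
>         // limb i holds bits [29*i, 29*i + 29) of v;
>         // the top limb gets whatever remains (24 bits for a 256-bit value)
>         limbs[i] = v.shiftRight(29 * i).and(mask).longValueExact();
>     }
>     return limbs;
> }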
>
>> after 2 additions (maxAdds) we have up to 31 bits
>
> After 2 additions, the nine limbs become:
>
> a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF
>
> From a[0] to a[7], each limb holds 31 bits, and a[8] holds 26 bits. This is the biggest number we could get from the finite field computation.
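>
> For reference, limb-wise addition simply adds corresponding limbs without propagating carries, which is why the limbs grow past 29 bits. A sketch of the pattern (not the actual generated code):
>
> static void add(long[] a, long[] b, long[] sum) {
>     for (int i = 0; i < 9; i++) {
>         // no carry propagation; each addition can grow a limb by about one bit
>         sum[i] = a[i] + b[i];
>     }
> }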
>
>> then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
>> and we didn't start reducing yet.
>
> Let's multiply two such 9-limb values:
>
> a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF
> b[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF
>
>
> The overflow we care about is in the following computation, which has the longest chain of additions:
>
> long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] * b[0]);
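>
> For context, c8 is one column of the schoolbook product of the two limb arrays; a simplified sketch of that pattern (not the unrolled generated code), with a and b as long[9]:
>
> long[] c = new long[2 * 9 - 1];
> for (int i = 0; i < 9; i++) {
>     for (int j = 0; j < 9; j++) {
>         // column k collects every a[i] * b[j] with i + j == k
>         c[i + j] += a[i] * b[j];
>     }
> }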
>
> Now a[0] to a[7] and b[0] to b[7] are 31-bit words, while a[8] and b[8] are 26-bit words.
>
> Then, (a[0] * b[8]) and (a[8] * b[0]) each multiply a 31-bit word by a 26-bit word. So, for the a and b values above, the computation is equivalent to:
>
> a[1]=a[2]=a[3]=a[4]=a[5]=a[6]=a[7]=0x7FFFFFFF;
> b[1]=b[2]=b[3]=b[4]=b[5]=b[6]=b[7]=0x7FFFFFFF;
> long i1x7 = (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1])
>           = 0xBFFFFFF900000007L
> long i0x8 = (a[0] * b[8]) + (a[8] * b[0])
>           = 0x03FFFFFEF8000002L
>
> long c8 = i1x7 + i0x8
>         = 0xC3FFFFF7F8000009L
>
>
> c8 uses the full 64 bits, but it does not overflow.
>
> @djelinski does it sound right to you?
>
> Although there is no overflow, if I get the above computation right, it is still not something we can know without a detailed computation. Instead, this kind of computation and checking could be performed automatically by the FieldGen code, or we could simply reduce the result of limb operations (to no more than twice the modulus) and remove the maxAdds limit.
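>
> As a rough illustration of such an automated check (my own sketch, using a conservative bound; the real analysis would have to account for the exact width of each limb in a column):
>
> static boolean columnFitsSignedLong(int bitsPerLimb, int numLimbs, int maxAdds) {
>     // worst-case limb value after maxAdds limb-wise additions of reduced values
>     BigInteger limbMax = BigInteger.ONE.shiftLeft(bitsPerLimb)
>             .subtract(BigInteger.ONE)
>             .multiply(BigInteger.valueOf(maxAdds + 1));
>     // a multiplication column sums at most numLimbs products of two such limbs
>     BigInteger columnMax = limbMax.multiply(limbMax)
>             .multiply(BigInteger.valueOf(numLimbs));
>     return columnMax.bitLength() <= 63;   // must fit in a signed 64-bit long
> }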
>
> This RFE, JDK-8294248, is one of a few performance improvements to the EC implementation in the JDK ([JDK-8294188](https://bugs.openjdk.org/browse/JDK-8294188)). I would like to address the checking and/or reducing improvement in a separate RFE.
> @XueleiFan as far as I can tell, we cannot assume that the starting number will be 256 bit; the `setReduced` operation performs a limited reduce that only ensures that limbs fit in bitsPerLimb; the carry from the last limb is only performed in `finalCarryReduceLast`.
Thank you @djelinski, I missed this point.
I prototyped an approach that allows only one addition and reduces when necessary, and got the following benchmark numbers:
Benchmark                 (messageLength)   Mode  Cnt  Score            Error  Units
Signatures.sign                        64  thrpt   15  1.629 ±          0.008  ops/ms
Signatures.sign                       512  thrpt   15  1.628 ±          0.007  ops/ms
Signatures.sign                      2048  thrpt   15  1.625 ±          0.006  ops/ms
Signatures.sign                     16384  thrpt   15  1.588 ±          0.014  ops/ms
ECKeyGen.keyPairGen                        thrpt   15  1.779 ± (99.9%)  0.007  ops/ms
The improvement still looks positive to me (a 15%-20% improvement). It implies that reducing or semi-reducing the result of limb operations may be more performance friendly than relying on the maxAdds limit. However, as that update would impact more curves in the JDK, I would think twice about whether it is the right place to start.
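A minimal sketch of the kind of carry/reduce pass meant above (hypothetical helper, not the prototype code; a real implementation would still have to fold the final carry back modulo p):
```
static void carryPass(long[] limbs, int bitsPerLimb) {
    long mask = (1L << bitsPerLimb) - 1;
    long carry = 0;
    for (int i = 0; i < limbs.length; i++) {
        long v = limbs[i] + carry;
        limbs[i] = v & mask;        // keep the low bitsPerLimb bits in this limb
        carry = v >> bitsPerLimb;   // push the remainder into the next limb
    }
    // the carry out of the top limb is dropped here; folding it back mod p is omitted
}
```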
-------------
PR: https://git.openjdk.org/jdk/pull/10398