RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

Volodymyr Paprotski duke at openjdk.org
Tue Apr 2 16:10:55 UTC 2024


Performance, before:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  6443.934 ±  6.491  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3  6152.979 ±  4.954  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3  1895.410 ± 36.979  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3  1878.955 ± 45.487  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  1357.810 ± 26.584  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  1352.119 ± 23.547  ops/s
Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
PolynomialP256Bench.benchMultiply          false  thrpt    3  1746.126 ± 10.970  ops/s

Performance, no intrinsic:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  6529.839 ±  42.420  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3  6199.747 ± 133.566  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3  1973.676 ±  54.071  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3  1932.127 ±  35.920  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  1355.788 ± 29.858  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  1346.523 ± 28.722  ops/s
Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
PolynomialP256Bench.benchMultiply           true  thrpt    3  1919.574 ± 10.591  ops/s

Performance, **with intrinsics**:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  10384.591 ±  65.274  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   9592.912 ± 236.411  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   3479.494 ±  44.578  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   3402.147 ±  26.772  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  2527.678 ± 64.791  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  2541.258 ± 66.634  ops/s
Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
PolynomialP256Bench.benchMultiply           true  thrpt    3  3021.139 ± 98.289  ops/s
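For a quick comparison of the "before" and "with intrinsics" tables, the ratio of the two scores can be computed per benchmark. The snippet below is only a convenience sketch: the scores are copied from the tables above, while the class name `SpeedupSketch`, the helper `speedup()` and the row labels are made up for illustration and are not part of the PR.

```java
// Illustrative helper (not part of the PR): ratio of "with intrinsics" to "before" scores.
final class SpeedupSketch {
    // Throughput ratio; values above 1.0 mean the "after" run is faster.
    static double speedup(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Scores (ops/s) copied from the "before" and "with intrinsics" tables.
        String[] names = {
            "ECDSA.sign   (dataSize 1024)",
            "ECDSA.sign   (dataSize 16384)",
            "ECDSA.verify (dataSize 1024)",
            "ECDSA.verify (dataSize 16384)",
            "full.KeyAgreementBench  ECDH",
            "small.KeyAgreementBench ECDH",
            "PolynomialP256Bench.benchMultiply"
        };
        double[] before = {6443.934, 6152.979, 1895.410, 1878.955, 1357.810, 1352.119, 1746.126};
        double[] after  = {10384.591, 9592.912, 3479.494, 3402.147, 2527.678, 2541.258, 3021.139};

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-36s %.2fx%n", names[i], speedup(before[i], after[i]));
        }
    }
}
```

Note that the `benchMultiply` rows differ in `isMontBench` (false before, true after), so that ratio compares the old field against the new Montgomery field with the intrinsic.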


Summary of the design (see the code for 'ASCII art', references, and details on the math):
 - Added a new `IntegerPolynomial` field (`MontgomeryIntegerPolynomialP256`) with 52-bit limbs
   - `getElement(*)/fromMontgomery()` convert numbers into/out of the field
 - `ECOperations` is the primary user of the new field
   - flattened an overly deep nested class hierarchy (also in preparation for further field optimizations)
   - `forParameters()/multiply()/setSum()` generate numbers in the new field
 - `ProjectivePoint/Montgomery{Imm|M}utable.asAffine()` converts out of the new field
 - Added fuzz testing and KATs verified against OpenSSL
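
For readers less familiar with Montgomery arithmetic, here is a rough sketch of what "converting into/out of the field" means. It uses `BigInteger` rather than 52-bit limbs, is not constant-time, and is not the code in this PR. The class and method names (`MontgomerySketch`, `toMontgomery`, `redc`, `montMul`) are hypothetical, and the choice R = 2^260 (5 limbs x 52 bits) is an assumption made for illustration.

```java
import java.math.BigInteger;

// Illustrative only: textbook Montgomery multiplication modulo the P-256 prime.
final class MontgomerySketch {
    // P-256 field prime: 2^256 - 2^224 + 2^192 + 2^96 - 1
    static final BigInteger P = BigInteger.ONE.shiftLeft(256)
            .subtract(BigInteger.ONE.shiftLeft(224))
            .add(BigInteger.ONE.shiftLeft(192))
            .add(BigInteger.ONE.shiftLeft(96))
            .subtract(BigInteger.ONE);

    // Assumed Montgomery radix R = 2^260 (five 52-bit limbs).
    static final int R_BITS = 260;
    static final BigInteger R = BigInteger.ONE.shiftLeft(R_BITS);
    static final BigInteger R_MASK = R.subtract(BigInteger.ONE);

    // N' = -P^{-1} mod R, precomputed once.
    static final BigInteger N_PRIME = P.modInverse(R).negate().mod(R);

    // Into the Montgomery domain: a -> a * R mod P.
    static BigInteger toMontgomery(BigInteger a) {
        return a.shiftLeft(R_BITS).mod(P);
    }

    // Montgomery reduction: returns t * R^{-1} mod P for 0 <= t < R * P.
    static BigInteger redc(BigInteger t) {
        BigInteger m = t.and(R_MASK).multiply(N_PRIME).and(R_MASK);
        BigInteger u = t.add(m.multiply(P)).shiftRight(R_BITS); // exact division by R
        return u.compareTo(P) >= 0 ? u.subtract(P) : u;          // u < 2P, so one subtraction suffices
    }

    // Product of two Montgomery-form values, result also in Montgomery form.
    static BigInteger montMul(BigInteger aR, BigInteger bR) {
        return redc(aR.multiply(bR));
    }

    // Out of the Montgomery domain: aR -> a.
    static BigInteger fromMontgomery(BigInteger aR) {
        return redc(aR);
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("123456789abcdef0123456789abcdef", 16);
        BigInteger b = new BigInteger("fedcba9876543210", 16);
        BigInteger viaMont = fromMontgomery(montMul(toMontgomery(a), toMontgomery(b)));
        // Compare against plain modular multiplication.
        System.out.println(viaMont.equals(a.multiply(b).mod(P)));
    }
}
```

The point of the representation is that `redc` needs only multiplications, a mask and a shift, with no division by P, which is what makes the operation amenable to an intrinsic.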

-------------

Commit messages:
 - remove trailing whitespace
 - Remeasure performance
 - Fix rebase typo
 - Address comments from Anas and thorough cleanup
 - conditionalAssign intrinsic
 - rebase

Changes: https://git.openjdk.org/jdk/pull/18583/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18583&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8329538
  Stats: 2335 lines in 34 files changed: 2037 ins; 162 del; 136 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583

