RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Ben Perez bperez at openjdk.org
Wed Feb 4 20:52:15 UTC 2026


> This change adds aarch64 implementations of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon, and manually performing this operation with 32-bit limbs is slower than with GPRs, a hybrid Neon/GPR approach is used. Neon instructions compute the intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. This improves performance by ~9% at the method level and by roughly 5% at the API level.
> 
> Performance without intrinsic (Apple M1):
> 
> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
> 
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
> 
> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
> 
> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
> 
> 
> Performance with intrinsic (Apple M1):
> 
> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  ops/s
> PolynomialP256Bench.benchSqua...

Ben Perez has updated the pull request incrementally with one additional commit since the last revision:

  Created subroutine for 32 bit vector multiplication
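
For readers unfamiliar with the limb technique the description refers to, here is a minimal sketch in plain Java of how a 64x64-bit unsigned multiplication can be assembled from four 32x32-bit partial products. This is an illustration only and is not taken from the patch: the class name `Mul64Sketch` and the method `mul64x64` are hypothetical, and the real change is written with the aarch64 assembler, not Java.

```java
// Illustrative sketch only; names are hypothetical and not part of the PR.
public final class Mul64Sketch {
    private static final long MASK32 = 0xFFFFFFFFL;

    /** Returns {high, low}: the two 64-bit halves of the unsigned 128-bit product a * b. */
    public static long[] mul64x64(long a, long b) {
        // Split each operand into two unsigned 32-bit limbs.
        long a0 = a & MASK32, a1 = a >>> 32;
        long b0 = b & MASK32, b1 = b >>> 32;

        // Four 32x32 -> 64-bit partial products. Each fits in 64 bits,
        // so Java's wrapping multiply yields the exact bit pattern.
        long p00 = a0 * b0;
        long p01 = a0 * b1;
        long p10 = a1 * b0;
        long p11 = a1 * b1;

        // Middle column: carry out of p00 plus the low halves of the cross
        // products. The sum is below 3 * 2^32, so it cannot overflow.
        long mid = (p00 >>> 32) + (p01 & MASK32) + (p10 & MASK32);

        long low = (p00 & MASK32) | (mid << 32);
        long high = p11 + (p01 >>> 32) + (p10 >>> 32) + (mid >>> 32);
        return new long[] { high, low };
    }
}
```

The sketch needs four multiplications plus several shifts, masks, and additions to produce one 128-bit product, which gives a rough sense of why the limb-based route can cost more than a multiply on general-purpose registers.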

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/27946/files
  - new: https://git.openjdk.org/jdk/pull/27946/files/3cadf6cf..673f6518

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27946&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27946&range=03-04

  Stats: 27 lines in 1 file changed: 11 ins; 12 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/27946.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27946/head:pull/27946

PR: https://git.openjdk.org/jdk/pull/27946

