RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v8]

Tue Feb 17 23:04:46 UTC 2026

On Tue, 17 Feb 2026 21:30:11 GMT, Ben Perez <bperez at openjdk.org> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon and manually performing this operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr approach is used. Neon instructions are used to compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. At the method level this improves performance by ~9% and at the API level roughly 5%. 
>> 
>> Performance no intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Added vs_tail method to simplify various VSeq operations, updated generate_intpoly_assign()

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4857:

> 4855:     }
> 4856:   }
> 4857: 

Please refactor these. I'd try passing a pointer to virtual member function or perhaps a macro.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2819483994