RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v4]

Mon Jan 26 13:55:39 UTC 2026

On Mon, 26 Jan 2026 04:25:44 GMT, Ben Perez <bperez at openjdk.org> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon and manually performing this operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr approach is used. Neon instructions are used to compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. At the method level this improves performance by ~9% and at the API level roughly 5%. 
>> 
>> Performance no intrinsic:
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with intrinsic
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  op...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid neon/gpr approach

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7933:

> 7931:       __ ldr(b9, Address(bLimbs, 64));
> 7932:       __ ldr(b10, Address(bLimbs, 72));
> 7933: 

You could use the existing macro generator method `vs_ldpq` to plant these load instructions

    vs_ldpq(a_vec, aLimbs);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2727688196