RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v4]

Mon Jan 26 13:42:28 UTC 2026

On Mon, 26 Jan 2026 04:25:44 GMT, Ben Perez <bperez at openjdk.org> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon and manually performing this operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr approach is used. Neon instructions are used to compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. At the method level this improves performance by ~9% and at the API level roughly 5%. 
>> 
>> Performance no intrinsic:
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with intrinsic
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  op...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid neon/gpr approach

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7282:

> 7280:     __ mov(mul_ptr, sp);
> 7281: 
> 7282:     __ umullv(A[0], __ T2D, b_lows, __ T2S, a_vals, __ S, 0);

I notice that this 4 line insn pattern recurs several times. You might consider defining a macro generator method to generate these sequences i.e.

vs_umullv(VSeq<4> vs, FloatRegister bs, FloatRegister as, int lane_lo) {
    __ umullv(vs[0], __ T2D, bs, __ T2S, as, __S, lane_lo);
    __ umull2v(vs[1], __ T2D, bs, __ T4S. as, __S, lane_lo);
    __ umullv(vs[2], __ T2D, bs, __T2S, as, __S, lane_lo + 2);
    __ umull2v(vs[3], __ T2D, bs, __T4S, as, __S, lane_lo + 2);
}

So, in this case you would call

  vs_umullv(A, b_lows, a_vals, 0);

and in other cases you can simply vary which arguments you provide for `vs`, `as`, `bs` and `lane_lo`.

I'm suggesting that because the various cross multiplies generated by this macro-method appear to implement a specific subsequence of the mont mul computation (similar to what is done in e.g. the kyber code). So, this generator method abstraction should also abstract that a clear step in any pseudo-code algorithm you provide and ought therefore to make clear that (also make clear /how/) the generated code implements that algorithm. If you can provide such a pseudo-code algorithm then we might even be able come up with a better name for the generator method (e.g. montmul_uproduct or something else that reflects what is happening here).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2727648074