RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v6]
Andrew Dinn
adinn at openjdk.org
Mon Feb 9 21:33:07 UTC 2026
On Thu, 5 Feb 2026 21:36:09 GMT, Ben Perez <bperez at openjdk.org> wrote:
>> This adds an aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and of `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon, and manually performing this operation with 32-bit limbs is slower than using GPRs, a hybrid Neon/GPR approach is used: Neon instructions compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. At the method level this improves performance by ~9%, and at the API level by roughly 5%.
>>
>> Performance no intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>>
>> Benchmark                    (algorithm)      (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ± 3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>>
>>
>> Performance with intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
>
> fixed indexing bug in vs_ldpq, simplified vector loads in generate_intpoly_assign()
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7964:
> 7962: __ BIND(L_Length14);
> 7963: {
> 7964: Register a10 = r5;
It might be nice if these general-purpose register operations could be condensed using, e.g., a template type RSeq<N> and rs_xxx methods, as has been done with the vector register operations. Even better would be to implement RSeq and VSeq as subtypes of a common template type Seq<N, R>, with R bound to Register or FloatRegister as a type parameter.
I'm not suggesting that for this PR, but we should look into it via a follow-up PR.
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7988:
> 7986: __ ld1(a_vec[0], __ T2D, aLimbs);
> 7987: __ ldpq(a_vec[1], a_vec[2], Address(aLimbs, 16));
> 7988: __ ldpq(a_vec[3], a_vec[4], Address(aLimbs, 48));
I notice that here and elsewhere you have a 5-vector sequence, and hence are not using the vs_ldpq/stpq operations (because they only operate on even-length sequences). However, if you add a bit of extra 'apparatus' to register.hpp you can use the vs_ldpq/stpq operations after all.
Your code processes the first register individually via ld1/st1 and then the remaining registers using a pair of loads, i.e. it operates on the latter as if it were a VSeq<4>. So, in register_aarch64.hpp you can add these functions:
template<int N>
FloatRegister vs_head(const VSeq<N>& v) {
  static_assert(N > 1, "sequence length must be greater than 1");
  return v.base();
}

template<int N>
VSeq<N - 1> vs_tail(const VSeq<N>& v) {
  static_assert(N > 1, "tail sequence length must be at least 1");
  return VSeq<N - 1>(v.base() + v.delta(), v.delta());
}

(Note that vs_tail is declared over VSeq<N> returning VSeq<N - 1>, rather than over VSeq<N + 1>, because N + 1 in a parameter type is a non-deduced context and vs_tail(a_vec) would otherwise fail to compile.)
With those methods available, you should be able to do all of these VSeq<5> loads and stores using an ld1/st1 followed by a vs_ldpq_indexed or vs_stpq_indexed with a suitable start index and the same constant offset array. For example, here you could use:
Suggestion:
int offsets[2] = { 0, 32 };
__ ld1(vs_head(a_vec), __ T2D, aLimbs);
vs_ldpq_indexed(vs_tail(a_vec), aLimbs, 16, offsets);
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782146418
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782125440
More information about the security-dev
mailing list