RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v6]
Andrew Dinn
adinn at openjdk.org
Mon Feb 9 21:33:07 UTC 2026
On Thu, 5 Feb 2026 21:36:09 GMT, Ben Perez <bperez at openjdk.org> wrote:
>> This adds an aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and of `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon, and manually performing this operation with 32-bit limbs is slower than using GPRs, a hybrid Neon/GPR approach is used: Neon instructions compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. At the method level this improves performance by ~9%, and at the API level by roughly 5%.
>>
>> Performance no intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>>
>> Benchmark                    (algorithm)      (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ± 3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>>
>>
>> Performance with intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
>
> fixed indexing bug in vs_ldpq, simplified vector loads in generate_intpoly_assign()
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7964:
> 7962: __ BIND(L_Length14);
> 7963: {
> 7964: Register a10 = r5;
It might be nice if these general-purpose register operations could be condensed using, e.g., a template type RSeq<N> and rs_xxx methods, as has been done with the vector register operations. Even better would be to implement RSeq and VSeq as subtypes of a common template type Seq<N, R>, with R bound to Register or FloatRegister as a type parameter.
I'm not suggesting that for this PR, but we should look into it via a follow-up PR.
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7988:
> 7986: __ ld1(a_vec[0], __ T2D, aLimbs);
> 7987: __ ldpq(a_vec[1], a_vec[2], Address(aLimbs, 16));
> 7988: __ ldpq(a_vec[3], a_vec[4], Address(aLimbs, 48));
I notice that here and elsewhere you have a 5-vector sequence, and hence are not using the vs_ldpq/stpq operations (because they only operate on even-length sequences). However, if you add a bit of extra 'apparatus' to register.hpp you can use the vs_ldpq/stpq operations after all.
Your code processes the first register individually via ld1/st1 and then the remaining registers using a pair of loads, i.e. it operates on the latter as if it were a VSeq<4>. So, in register_aarch64.hpp you can add these functions:
template<int N>
FloatRegister vs_head(const VSeq<N>& v) {
  static_assert(N > 1, "sequence length must be greater than 1");
  return v.base();
}

template<int N>
VSeq<N - 1> vs_tail(const VSeq<N>& v) {
  static_assert(N > 1, "tail sequence length must be at least 1");
  return VSeq<N - 1>(v.base() + v.delta(), v.delta());
}

(Note that vs_tail is declared over VSeq<N> returning VSeq<N - 1>, rather than over VSeq<N + 1>, because N + 1 in a parameter type is a non-deduced context and vs_tail(a_vec) would otherwise fail to compile.)
With those methods available, you should be able to do all of these VSeq<5> loads and stores using an ld1/st1 followed by a vs_ldpq_indexed or vs_stpq_indexed with a suitable start index and the same constant offset array. For example, here you could use:
Suggestion:
int offsets[2] = { 0, 32 };
__ ld1(vs_head(a_vec), __ T2D, aLimbs);
vs_ldpq_indexed(vs_tail(a_vec), aLimbs, 16, offsets);
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782146418
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782125440
More information about the security-dev
mailing list