RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v4]
Andrew Dinn
adinn at openjdk.org
Mon Jan 26 13:55:39 UTC 2026
On Mon, 26 Jan 2026 04:25:44 GMT, Ben Perez <bperez at openjdk.org> wrote:
>> This PR adds an aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and of `IntegerPolynomial.conditionalAssign()`. Since Neon does not support 64-bit multiplication, and manually performing this operation with 32-bit limbs is slower than using GPRs, a hybrid Neon/GPR approach is used. Neon instructions compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. This improves performance by ~9% at the method level and by roughly 5% at the API level.
>>
>> Performance without intrinsic:
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ± 3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>>
>>
>> Performance with intrinsic:
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  op...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
>
> Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid neon/gpr approach
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7933:
> 7931: __ ldr(b9, Address(bLimbs, 64));
> 7932: __ ldr(b10, Address(bLimbs, 72));
> 7933:
You could use the existing macro generator method `vs_ldpq` to plant these load instructions, e.g.:

    vs_ldpq(a_vec, aLimbs);
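
For context on the `conditionalAssign()` intrinsic named in the commit message quoted above, here is a minimal Java sketch of the usual branch-free, mask-based conditional copy between two limb arrays. It is an illustration only: the class name `ConditionalAssignSketch` is hypothetical, the signature is an assumption, and this is neither the JDK source nor the aarch64 stub under review.

```java
public class ConditionalAssignSketch {

    /**
     * Copies b into a when set == 1 and leaves a unchanged when set == 0,
     * without branching on set. Assumes set is 0 or 1 and that both arrays
     * have the same length (assumptions made for this sketch).
     */
    static void conditionalAssign(int set, long[] a, long[] b) {
        // mask is all zero bits for set == 0 and all one bits for set == 1
        long mask = -(long) set;
        for (int i = 0; i < a.length; i++) {
            // diff holds the bits that differ between a[i] and b[i],
            // or zero when the mask is clear
            long diff = mask & (a[i] ^ b[i]);
            // XORing the differing bits into a[i] turns it into b[i]
            a[i] ^= diff;
        }
    }

    public static void main(String[] args) {
        long[] a = {1L, 2L, 3L, 4L, 5L};
        long[] b = {10L, 20L, 30L, 40L, 50L};

        conditionalAssign(0, a, b);
        System.out.println(java.util.Arrays.toString(a));

        conditionalAssign(1, a, b);
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

Every limb is read and written regardless of `set`, so the sequence of memory accesses does not depend on the condition.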
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2727688196