RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v8]
Andrew Haley
aph at openjdk.org
Tue Feb 17 23:04:46 UTC 2026
On Tue, 17 Feb 2026 21:30:11 GMT, Ben Perez <bperez at openjdk.org> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since Neon does not support 64-bit multiplication, and performing it manually with 32-bit limbs is slower than using GPRs, a hybrid Neon/GPR approach is used. Neon instructions compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. This improves performance by ~9% at the method level and roughly 5% at the API level.
>>
>> Performance without intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score     Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562  ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495  ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202  ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390  ± 33.594  ops/s
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881  ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614  ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737  ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297  ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score     Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369  ±  3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score     Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997  ±  4.092  ops/s
>>
>>
>> Performance with intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score     Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599  ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
>
> Added vs_tail method to simplify various VSeq operations, updated generate_intpoly_assign()
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4857:
> 4855: }
> 4856: }
> 4857:
Please refactor these. I'd try passing a pointer to a virtual member function, or perhaps using a macro.
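
As a minimal sketch of the pointer-to-member-function idea, and not the code under review: the `ToyAssembler` class, its `addv`/`subv` methods, and the `emit_seq` helper below are all hypothetical stand-ins invented for illustration. They only show how one helper taking a member-function pointer could replace several near-identical loops that differ in the instruction emitted.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an assembler class; NOT HotSpot's MacroAssembler.
class ToyAssembler {
public:
  virtual ~ToyAssembler() = default;

  // Each "instruction" just records its text so the effect is observable.
  virtual void addv(int dst, int a, int b) { emit("addv", dst, a, b); }
  virtual void subv(int dst, int a, int b) { emit("subv", dst, a, b); }

  std::vector<std::string> code;

protected:
  void emit(const char* name, int dst, int a, int b) {
    code.push_back(std::string(name) + " v" + std::to_string(dst) +
                   ", v" + std::to_string(a) + ", v" + std::to_string(b));
  }
};

// Pointer to a (possibly virtual) three-register member function.
using BinaryOp = void (ToyAssembler::*)(int, int, int);

// One helper instead of a family of copy-pasted loops: the caller picks the
// instruction, and the helper applies it to `count` consecutive registers.
// Calls made through `op` still go through virtual dispatch.
void emit_seq(ToyAssembler& masm, BinaryOp op,
              int dst, int a, int b, int count) {
  for (int i = 0; i < count; i++) {
    (masm.*op)(dst + i, a + i, b + i);
  }
}
```

With this sketch, `emit_seq(masm, &ToyAssembler::addv, 0, 4, 8, 2)` records `addv v0, v4, v8` followed by `addv v1, v5, v9`, and passing `&ToyAssembler::subv` reuses the same loop for a different instruction. A macro would achieve similar deduplication at the cost of type checking.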
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2819483994