RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]
Ferenc Rakoczi
duke at openjdk.org
Thu Feb 5 13:55:56 UTC 2026
On Wed, 4 Feb 2026 20:52:15 GMT, Ben Perez <bperez at openjdk.org> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon and manually performing this operation with 32-bit limbs is slower than with GPRs, a hybrid Neon/GPR approach is used. Neon instructions compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. This improves performance by ~9% at the method level and by roughly 5% at the API level.
>>
>> Performance without intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ±  4.092  ops/s
>>
>>
>> Performance with intrinsic (Apple M1):
>>
>> Benchmark (isMontBench) Mode Cnt Score Error Units
>> PolynomialP256Bench.benchMultiply true thrpt 8 2676.599 ± 24.722 ops/s
>> PolynomialP256Bench.benchMultiply false thrpt 8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
>
> Created subroutine for 32 bit vector multiplication
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3181:
> 3179: void umullv(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3180: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3181: assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
umullv -> umull{2}v in the assertion message (or consider moving the assertions into the calling function)
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3182:
> 3180: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3181: assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
> 3182: assert(Ta == T4S ? (Tb == T4H && Ts == H) : (Tb == T2S && Ts == S), "umullv register arrangements must adhere to spec");
umullv -> umull{2}v in the assertion message (or consider moving the assertions into the calling function)
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3188:
> 3186: void umull2v(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3187: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3188: assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
umullv -> umull2v in the assertion
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3189:
> 3187: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3188: assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
> 3189: assert(Ta == T4S ? (Tb == T8H && Ts == H) : (Tb == T4S && Ts == S), "umullv register arrangements must adhere to spec");
umullv -> umull2v in the assertion
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7193:
> 7191:
> 7192: // Multiply each 32-bit value in bs by the 32-bit values in as[lane_lo] and as[lane_lo + 2]
> 7193: // and store in vs.
I think you could be more specific about what happens here: we compute partial results of
some 52 x 52 bit multiplications where the multiplicands are stored as 64-bit values.
This function computes partial results of 8 such multiplications, (b_0, b_1, b_2, b_3) * (a_3, a_4).
In each call, either the high or the low 32 bits of the b_i values are multiplied by either the high or the low 32 bits of the a_j values, so four calls with the appropriate parameters produce the 64-bit low32 * low32, low32 * high32, high32 * low32 and high32 * high32 values in the output register sequences.
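To make the four partial products concrete, here is a minimal scalar sketch in Java of how they recombine into a full 104-bit product. It only illustrates the arithmetic and is not the stub code: the class name `PartialProductSketch` and the method `mul52` are made up for this example, and the real stub keeps the partial products in vector registers and accumulates them differently.

```java
// Hypothetical illustration of the arithmetic only; not the HotSpot stub.
final class PartialProductSketch {
    private static final long MASK32 = 0xFFFF_FFFFL;

    /**
     * Multiplies two non-negative values below 2^52 using only
     * 32 x 32 -> 64-bit products. Returns {low 64 bits, high 64 bits}.
     */
    static long[] mul52(long a, long b) {
        long aLo = a & MASK32, aHi = a >>> 32;
        long bLo = b & MASK32, bHi = b >>> 32;

        // The four partial products named in the comment above.
        long ll = aLo * bLo;   // low32  * low32
        long lh = aLo * bHi;   // low32  * high32
        long hl = aHi * bLo;   // high32 * low32
        long hh = aHi * bHi;   // high32 * high32

        // a * b = ll + ((lh + hl) << 32) + (hh << 64), with carries propagated.
        long mid = (ll >>> 32) + (lh & MASK32) + (hl & MASK32);
        long lo = (ll & MASK32) | (mid << 32);
        long hi = hh + (lh >>> 32) + (hl >>> 32) + (mid >>> 32);
        return new long[] { lo, hi };
    }

    public static void main(String[] args) {
        long a = (1L << 52) - 1;
        long b = (1L << 52) - 3;
        long[] r = mul52(a, b);
        // Cross-check against the JDK's own 64 x 64 -> 128-bit helpers.
        if (r[0] != a * b || r[1] != Math.multiplyHigh(a, b)) {
            throw new AssertionError("partial products recombined incorrectly");
        }
        System.out.println("ok");
    }
}
```

A comment along these lines in the stub would make it clear which of the four products each call produces.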
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7820:
> 7818: // IntegerPolynomialP521: 19 = 8 + 8 + 2 + 1
> 7819: // P521OrderField: 19 = 8 + 8 + 2 + 1
> 7820: // Special Cases 5, 10, 14, 16, 19
Add a comment in the Java code that the intrinsic can only be used for these lengths. I would also change the Java code to use an intermediate method that has an assert checking the allowed lengths and calls the @IntrinsicCandidate conditionalAssign() method (this is an easy change since there is only one caller in the current JVM code).
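Roughly what I have in mind, as a sketch only: the names `ConditionalAssignSketch`, `isIntrinsicLength` and `conditionalAssignChecked` are placeholders, and the body of `conditionalAssign` below is a plain Java stand-in written for this example rather than a copy of the JDK method. The `@IntrinsicCandidate` annotation is left out so the snippet compiles outside the JDK.

```java
// Hypothetical sketch; only the name conditionalAssign comes from the PR.
final class ConditionalAssignSketch {

    /** Limb counts the aarch64 stub special-cases: 5, 10, 14, 16, 19. */
    static boolean isIntrinsicLength(int n) {
        return n == 5 || n == 10 || n == 14 || n == 16 || n == 19;
    }

    /** Checked entry point: validates the lengths, then delegates. */
    static void conditionalAssignChecked(int set, long[] a, long[] b) {
        assert a.length == b.length && isIntrinsicLength(a.length)
            : "unsupported limb count: " + a.length;
        conditionalAssign(set, a, b);
    }

    // Would carry @IntrinsicCandidate in the JDK.
    // The intrinsic can only be used for the lengths accepted above.
    static void conditionalAssign(int set, long[] a, long[] b) {
        long mask = -(long) set;             // set is 0 or 1: mask is 0 or all ones
        for (int i = 0; i < a.length; i++) {
            a[i] ^= mask & (a[i] ^ b[i]);    // branch-free select of b[i] when set == 1
        }
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4, 5};
        long[] b = {9, 8, 7, 6, 5};
        conditionalAssignChecked(1, a, b);
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```

The single existing caller would then go through the checked method, so an unsupported length fails fast in debug runs instead of reaching the stub.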
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7849:
> 7847: __ dup(mask_vec, __ T2D, mask_scalar);
> 7848:
> 7849: __ push(r19, sp); //needed for length = 5
If it is only needed for length == 5, just save and restore on that branch.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768475020
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474771
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768454954
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474590
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2769260685
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768711654
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768730689