RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Thu Feb 5 13:55:56 UTC 2026

On Wed, 4 Feb 2026 20:52:15 GMT, Ben Perez <bperez at openjdk.org> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit multiplication is not supported on Neon and manually performing this operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr approach is used. Neon instructions are used to compute intermediate values used in the last two iterations of the main "loop", while the GPRs compute the first few iterations. At the method level this improves performance by ~9% and at the API level roughly 5%. 
>> 
>> Performance no intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Created subroutine for 32 bit vector multiplication

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3181:

> 3179:   void umullv(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3180:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3181:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");

umullv -> umull{2}v in the assertion message (or consider moving the assertions into the calling function)

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3182:

> 3180:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3181:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
> 3182:     assert(Ta == T4S ? (Tb == T4H && Ts == H) : (Tb == T2S && Ts == S), "umullv register arrangements must adhere to spec");

umullv -> umull{2}v in the assertion message (or consider moving the assertions into the calling function)

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3188:

> 3186:   void umull2v(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3187:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3188:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");

umullv -> umull2v in the assertion

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3189:

> 3187:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3188:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
> 3189:     assert(Ta == T4S ? (Tb == T8H && Ts == H) : (Tb == T4S && Ts == S), "umullv register arrangements must adhere to spec");

umullv -> umull2v in the assertion

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7193:

> 7191: 
> 7192:   // Multiply each 32-bit value in bs by the 32-bit values in as[lane_lo] and as[lane_lo + 2]
> 7193:   // and store in vs.

I think you could be a bit more specific in explaining what happens here: we compute the partial results of
some 52 x 52 bit multiplications where the multiplicands are stored as 64-bit values. 
This function computes partial results of 8 such multiplication (b_0, b_1, b_2, b_3) * (a_3, a_4).
In a call of this function, either the high or low 32 bits of the b_i values are multiplied by either the high or low 32 bits of the b_j values, so four calls with the appropriate parameters will produce the 64-bit  low32 * low32, low32 * high32, high32 * low 32 and high32 * high32 values in the output register sequences.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7820:

> 7818:     //   IntegerPolynomialP521:  19 = 8 + 8 + 2 + 1
> 7819:     //   P521OrderField:         19 = 8 + 8 + 2 + 1
> 7820:     // Special Cases 5, 10, 14, 16, 19

Add a comment in the Java code that the intrinsic can only be used for these lengths. I would also change the Java code to use an intermediate method that has an assert checking the allowed lengths and calls the @IntrinsicCandidate conditionalAssign() method (this is an easy change since there is only one caller in the current JVM code).

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7849:

> 7847:     __ dup(mask_vec, __ T2D, mask_scalar);
> 7848: 
> 7849:     __ push(r19, sp); //needed for length = 5

If it is only needed for length == 5, just save and restore on that branch.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768475020
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474771
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768454954
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474590
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2769260685
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768711654
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768730689