RFR: 8350459: MontgomeryIntegerPolynomialP256 multiply intrinsic with AVX2 on x86_64

Thu Feb 27 19:26:58 UTC 2025

On Thu, 20 Feb 2025 21:49:42 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:

> Add AVX2 Montgomery multiplication intrinsic. (About 60-80% gain)
> 
> Also add reduction to existing AVX512 multiplication (this was left-over from https://github.com/openjdk/jdk/pull/19893 where a quick fix was required). This is mostly for cleanup, but there is about 1-2% gain.
> 
> Before (no AVX512)
> 
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   3720.589 ±  17.879  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   3605.940 ±  15.807  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   1076.502 ±   4.190  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   1069.624 ±   2.484  ops/s
> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40   830.448 ± 2.285  ops/s
> 
> After (with AVX2)
> 
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   6000.496 ±  39.923  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   5739.878 ±  34.838  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   1942.437 ±  12.179  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   1921.770 ±   8.992  ops/s
> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1399.761 ± 6.238  ops/s
> 
> 
> Before (with AVX512):
> 
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40    9621.950 ±  27.260  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40    8975.654 ±  26.707  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        102...

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 397:

> 395:         __ xorq(acc2, acc2);
> 396:         __ addq(acc1, tmp_rax);
> 397:         __ adcq(acc2, tmp_rdx);

Why adcq here instead of addq? The vector code doesn't do that.
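
For context on the question above, an addq followed by an adcq on a register pair amounts to a 128-bit addition, with the carry out of the low word feeding the high word. The sketch below models that in Java. It does not say whether a carry can actually occur at this point in the stub; the class and method names are hypothetical and are not taken from the PR.

```java
public class AddWithCarryDemo {
    // Adds the 128-bit value (hi:lo) into the 128-bit accumulator (acc[1]:acc[0]).
    // Models "addq accLo, lo; adcq accHi, hi". Hypothetical helper, for illustration only.
    static void add128(long[] acc, long lo, long hi) {
        long a = acc[0];
        long sum = a + lo;
        // Branch-free carry out of bit 63 of the unsigned addition a + lo.
        long carry = ((a & lo) | ((a | lo) & ~sum)) >>> 63;
        acc[0] = sum;
        // A plain addq on the high word would drop this carry term.
        acc[1] = acc[1] + hi + carry;
    }

    public static void main(String[] args) {
        long[] acc = {-1L, 0L};          // low word is all ones
        add128(acc, 1L, 2L);             // low word wraps and carries into the high word
        System.out.println(Long.toHexString(acc[1]) + ":" + Long.toHexString(acc[0]));
    }
}
```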

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 424:

> 422:       __ shrq(acc1, 52); // low 52 of acc1 ignored, is zero, because Montgomery
> 423: 
> 424:       // Acc2[0] += carry

This is more like shifting the carry into the lower bits of Acc2[0], so the comment could be updated.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 441:

> 439:   __ subq(acc2, modulus);
> 440:   __ vpsubq(Acc2, Acc1, Modulus, Assembler::AVX_256bit);
> 441:   __ vmovdqu(Address(rsp, -32), Acc2); //Assembler::AVX_256bit

Space needs to be created on the stack first, and only then should the temp be stored.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 465:

> 463: 
> 464:   // Now carry propagate the multiply result and (constant-time) select correct
> 465:   // output digit

In the Java code, carry propagation of the multiply result is done before subtracting the modulus.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 467:

> 465:   // output digit
> 466:   Register digit = acc1;
> 467:   __ vmovdqu(Address(rsp, -64), Acc1); //Assembler::AVX_256bit

Space needs to be created on the stack first, and only then should the store be done.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 475:

> 473:     }
> 474:     __ movq(carry, digit);
> 475:     __ sarq(carry, 52);

This was an unsigned (logical) shift in the Java code.
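
To illustrate the distinction: sarq corresponds to Java's `>>` and shrq to `>>>`. The two agree when the top bit of the digit is clear and differ when it is set. A minimal sketch follows, assuming 52-bit limbs as in the quoted code; the class and method names are hypothetical.

```java
public class CarryShiftDemo {
    static final int BITS_PER_LIMB = 52; // assumed from the shift count in the quoted stub

    // Arithmetic shift (sarq, Java >>): copies the sign bit into the vacated high bits.
    static long arithmeticCarry(long digit) {
        return digit >> BITS_PER_LIMB;
    }

    // Logical shift (shrq, Java >>>): fills the vacated high bits with zeros.
    static long logicalCarry(long digit) {
        return digit >>> BITS_PER_LIMB;
    }

    public static void main(String[] args) {
        long topBitClear = 5L << BITS_PER_LIMB;
        long topBitSet = -1L;
        // Same result while the top bit is clear.
        System.out.println(arithmeticCarry(topBitClear) + " " + logicalCarry(topBitClear));
        // Different results once the top bit is set.
        System.out.println(arithmeticCarry(topBitSet) + " " + logicalCarry(topBitSet));
    }
}
```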

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 556:

> 554: //  - constant time (i.e. no branches)
> 555: //  - no-side channel (i.e. all memory must always be accessed, and in same order)
> 556: void assign_avx(Register aBase, Register bBase, int offset, XMMRegister select, XMMRegister tmp, XMMRegister aTmp, int vector_len, MacroAssembler* _masm) {

It would be good to add the comment from assign_scalar here as well:
// Original Java:
// long dummyLimbs = maskValue & (a[i] ^ b[i]);
// a[i] = dummyLimbs ^ a[i];
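
For reference, the two quoted lines are the body of a branch-free conditional assignment. Below is a self-contained sketch of how they might sit in a loop. The class name, the method name, the `set` parameter, and the derivation of `maskValue` from it are assumptions for illustration; only the two statements inside the loop come from the comment above.

```java
public class ConditionalAssignDemo {
    // Copies b into a when set == 1 and leaves a unchanged when set == 0.
    // Every limb of both arrays is read and every limb of a is written in
    // both cases, and there is no branch on the value of set.
    static void conditionalAssign(int set, long[] a, long[] b) {
        long maskValue = -(long) set; // assumed: all ones for set == 1, zero for set == 0
        for (int i = 0; i < a.length; i++) {
            long dummyLimbs = maskValue & (a[i] ^ b[i]);
            a[i] = dummyLimbs ^ a[i];
        }
    }

    public static void main(String[] args) {
        long[] a = {1L, 2L, 3L};
        long[] b = {7L, 8L, 9L};
        conditionalAssign(0, a, b);
        System.out.println(java.util.Arrays.toString(a)); // a is unchanged
        conditionalAssign(1, a, b);
        System.out.println(java.util.Arrays.toString(a)); // a now holds b's limbs
    }
}
```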

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974184239
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974171188
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974187392
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974203227
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974206111
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974205184
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1972517671