RFR: 8350459: MontgomeryIntegerPolynomialP256 multiply intrinsic with AVX2 on x86_64
Sandhya Viswanathan
sviswanathan at openjdk.org
Thu Feb 27 19:26:58 UTC 2025
On Thu, 20 Feb 2025 21:49:42 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:
> Add AVX2 Montgomery multiplication intrinsic. (About 60-80% gain)
>
> Also add reduction to existing AVX512 multiplication (this was left-over from https://github.com/openjdk/jdk/pull/19893 where a quick fix was required). This is mostly for cleanup, but there is about 1-2% gain.
>
> Before (no AVX512)
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  3720.589  ± 17.879  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  3605.940  ± 15.807  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  1076.502  ±  4.190  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  1069.624  ±  2.484  ops/s
>
> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score     Error  Units
> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40   830.448  ±  2.285  ops/s
>
> After (with AVX2)
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  6000.496  ± 39.923  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  5739.878  ± 34.838  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  1942.437  ± 12.179  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  1921.770  ±  8.992  ops/s
>
> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score     Error  Units
> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1399.761  ±  6.238  ops/s
>
>
> Before (with AVX512):
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  9621.950  ± 27.260  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  8975.654  ± 26.707  ops/s
> SignatureBench.ECDSA.verify SHA256withECDSA 102...
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 397:
> 395: __ xorq(acc2, acc2);
> 396: __ addq(acc1, tmp_rax);
> 397: __ adcq(acc2, tmp_rdx);
Why adcq here instead of addq? The vector code doesn't do that.
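For readers comparing the two instructions: addq adds two 64-bit values and sets the carry flag on unsigned overflow, while adcq additionally adds the incoming carry flag. The sketch below models the quoted xorq/addq/adcq sequence in Java. The class and method names are hypothetical, and it only illustrates what the scalar sequence computes; it does not settle whether the carry is needed at this point in the stub.

```java
// Hypothetical model of the quoted xorq/addq/adcq sequence; not HotSpot code.
final class AddWithCarrySketch {
    // Returns { acc1 + lo (mod 2^64), hi + carryOut }, i.e. what
    //   xorq acc2, acc2; addq acc1, lo; adcq acc2, hi
    // would leave in acc1 and acc2.
    static long[] addThenAdc(long acc1, long lo, long hi) {
        long sum = acc1 + lo;
        // Unsigned overflow happened iff the wrapped sum is below an operand.
        long carry = Long.compareUnsigned(sum, acc1) < 0 ? 1L : 0L;
        return new long[] { sum, hi + carry };
    }

    public static void main(String[] args) {
        long a = (1L << 52) - 1;             // largest 52-bit limb value
        long lo = a * a;                     // low 64 bits of the product
        long hi = Math.multiplyHigh(a, a);   // high 64 bits (operands are non-negative)
        long[] r = addThenAdc(1L << 53, lo, hi);
        System.out.println(Long.toHexString(r[0]) + " " + Long.toHexString(r[1]));
    }
}
```

With a plain add in place of the adc, the second element would miss the carry out of the low word whenever the low-word addition wraps.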
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 424:
> 422: __ shrq(acc1, 52); // low 52 of acc1 ignored, is zero, because Montgomery
> 423:
> 424: // Acc2[0] += carry
This is more like shifting the carry into the lower bits of Acc2[0], so the comment could be updated.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 441:
> 439: __ subq(acc2, modulus);
> 440: __ vpsubq(Acc2, Acc1, Modulus, Assembler::AVX_256bit);
> 441: __ vmovdqu(Address(rsp, -32), Acc2); //Assembler::AVX_256bit
Need to first create space on the stack and then store the temp.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 465:
> 463:
> 464: // Now carry propagate the multiply result and (constant-time) select correct
> 465: // output digit
In the Java code, carry propagation of the multiply result is done before subtracting the modulus.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 467:
> 465: // output digit
> 466: Register digit = acc1;
> 467: __ vmovdqu(Address(rsp, -64), Acc1); //Assembler::AVX_256bit
Need to first create space on the stack and then store.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 475:
> 473: }
> 474: __ movq(carry, digit);
> 475: __ sarq(carry, 52);
This was an unsigned (logical) shift in the Java code.
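The distinction only matters when digit can be negative: sarq corresponds to Java's >> (sign-extending) and shrq to >>> (zero-filling). A small Java sketch of the difference follows. The class and method names are hypothetical, and the shift count of 52 is taken from the quoted sarq(carry, 52).

```java
// Hypothetical illustration of arithmetic vs. logical right shift; not JDK code.
final class ShiftSketch {
    // Equivalent of sarq: the sign bit is replicated into the vacated bits.
    static long arithmeticCarry(long digit) {
        return digit >> 52;
    }

    // Equivalent of shrq: the vacated bits are filled with zeros.
    static long logicalCarry(long digit) {
        return digit >>> 52;
    }

    public static void main(String[] args) {
        long nonNegative = (1L << 52) + 5;
        long negative = -1L;
        // Both shifts agree for non-negative inputs.
        System.out.println(arithmeticCarry(nonNegative) + " " + logicalCarry(nonNegative));
        // They differ once the sign bit is set.
        System.out.println(arithmeticCarry(negative) + " " + logicalCarry(negative));
    }
}
```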
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 556:
> 554: // - constant time (i.e. no branches)
> 555: // - no-side channel (i.e. all memory must always be accessed, and in same order)
> 556: void assign_avx(Register aBase, Register bBase, int offset, XMMRegister select, XMMRegister tmp, XMMRegister aTmp, int vector_len, MacroAssembler* _masm) {
Good to add the comment from assign_scalar here as well:
// Original java:
// long dummyLimbs = maskValue & (a[i] ^ b[i]);
// a[i] = dummyLimbs ^ a[i];
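As a runnable illustration of the two quoted lines, here is a minimal Java sketch of a branch-free conditional assignment. The class name, the method signature, and the derivation of maskValue from a 0/1 flag are assumptions made for this example; only the two statements in the loop body come from the comment above.

```java
// Hypothetical sketch of a constant-time conditional assign; not the JDK source.
final class ConditionalAssignSketch {
    // Copies b into a when set == 1 and leaves a unchanged when set == 0.
    // Every limb of both arrays is read either way and no branch depends on set.
    static void conditionalAssign(int set, long[] a, long[] b) {
        long maskValue = -(long) set;  // assumption: set is 0 or 1, giving 0 or all ones
        for (int i = 0; i < a.length; i++) {
            long dummyLimbs = maskValue & (a[i] ^ b[i]);
            a[i] = dummyLimbs ^ a[i];
        }
    }

    public static void main(String[] args) {
        long[] a = { 1, 2, 3 };
        long[] b = { 7, 8, 9 };
        conditionalAssign(0, a, b);
        System.out.println(java.util.Arrays.toString(a));
        conditionalAssign(1, a, b);
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

When the mask is all ones, a[i] ^ (a[i] ^ b[i]) reduces to b[i]; when it is zero, a[i] ^ 0 leaves a[i] as it was.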
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974184239
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974171188
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974187392
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974203227
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974206111
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1974205184
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1972517671