RFR: 8350459: MontgomeryIntegerPolynomialP256 multiply intrinsic with AVX2 on x86_64
Sandhya Viswanathan
sviswanathan at openjdk.org
Wed Feb 26 19:57:03 UTC 2025
On Thu, 20 Feb 2025 21:49:42 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:
> Add AVX2 Montgomery multiplication intrinsic (about 60-80% gain).
>
> Also add reduction to the existing AVX512 multiplication (this was left over from https://github.com/openjdk/jdk/pull/19893, where a quick fix was required). This is mostly cleanup, but there is about a 1-2% gain.
>
> Before (no AVX512)
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  3720.589  ± 17.879  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  3605.940  ± 15.807  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  1076.502  ±  4.190  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  1069.624  ±  2.484  ops/s
>
> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt    Score    Error  Units
> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  830.448  ± 2.285  ops/s
>
> After (with AVX2)
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  6000.496  ± 39.923  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  5739.878  ± 34.838  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  1942.437  ± 12.179  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  1921.770  ±  8.992  ops/s
>
> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1399.761  ± 6.238  ops/s
>
>
> Before (with AVX512):
>
> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  9621.950  ± 27.260  ops/s
> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  8975.654  ± 26.707  ops/s
> SignatureBench.ECDSA.verify SHA256withECDSA 102...
src/java.base/share/classes/sun/security/util/math/intpoly/MontgomeryIntegerPolynomialP256.java line 423:
> 421: r[2] = ((c7 & mask) | (c2 & ~mask));
> 422: r[3] = ((c8 & mask) | (c3 & ~mask));
> 423: r[4] = ((c9 & mask) | (c4 & ~mask));
It would be good to add a comment here explaining the selection. If the result (limbs c5 through c9) has overflowed by one modulus, then result - modulus (limbs c0 through c4) is non-negative; otherwise it is negative. In other words, the upper bits of c4 are all zeroes on overflow and all ones otherwise. Thus, on overflow the return value "r" should be set to result - modulus (c0 through c4); otherwise it should be set to result (c5 through c9).
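
To make the suggested comment concrete, here is a minimal standalone sketch of the branchless selection described above. It is not the code from the PR: the class name, the helper `select`, the array-based signature, and the shift by 63 used to derive `mask` are all assumptions made for illustration. The only part taken from the quoted lines is the `(a & mask) | (b & ~mask)` pattern.

```java
final class MaskSelectSketch {
    // Hypothetical helper (not the JDK code): returns `result` when `reduced`
    // (that is, result - modulus) is negative, and `reduced` otherwise.
    static long[] select(long[] result, long[] reduced) {
        int top = reduced.length - 1;
        // Arithmetic shift smears the sign bit across the word:
        // all ones (-1L) if the top limb is negative, all zeroes otherwise.
        // The shift width of 63 is an assumption for this sketch.
        long mask = reduced[top] >> 63;
        long[] r = new long[result.length];
        for (int i = 0; i < r.length; i++) {
            // Same shape as r[4] = ((c9 & mask) | (c4 & ~mask)) in the quoted code.
            r[i] = (result[i] & mask) | (reduced[i] & ~mask);
        }
        return r;
    }

    public static void main(String[] args) {
        long[] result = {11, 12, 13, 14, 15};
        // Negative top limb: no overflow, so `result` is kept.
        System.out.println(java.util.Arrays.toString(
                select(result, new long[] {1, 2, 3, 4, -1})));
        // Non-negative top limb: overflow, so result - modulus is taken.
        System.out.println(java.util.Arrays.toString(
                select(result, new long[] {1, 2, 3, 4, 0})));
    }
}
```

Selecting with a mask rather than an `if` keeps the control flow independent of the operand values, which is presumably why the original code is written this way.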
test/jdk/com/sun/security/util/math/intpoly/MontgomeryPolynomialFuzzTest.java line 2:
> 1: /*
> 2: * Copyright (c) 2025, Intel Corporation. All rights reserved.
This should be Copyright (c) 2024, 2025, Intel Corporation. All rights reserved.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1972301843
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1972267785