RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
Jatin Bhateja
jbhateja at openjdk.org
Fri Apr 5 09:20:10 UTC 2024
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:
>> Performance. Before:
>>
>> Benchmark (algorithm) (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
>> SignatureBench.ECDSA.sign SHA256withECDSA 1024 256 thrpt 3 6443.934 ± 6.491 ops/s
>> SignatureBench.ECDSA.sign SHA256withECDSA 16384 256 thrpt 3 6152.979 ± 4.954 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 1024 256 thrpt 3 1895.410 ± 36.979 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 thrpt 3 1878.955 ± 45.487 ops/s
>> Benchmark (algorithm) (keyLength) (kpgAlgorithm) (provider) Mode Cnt Score Error Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt 3 1357.810 ± 26.584 ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt 3 1352.119 ± 23.547 ops/s
>> Benchmark (isMontBench) Mode Cnt Score Error Units
>> PolynomialP256Bench.benchMultiply false thrpt 3 1746.126 ± 10.970 ops/s
>>
>> Performance, no intrinsic:
>>
>> Benchmark (algorithm) (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
>> SignatureBench.ECDSA.sign SHA256withECDSA 1024 256 thrpt 3 6529.839 ± 42.420 ops/s
>> SignatureBench.ECDSA.sign SHA256withECDSA 16384 256 thrpt 3 6199.747 ± 133.566 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 1024 256 thrpt 3 1973.676 ± 54.071 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 thrpt 3 1932.127 ± 35.920 ops/s
>> Benchmark (algorithm) (keyLength) (kpgAlgorithm) (provider) Mode Cnt Score Error Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt 3 1355.788 ± 29.858 ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt 3 1346.523 ± 28.722 ops/s
>> Benchmark (isMontBench) Mode Cnt Score Error Units
>> PolynomialP256Bench.benchMultiply true thrpt 3 1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
>
> remove use of jdk.crypto.ec
Few early comments.
Please update the copyright year of all the modified files.
You can even consider splitting this into two patches, Java side changes in one and x86 optimized intrinsic in next one.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39:
> 37: };
> 38: static address modulus_p256() {
> 39: return (address)MODULUS_P256;
Long constants should have UL suffix.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386:
> 384: __ jcc(Assembler::equal, L_Length19);
> 385:
> 386: // Default copy loop
Please add appropriate loop entry alignment.
src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
> 392: __ lea(aLimbs, Address(aLimbs,8));
> 393: __ lea(bLimbs, Address(bLimbs,8));
> 394: __ jmp(L_DefaultLoop);
Both sub and cmp are flag affecting instructions and are macro-fusible.
By doing a loop rotation i.e. moving the length <= 0 check outside the loop and pushing the loop exit check at bottom you can save additional compare checks.
-------------
PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-1981555803
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1553056633
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1552710600
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1553110376
More information about the core-libs-dev
mailing list