RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

Thu May 16 23:38:08 UTC 2024

On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

>> Performance. Before:
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3  1878.955 ± 45.487  ops/s
>> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  1352.119 ± 23.547  ops/s
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply          false  thrpt    3  1746.126 ± 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3  1932.127 ±  35.920  ops/s
>> Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  1346.523 ± 28.722  ops/s
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
> 
>   whitespace

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:

> 166:   XMMRegister broadcast5 = xmm24;
> 167:   KRegister limb0 = k1;
> 168:   KRegister limb5 = k2;

limb5 and select are not being used anymore.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:

> 183:   __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), false, Assembler::AVX_512bit, rscratch);
> 184: 
> 185:   // A = load(*aLimbs)

A little bit more description in comments on what the load step involves would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, or in the lowest limb.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:

> 268:   __ push(r14);
> 269:   __ push(r15);
> 270: 

No need to save/restore  rbx, r12, r14, r15.  Only r13 is used as temp in montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed to r8.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:

> 284:   __ mov(aLimbs, c_rarg0);
> 285:   __ mov(bLimbs, c_rarg1);
> 286:   __ mov(rLimbs, c_rarg2);

We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then these moves are not necessary.

src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:

> 1368: 
> 1369: #ifdef _LP64
> 1370:   if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize >= 64) {

No need to tie the intrinsic to MaxVectorSize setting.

src/hotspot/share/opto/library_call.cpp line 7564:

> 7562: 
> 7563:   if (!stubAddr) return false;
> 7564:   if (stopped())  return true;

Line 7564 seems redundant here as there is no range check or anything like that  before this.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604169603
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604141586
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604174141
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604175443
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603792252
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603865712