RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

vpaprotsk duke at openjdk.org
Fri Oct 28 19:52:03 UTC 2022


On Thu, 27 Oct 2022 05:10:59 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   extra whitespace character
>
> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175:
> 
>> 173:             // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead
>> 174:             // and not affect platforms without intrinsic support
>> 175:             int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH;
> 
> Since Poly1305 processes 16-byte chunks, a strength-reduced version of the above expression could be len & ~(BLOCK_LENGTH - 1)

I have no issue with either version; I was mostly thinking about code clarity. Your version seems 'more reliable', so I'll switch to it. Thanks.
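
For reference, a minimal sketch of why the two forms agree. It assumes `BLOCK_LENGTH` is 16 (a power of two) and `len` is non-negative; the class and method names are hypothetical and not part of the patch.

```java
public class BlockMultiple {
    // Assumed Poly1305 block size in bytes; must be a power of two for the mask form.
    static final int BLOCK_LENGTH = 16;

    // Original form: integer division discards the remainder, then scales back up.
    static int viaDivide(int len) {
        return (len / BLOCK_LENGTH) * BLOCK_LENGTH;
    }

    // Strength-reduced form: clear the low log2(BLOCK_LENGTH) bits.
    static int viaMask(int len) {
        return len & ~(BLOCK_LENGTH - 1);
    }

    public static void main(String[] args) {
        // Check agreement over a range of non-negative lengths.
        for (int len = 0; len <= 4096; len++) {
            if (viaDivide(len) != viaMask(len)) {
                throw new AssertionError("mismatch at len=" + len);
            }
        }
        System.out.println(viaMask(1030)); // prints 1024
    }
}
```

The two forms differ for negative `len` (division rounds toward zero, the mask rounds toward negative infinity), so the equivalence only holds if `len` has already been validated as non-negative.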

> test/micro/org/openjdk/bench/javax/crypto/full/Poly1305DigestBench.java line 94:
> 
>> 92:             throw new RuntimeException(ex);
>> 93:         }
>> 94:     }
> 
> On CLX, the patch shows a performance regression of about 10% for data sizes 1024 and 2048.
> 
> CLX (Non-IFMA target)
> 
> Baseline (JDK-20):-
> 
> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score   Error  Units
> Poly1305DigestBench.digest          64              thrpt    2  3128928.978          ops/s
> Poly1305DigestBench.digest         256              thrpt    2  1526452.083          ops/s
> Poly1305DigestBench.digest        1024              thrpt    2   509267.401          ops/s
> Poly1305DigestBench.digest        2048              thrpt    2   305784.922          ops/s
> Poly1305DigestBench.digest        4096              thrpt    2   142175.885          ops/s
> Poly1305DigestBench.digest        8192              thrpt    2    72142.906          ops/s
> Poly1305DigestBench.digest       16384              thrpt    2    36357.000          ops/s
> Poly1305DigestBench.digest     1048576              thrpt    2      676.142          ops/s
> 
> 
> With opt:
> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score   Error  Units
> Poly1305DigestBench.digest          64              thrpt    2  3136204.416          ops/s
> Poly1305DigestBench.digest         256              thrpt    2  1683221.124          ops/s
> Poly1305DigestBench.digest        1024              thrpt    2   457432.172          ops/s
> Poly1305DigestBench.digest        2048              thrpt    2   277563.817          ops/s
> Poly1305DigestBench.digest        4096              thrpt    2   149393.357          ops/s
> Poly1305DigestBench.digest        8192              thrpt    2    79463.734          ops/s
> Poly1305DigestBench.digest       16384              thrpt    2    41083.730          ops/s
> Poly1305DigestBench.digest     1048576              thrpt    2      705.419          ops/s

Odd. I measured it on `11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz`; will measure again.
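
As a rough aid for comparing the two quoted tables, here is a small sketch that computes the relative throughput change per data size. The scores are copied from the CLX tables quoted above; the class and method names are hypothetical, and with `Cnt` of 2 and no error column the results should be read as indicative only.

```java
public class RelativeChange {
    // Percent change of the patched score relative to the baseline score.
    static double percentChange(double baseline, double patched) {
        return (patched - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        int[] sizes = {64, 256, 1024, 2048, 4096, 8192, 16384, 1048576};
        // ops/s from the quoted "Baseline (JDK-20)" table
        double[] baseline = {3128928.978, 1526452.083, 509267.401, 305784.922,
                             142175.885, 72142.906, 36357.000, 676.142};
        // ops/s from the quoted table measured with the patch applied
        double[] patched = {3136204.416, 1683221.124, 457432.172, 277563.817,
                            149393.357, 79463.734, 41083.730, 705.419};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%8d  %+6.1f%%%n", sizes[i],
                    percentChange(baseline[i], patched[i]));
        }
    }
}
```

By this calculation the 1024 and 2048 rows come out negative (roughly 10% and 9% lower), while the remaining rows are flat or positive.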

-------------

PR: https://git.openjdk.org/jdk/pull/10582


