RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions

Sandhya Viswanathan sviswanathan at openjdk.org
Tue Oct 18 23:25:10 UTC 2022


On Wed, 5 Oct 2022 21:28:26 GMT, vpaprotsk <duke at openjdk.org> wrote:

> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`.
> 
> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java.
>   - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please.
> - Added a JMH perf test.
>    - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
> 
> Perf before:
> 
> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
> Poly1305DigestBench.digest          64              thrpt    8  2961300.661 ± 110554.162  ops/s
> Poly1305DigestBench.digest         256              thrpt    8  1791912.962 ±  86696.037  ops/s
> Poly1305DigestBench.digest        1024              thrpt    8   637413.054 ±  14074.655  ops/s
> Poly1305DigestBench.digest       16384              thrpt    8    48762.991 ±    390.921  ops/s
> Poly1305DigestBench.digest     1048576              thrpt    8      769.872 ±      1.402  ops/s
> 
> and after:
> 
> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
> Poly1305DigestBench.digest          64              thrpt    8  2841243.668 ± 154528.057  ops/s
> Poly1305DigestBench.digest         256              thrpt    8  1662003.873 ±  95253.445  ops/s
> Poly1305DigestBench.digest        1024              thrpt    8  1770028.718 ± 100847.766  ops/s
> Poly1305DigestBench.digest       16384              thrpt    8   765547.287 ±  25883.825  ops/s
> Poly1305DigestBench.digest     1048576              thrpt    8    14508.458 ±     56.147  ops/s

src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 262:

> 260:     private static void processMultipleBlocks(byte[] input, int offset, int length, byte[] aBytes, byte[] rBytes) {
> 261:         MutableIntegerModuloP A = ipl1305.getElement(aBytes).mutable();
> 262:         MutableIntegerModuloP R = ipl1305.getElement(rBytes).mutable();

R doesn't need to be mutable.
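
To illustrate the point, here is a minimal sketch of the per-block update `a = (a + block) * r mod (2^130 - 5)`, in which only the accumulator changes and `r` is read-only. It uses `java.math.BigInteger` rather than the JDK-internal `IntegerModuloP` classes, and the class and method names are hypothetical, not taken from `Poly1305.java`.

```java
import java.math.BigInteger;

// Hypothetical sketch, not the JDK implementation: shows that the
// multiplier r is never modified while blocks are accumulated.
class Poly1305AccumulatorSketch {
    // Poly1305 prime: 2^130 - 5
    static final BigInteger P =
            BigInteger.ONE.shiftLeft(130).subtract(BigInteger.valueOf(5));

    // Interpret len bytes starting at off as a little-endian unsigned integer.
    static BigInteger fromLittleEndian(byte[] src, int off, int len) {
        byte[] bigEndian = new byte[len];
        for (int i = 0; i < len; i++) {
            bigEndian[len - 1 - i] = src[off + i];
        }
        return new BigInteger(1, bigEndian);
    }

    // Process every full 16-byte block in [offset, offset + length).
    // Returns the new accumulator; r is only read.
    static BigInteger processBlocks(byte[] input, int offset, int length,
                                    BigInteger a, final BigInteger r) {
        int end = offset + length;
        for (int pos = offset; pos + 16 <= end; pos += 16) {
            // Each full block gets an extra high bit at position 128.
            BigInteger n = fromLittleEndian(input, pos, 16).setBit(128);
            a = a.add(n).multiply(r).mod(P);
        }
        return a;
    }

    public static void main(String[] args) {
        BigInteger r = BigInteger.valueOf(2);
        BigInteger a = processBlocks(new byte[32], 0, 32, BigInteger.ZERO, r);
        System.out.println("accumulator = " + a.toString(16));
    }
}
```

In the actual method the same observation would mean keeping `A` mutable and holding `R` through an immutable type, assuming nothing else in the method updates it in place.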

src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 286:

> 284:      * numeric values.
> 285:      */
> 286:     private void setRSVals() { //throws InvalidKeyException {

The R and S check for an invalid key (all bytes zero) could be submitted as a separate PR; it is not related to the Poly1305 acceleration.
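
For that follow-up, a check along these lines might be a starting point. This is only a sketch under assumptions: the helper names are hypothetical, it inspects the raw 32-byte key (first 16 bytes for R, last 16 for S) rather than the clamped R value, and it does not address the conflict with the KAT mentioned in the PR description.

```java
import java.security.InvalidKeyException;

// Hypothetical sketch of an all-zero R/S check; not code from Poly1305.java.
class Poly1305KeyCheckSketch {
    // True if every byte in buf[from, to) is zero. OR-ing all bytes avoids
    // an early exit that would depend on the key contents.
    static boolean allZero(byte[] buf, int from, int to) {
        int acc = 0;
        for (int i = from; i < to; i++) {
            acc |= buf[i];
        }
        return acc == 0;
    }

    // Assumes the usual 32-byte one-time key layout: R || S.
    static void checkKey(byte[] key) throws InvalidKeyException {
        if (key == null || key.length != 32) {
            throw new InvalidKeyException("Poly1305 key must be 32 bytes");
        }
        if (allZero(key, 0, 16) || allZero(key, 16, 32)) {
            throw new InvalidKeyException("R or S is all zero");
        }
    }

    public static void main(String[] args) {
        try {
            checkKey(new byte[32]);
            System.out.println("key accepted");
        } catch (InvalidKeyException e) {
            System.out.println("key rejected: " + e.getMessage());
        }
    }
}
```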

test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305IntrinsicFuzzTest.java line 39:

> 37:         public static void main(String[] args) throws Exception {
> 38:                 //Note: it might be useful to increase this number during development of new Poly1305 intrinsics
> 39:                 final int repeat = 100;

Should we increase this repeat count so that the C2 compiler kicks in, compiles engineUpdate(), and puts the call to the intrinsic stub in place?
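
As a rough illustration of the idea, the sketch below runs a randomized comparison loop often enough that the method under test is likely to be compiled. The two checksum methods are hypothetical stand-ins for the Java and intrinsic Poly1305 paths, and the default count of 20,000 is an assumption; the number actually needed depends on the JVM's compilation thresholds and flags, and can be checked with `-XX:+PrintCompilation`.

```java
import java.util.Random;

// Hypothetical sketch: a fuzz loop with a configurable, larger repeat count.
class IntrinsicWarmupSketch {
    // Stand-in for the pure-Java reference path.
    static int referenceSum(byte[] data) {
        int s = 0;
        for (byte b : data) {
            s = 31 * s + b;
        }
        return s;
    }

    // Stand-in for the accelerated path: same result, two bytes per step.
    static int fastSum(byte[] data) {
        int s = 0;
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            s = 31 * (31 * s + data[i]) + data[i + 1];
        }
        if (i < data.length) {
            s = 31 * s + data[i];
        }
        return s;
    }

    // Compare both paths on random inputs; returns the number of mismatches.
    static int runFuzz(int repeat, long seed) {
        Random rnd = new Random(seed);
        int mismatches = 0;
        for (int n = 0; n < repeat; n++) {
            byte[] data = new byte[rnd.nextInt(1024)];
            rnd.nextBytes(data);
            if (referenceSum(data) != fastSum(data)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // -Drepeat=N overrides the assumed default.
        int repeat = Integer.getInteger("repeat", 20_000);
        System.out.println("mismatches: " + runFuzz(repeat, 42L));
    }
}
```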

test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305KAT.java line 133:

> 131:             System.out.println("*** Test " + ++testNumber + ": " +
> 132:                     test.testName);
> 133:             if (runSingleTest(test)) {

runSingleTest may need to be called enough times for engineUpdate() to be compiled by C2.

-------------

PR: https://git.openjdk.org/jdk/pull/10582


More information about the hotspot-dev mailing list