RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9]

Wed Nov 9 00:29:32 UTC 2022

On Tue, 8 Nov 2022 23:21:58 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

>> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`.
>> 
>> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java.
>>   - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please.
>> - Added a JMH perf test.
>>    - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
>> 
>> Perf before:
>> 
>> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest          64              thrpt    8  2961300.661 ± 110554.162  ops/s
>> Poly1305DigestBench.digest         256              thrpt    8  1791912.962 ±  86696.037  ops/s
>> Poly1305DigestBench.digest        1024              thrpt    8   637413.054 ±  14074.655  ops/s
>> Poly1305DigestBench.digest       16384              thrpt    8    48762.991 ±    390.921  ops/s
>> Poly1305DigestBench.digest     1048576              thrpt    8      769.872 ±      1.402  ops/s
>> 
>> and after:
>> 
>> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest          64              thrpt    8  2841243.668 ± 154528.057  ops/s
>> Poly1305DigestBench.digest         256              thrpt    8  1662003.873 ±  95253.445  ops/s
>> Poly1305DigestBench.digest        1024              thrpt    8  1770028.718 ± 100847.766  ops/s
>> Poly1305DigestBench.digest       16384              thrpt    8   765547.287 ±  25883.825  ops/s
>> Poly1305DigestBench.digest     1048576              thrpt    8    14508.458 ±     56.147  ops/s
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix 32-bit build

src/hotspot/cpu/x86/macroAssembler_x86.hpp line 970:

> 968: 
> 969:   void addmq(int disp, Register r1, Register r2);
> 970: 

Leftover formatting changes.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 95:

> 93: 
> 94:   // OFFSET 64: mask_44
> 95:   0xfffffffffff, 0xfffffffffff,

Please keep the leading zeroes explicit in the constants.
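
To illustrate the point, here is a minimal sketch of the same 44-bit mask written with and without padding. The class and constant names are hypothetical, and Java is used only for brevity (the table under review is C++):

```java
public class Mask44Example {
    // Unpadded form, as in the patch: the bit width is hard to verify at a glance.
    static final long MASK_44_SHORT = 0xfffffffffffL;

    // Padded to the full 64-bit width: 5 zero nibbles followed by 11 'f' nibbles.
    static final long MASK_44_PADDED = 0x00000fffffffffffL;

    // Both spellings denote the same value, namely 2^44 - 1.
    static boolean sameMask() {
        return MASK_44_SHORT == MASK_44_PADDED
            && MASK_44_PADDED == (1L << 44) - 1;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(MASK_44_PADDED) + " " + sameMask());
    }
}
```

With the padding in place, every entry in the table has the same number of digits, so a missing or extra nibble stands out.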

src/hotspot/cpu/x86/stubRoutines_x86.cpp line 2:

> 1: /*
> 2:  * Copyright (c) 2013, 2022, Oracle and/or its affiliates. All rights reserved.

No changes in the file anymore.

src/hotspot/share/opto/library_call.cpp line 7014:

> 7012:   const TypeKlassPtr* rklass = TypeKlassPtr::make(instklass_ImmutableElement);
> 7013:   const TypeOopPtr* rtype = rklass->as_instance_type()->cast_to_ptr_type(TypePtr::NotNull);
> 7014:   Node* rObj = new CheckCastPPNode(control(), rFace, rtype);

FTR, this is an unsafe cast since it doesn't involve a runtime check from `IntegerModuloP` to `ImmutableElement`. Please lift as many checks as possible into the Java wrapper.
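
A rough sketch of what I mean by lifting the check, assuming simplified stand-in types: the nested `IntegerModuloP` and `ImmutableElement` below are hypothetical placeholders for the real classes, and `processIntrinsicCandidate` is a made-up name for the method that would be intrinsified:

```java
public class CastLiftExample {
    // Hypothetical stand-ins for the real field-element types.
    interface IntegerModuloP {
        long[] getLimbs();
    }

    static final class ImmutableElement implements IntegerModuloP {
        private final long[] limbs;
        ImmutableElement(long[] limbs) { this.limbs = limbs.clone(); }
        @Override public long[] getLimbs() { return limbs; }
    }

    // Java wrapper: the checked cast and the field load happen here, so the
    // compiler emits the runtime type check and the intrinsic never needs an
    // unchecked CheckCastPP.
    static long process(IntegerModuloP r) {
        ImmutableElement re = (ImmutableElement) r; // throws ClassCastException on mismatch
        return processIntrinsicCandidate(re.getLimbs());
    }

    // Hypothetical intrinsic candidate: it only ever sees a primitive array.
    // The body is a placeholder (a plain sum), not Poly1305 arithmetic.
    static long processIntrinsicCandidate(long[] rLimbs) {
        long sum = 0;
        for (long limb : rLimbs) {
            sum += limb;
        }
        return sum;
    }
}
```

This way a wrong receiver type fails with an ordinary `ClassCastException` in Java code instead of being silently trusted inside the intrinsic.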

src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175:

> 173: 
> 174:         int blockMultipleLength = len & (~(BLOCK_LENGTH-1));
> 175:         Objects.checkFromIndexSize(offset, blockMultipleLength, input.length);

I suggest moving the checks into `processMultipleBlocks`, introducing a new static helper method specifically for the intrinsic part, and lifting more logic (e.g., field loads) from the intrinsic into Java code.

As an additional step, you can switch to a double-register addressing mode (base + offset) for the input data (`input`, `alimbs`, `rlimbs`) and simplify the intrinsic part even more (this will involve a switch from `array_element_address` to `make_unsafe_address`).
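
A possible shape for the Java side, as a sketch only: the helper name `processMultipleBlocks0`, the field names, the limb array sizes, and the placeholder body are all assumptions for illustration, not the actual `Poly1305.java` code:

```java
import java.util.Objects;

public class BlockHelperExample {
    static final int BLOCK_LENGTH = 16;

    // Hypothetical instance state; the real class keeps its limbs elsewhere.
    final long[] aLimbs = new long[5];
    final long[] rLimbs = new long[5];

    // Bounds checks and field loads stay in plain Java.
    int processMultipleBlocks(byte[] input, int offset, int len) {
        // Round len down to a whole number of 16-byte blocks.
        int blockMultipleLength = len & ~(BLOCK_LENGTH - 1);
        Objects.checkFromIndexSize(offset, blockMultipleLength, input.length);
        processMultipleBlocks0(input, offset, blockMultipleLength, aLimbs, rLimbs);
        return blockMultipleLength;
    }

    // Hypothetical static helper that would be the intrinsic candidate.
    // It receives already-validated arguments and raw arrays only.
    // Placeholder body: counts consumed blocks instead of doing Poly1305 math.
    static void processMultipleBlocks0(byte[] input, int offset, int length,
                                       long[] aLimbs, long[] rLimbs) {
        aLimbs[0] += length / BLOCK_LENGTH;
    }
}
```

Because the helper is static and takes the arrays as arguments, the intrinsic would not need to load any fields from `this` or repeat the range check.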

-------------

PR: https://git.openjdk.org/jdk/pull/10582