RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7]

Jamil Nimeh jnimeh at openjdk.org
Fri Nov 4 16:32:16 UTC 2022


On Fri, 4 Nov 2022 03:20:11 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

>> Handcrafted x86_64 asm for Poly1305. The main optimization is to process 16 message blocks at a time. For more details, see the comments in `macroAssembler_x86_poly.cpp`.
>> 
>> - Added a new KAT test for Poly1305 and a fuzz test to compare the intrinsic and Java implementations.
>>   - I would like to add an `InvalidKeyException` in `Poly1305.java` (see the commented-out block in that file), but that conflicts with the KAT. I do think we should detect (R == 0 || S == 0), so I would like advice, please.
>> - Added a JMH perf test.
>>    - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
>> 
>> Perf before:
>> 
>> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest          64              thrpt    8  2961300.661 ± 110554.162  ops/s
>> Poly1305DigestBench.digest         256              thrpt    8  1791912.962 ±  86696.037  ops/s
>> Poly1305DigestBench.digest        1024              thrpt    8   637413.054 ±  14074.655  ops/s
>> Poly1305DigestBench.digest       16384              thrpt    8    48762.991 ±    390.921  ops/s
>> Poly1305DigestBench.digest     1048576              thrpt    8      769.872 ±      1.402  ops/s
>> 
>> and after:
>> 
>> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest          64              thrpt    8  2841243.668 ± 154528.057  ops/s
>> Poly1305DigestBench.digest         256              thrpt    8  1662003.873 ±  95253.445  ops/s
>> Poly1305DigestBench.digest        1024              thrpt    8  1770028.718 ± 100847.766  ops/s
>> Poly1305DigestBench.digest       16384              thrpt    8   765547.287 ±  25883.825  ops/s
>> Poly1305DigestBench.digest     1048576              thrpt    8    14508.458 ±     56.147  ops/s
>
> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
> 
>  - Merge remote-tracking branch 'origin/master' into avx512-poly
>  - address Jamil's review
>  - invalidkeyexception and some review comments
>  - extra whitespace character
>  - assembler checks and test case fixes
>  - Merge remote-tracking branch 'origin/master' into avx512-poly
>  - Merge remote-tracking branch 'origin' into avx512-poly
>  - further restrict UsePolyIntrinsics with supports_avx512vlbw
>  - missed white-space fix
>  - - Fix whitespace and copyright statements
>    - Add benchmark
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c

src/hotspot/share/opto/library_call.cpp line 7036:

> 7034:   assert(r_start, "r array is NULL");
> 7035: 
> 7036:   Node* call = make_runtime_call(RC_LEAF,

Can we safely change this to `RC_LEAF | RC_NO_FP`? For the ChaCha20 block intrinsic I'm working on, I've been using that flag because I don't touch the FP registers, and that looks to be the case here too (though your intrinsic is a lot more complicated than mine, so I may have missed something). I believe the GHASH and AES library call routines also call `make_runtime_call()` this way.
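
For reference, here is a minimal standalone sketch of how I understand these flags to compose. It is not HotSpot code: the enum values are my assumption, modeled on the flags that `make_runtime_call()` accepts, and `describe_call` is a hypothetical helper that only names the call node I would expect each flag set to select. Please check `graphKit.hpp` and `graphKit.cpp` for the real definitions.

```cpp
#include <cassert>
#include <string>

// Assumed flag values, modeled on the flags passed to make_runtime_call().
// These are illustrative; the real definitions live in graphKit.hpp.
enum RuntimeCallFlags {
  RC_LEAF  = 0,  // null value: no flags set
  RC_NO_FP = 1   // leaf call whose stub does not touch FP registers
};

// Hypothetical helper: returns the name of the call node that I would
// expect a given flag set to select for a leaf runtime call.
std::string describe_call(int flags) {
  if (flags & RC_NO_FP) {
    return "CallLeafNoFPNode";
  }
  return "CallLeafNode";
}
```

Since `RC_LEAF` contributes no bits in this sketch, `RC_LEAF | RC_NO_FP` is the same value as `RC_NO_FP` alone; writing both just makes the intent explicit at the call site.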

-------------

PR: https://git.openjdk.org/jdk/pull/10582


More information about the hotspot-compiler-dev mailing list