RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7]
Jamil Nimeh
jnimeh at openjdk.org
Fri Nov 4 16:32:16 UTC 2022
On Fri, 4 Nov 2022 03:20:11 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:
>> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`.
>>
>> - Added new KAT test for Poly1305 and a fuzz test to compare the intrinsic against the Java implementation.
>> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R == 0 || S == 0), so would like advice please.
>> - Added a JMH perf test.
>> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider.
>>
>> Perf before:
>>
>> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest          64              thrpt    8  2961300.661 ± 110554.162  ops/s
>> Poly1305DigestBench.digest         256              thrpt    8  1791912.962 ±  86696.037  ops/s
>> Poly1305DigestBench.digest        1024              thrpt    8   637413.054 ±  14074.655  ops/s
>> Poly1305DigestBench.digest       16384              thrpt    8    48762.991 ±    390.921  ops/s
>> Poly1305DigestBench.digest     1048576              thrpt    8      769.872 ±      1.402  ops/s
>>
>> and after:
>>
>> Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.digest          64              thrpt    8  2841243.668 ± 154528.057  ops/s
>> Poly1305DigestBench.digest         256              thrpt    8  1662003.873 ±  95253.445  ops/s
>> Poly1305DigestBench.digest        1024              thrpt    8  1770028.718 ± 100847.766  ops/s
>> Poly1305DigestBench.digest       16384              thrpt    8   765547.287 ±  25883.825  ops/s
>> Poly1305DigestBench.digest     1048576              thrpt    8    14508.458 ±     56.147  ops/s
>
> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>
> - Merge remote-tracking branch 'origin/master' into avx512-poly
> - address Jamil's review
> - invalidkeyexception and some review comments
> - extra whitespace character
> - assembler checks and test case fixes
> - Merge remote-tracking branch 'origin/master' into avx512-poly
> - Merge remote-tracking branch 'origin' into avx512-poly
> - further restrict UsePolyIntrinsics with supports_avx512vlbw
> - missed white-space fix
> - - Fix whitespace and copyright statements
>   - Add benchmark
> - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c
src/hotspot/share/opto/library_call.cpp line 7036:
> 7034: assert(r_start, "r array is NULL");
> 7035:
> 7036: Node* call = make_runtime_call(RC_LEAF,
Can we safely change this to `RC_LEAF | RC_NO_FP`? For the ChaCha20 block intrinsic I'm working on, I've been using that flag because I don't touch the FP registers, and that looks to be the case here too (though your intrinsic is a lot more complicated than mine, so I may have missed something). I believe the GHASH and AES library call routines also call `make_runtime_call()` this way.
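
On the open question in the PR description about detecting (R==0 || S==0): below is a minimal sketch of what such a check could look like. It assumes the check runs on the raw 32-byte one-time key, with `r` (the first 16 bytes) clamped as described in RFC 8439 before it is tested, and `s` (the last 16 bytes) tested as-is. The class and method names (`Poly1305KeyCheck`, `clampedR`, `isDegenerate`) are hypothetical and are not taken from `Poly1305.java`; whether the real check should look at the clamped or the unclamped `r` is also an assumption.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the JDK: flags Poly1305 keys whose
// r or s half is all zero.
final class Poly1305KeyCheck {
    private static final int KEY_LENGTH = 32;
    private static final int HALF = 16;

    private Poly1305KeyCheck() {}

    // Returns a clamped copy of r (key bytes 0..15) following the
    // clamping rule in RFC 8439; the input array is not modified.
    public static byte[] clampedR(byte[] key) {
        byte[] r = Arrays.copyOfRange(key, 0, HALF);
        // Clear the top four bits of bytes 3, 7, 11 and 15.
        r[3] &= 15;
        r[7] &= 15;
        r[11] &= 15;
        r[15] &= 15;
        // Clear the bottom two bits of bytes 4, 8 and 12.
        r[4] &= 252;
        r[8] &= 252;
        r[12] &= 252;
        return r;
    }

    // True if every byte in a[from..to) is zero. All bytes are OR-ed
    // together so the loop does not exit early on the first non-zero byte.
    private static boolean isAllZero(byte[] a, int from, int to) {
        int acc = 0;
        for (int i = from; i < to; i++) {
            acc |= a[i];
        }
        return acc == 0;
    }

    // True if the clamped r is zero or s is zero (assumed definition of
    // a "degenerate" key for this sketch).
    public static boolean isDegenerate(byte[] key) {
        if (key == null || key.length != KEY_LENGTH) {
            throw new IllegalArgumentException("Poly1305 key must be 32 bytes");
        }
        return isAllZero(clampedR(key), 0, HALF)
                || isAllZero(key, HALF, KEY_LENGTH);
    }
}
```

A caller could turn a `true` result into an `InvalidKeyException`. Note that with this definition a key whose only non-zero `r` bits are removed by clamping is also reported, which may or may not be the desired behavior given the conflict with the KAT mentioned above.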
-------------
PR: https://git.openjdk.org/jdk/pull/10582