RFR: 8247645: ChaCha20 intrinsics [v3]

Thu Nov 10 20:15:14 UTC 2022

On Thu, 10 Nov 2022 20:11:46 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

>> This PR delivers ChaCha20 intrinsics that accelerate the core block function that generates key stream from the key, counter and nonce.  Intrinsics have been written for the following platforms and instruction sets:
>> 
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>> 
>> Note: Microbenchmark results moved to a comment in the PR so we don't have to see it in every email.
>> 
>> Special thanks to the folks who have made many helpful comments while this PR was in draft form.
>
> Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision:
> 
>   replace hi/lo word shuffles and left-right shift/or operations for vpshufd on byte-aligned rotations

Using vpshufb (not vpshufd, as I typo'ed in my commit message) on AVX/AVX2 for 8-bit and 16-bit left rotations has given us some modest speed gains, roughly 7-11% in throughput across the data sizes below.

Before (with intrinsics):

AVX=1
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s

AVX=2
ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s

After (using vpshufb):

AVX=1
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.encrypt                  256    thrpt   40  1447416.349 ± 14054.478  ops/s
ChaCha20.encrypt                 1024    thrpt   40   495844.721 ±  1949.237  ops/s
ChaCha20.encrypt                 4096    thrpt   40   138154.478 ±   411.707  ops/s
ChaCha20.encrypt                16384    thrpt   40    35165.143 ±   110.483  ops/s

AVX=2
ChaCha20.encrypt                  256    thrpt   40  2020170.211 ± 10507.466  ops/s
ChaCha20.encrypt                 1024    thrpt   40   829644.325 ±  6452.931  ops/s
ChaCha20.encrypt                 4096    thrpt   40   246066.542 ±  1052.905  ops/s
ChaCha20.encrypt                16384    thrpt   40    64021.363 ±   468.979  ops/s

These runs used the same system as the original benchmarks.  None of these changes affect AVX512.
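
For anyone wondering why a byte shuffle can stand in for the shift/shift/or sequence: a left rotation of a 32-bit word by 8 or 16 bits only moves whole bytes, so it amounts to a fixed byte permutation. The Java sketch below is a scalar model for illustration only, not the intrinsic code from the PR. The class name ByteRotateSketch and the helpers rotateMask, shuffleBytes and rotateViaShuffle are hypothetical names made up for this example, and the model assumes a single 128-bit lane holding four little-endian 32-bit words.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteRotateSketch {
    // Build a 16-byte shuffle mask that rotates each little-endian 32-bit
    // word left by 8 * byteCount bits (byteCount is 1 or 2 here).
    // Output byte j of a word is taken from input byte (j - byteCount) mod 4.
    public static byte[] rotateMask(int byteCount) {
        byte[] mask = new byte[16];
        for (int word = 0; word < 4; word++) {
            for (int j = 0; j < 4; j++) {
                mask[4 * word + j] = (byte) (4 * word + ((j - byteCount) & 3));
            }
        }
        return mask;
    }

    // Scalar model of a byte shuffle within one 128-bit lane: each output
    // byte is selected by the low 4 bits of the matching mask byte, or
    // zeroed if the mask byte has its top bit set.
    public static byte[] shuffleBytes(byte[] src, byte[] mask) {
        byte[] out = new byte[16];
        for (int i = 0; i < 16; i++) {
            out[i] = (mask[i] < 0) ? 0 : src[mask[i] & 0x0F];
        }
        return out;
    }

    // Rotate four 32-bit words left by 8 * byteCount bits using only the
    // byte shuffle above. Expects exactly four words.
    public static int[] rotateViaShuffle(int[] words, int byteCount) {
        ByteBuffer in = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int w : words) {
            in.putInt(w);
        }
        byte[] shuffled = shuffleBytes(in.array(), rotateMask(byteCount));
        ByteBuffer res = ByteBuffer.wrap(shuffled).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) {
            out[i] = res.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] words = {0x11223344, 0x61707865, 0xDEADBEEF, 0x00000001};
        for (int bits : new int[] {8, 16}) {
            int[] viaShuffle = rotateViaShuffle(words, bits / 8);
            for (int i = 0; i < 4; i++) {
                // Reference result from the usual shift/shift/or rotation.
                int expected = Integer.rotateLeft(words[i], bits);
                if (viaShuffle[i] != expected) {
                    throw new AssertionError("mismatch at word " + i + " for " + bits + " bits");
                }
            }
        }
        System.out.println("byte shuffle matches Integer.rotateLeft for 8 and 16 bits");
    }
}
```

The other two rotation amounts in the ChaCha20 quarter round (12 and 7 bits) are not byte-aligned, so they cannot be expressed as a byte permutation and still need the shift/or form.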

I'm working on a hybrid intrinsic approach to get the best of both worlds for those smaller single-part jobs.

-------------

PR: https://git.openjdk.org/jdk/pull/7702