RFR: 8247645: ChaCha20 intrinsics [v3]
Jamil Nimeh
jnimeh at openjdk.org
Thu Nov 10 20:15:14 UTC 2022
On Thu, 10 Nov 2022 20:11:46 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:
>> This PR delivers ChaCha20 intrinsics that accelerate the core block function that generates key stream from the key, counter and nonce. Intrinsics have been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Note: Microbenchmark results moved to a comment in the PR so we don't have to see it in every email.
>>
>> Special thanks to the folks who have made many helpful comments while this PR was in draft form.
>
> Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision:
>
> replace hi/lo word shuffles and left-right shift/or operations for vpshufd on byte-aligned rotations
Using vpshufb (not vpshufd, as I mistyped in my commit message) on AVX/AVX2 for the 8-bit and 16-bit left rotations has given us some modest speed gains:
Before (with intrinsics):

AVX=1
Benchmark         (dataSize)   Mode  Cnt        Score       Error  Units
ChaCha20.encrypt         256  thrpt   40  1338667.215 ± 12012.240  ops/s
ChaCha20.encrypt        1024  thrpt   40   453682.363 ±  2559.322  ops/s
ChaCha20.encrypt        4096  thrpt   40   124785.645 ±   394.535  ops/s
ChaCha20.encrypt       16384  thrpt   40    31788.969 ±    90.770  ops/s

AVX=2
Benchmark         (dataSize)   Mode  Cnt        Score       Error  Units
ChaCha20.encrypt         256  thrpt   40  1893810.127 ± 21870.718  ops/s
ChaCha20.encrypt        1024  thrpt   40   758024.511 ±  5414.552  ops/s
ChaCha20.encrypt        4096  thrpt   40   224032.805 ±   935.309  ops/s
ChaCha20.encrypt       16384  thrpt   40    58112.296 ±   498.048  ops/s

After (using vpshufb):

AVX=1
Benchmark         (dataSize)   Mode  Cnt        Score       Error  Units
ChaCha20.encrypt         256  thrpt   40  1447416.349 ± 14054.478  ops/s
ChaCha20.encrypt        1024  thrpt   40   495844.721 ±  1949.237  ops/s
ChaCha20.encrypt        4096  thrpt   40   138154.478 ±   411.707  ops/s
ChaCha20.encrypt       16384  thrpt   40    35165.143 ±   110.483  ops/s

AVX=2
Benchmark         (dataSize)   Mode  Cnt        Score       Error  Units
ChaCha20.encrypt         256  thrpt   40  2020170.211 ± 10507.466  ops/s
ChaCha20.encrypt        1024  thrpt   40   829644.325 ±  6452.931  ops/s
ChaCha20.encrypt        4096  thrpt   40   246066.542 ±  1052.905  ops/s
ChaCha20.encrypt       16384  thrpt   40    64021.363 ±   468.979  ops/s
These runs were done on the same system as the original benchmarks. None of these changes affect AVX512.
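To put a number on the before/after throughput figures, here is a small sketch that computes the relative gain per data size. The class and method names (SpeedupSketch, gainPercent) are made up for illustration; the scores are copied from the benchmark output in this message, and the error margins are ignored, so treat the results as approximate.

```java
public class SpeedupSketch {
    // Relative throughput gain in percent: (after / before - 1) * 100.
    public static double gainPercent(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        int[] sizes = {256, 1024, 4096, 16384};

        // Scores (ops/s) copied from the benchmark output; row 0 is AVX=1, row 1 is AVX=2.
        double[][] before = {
            {1338667.215, 453682.363, 124785.645, 31788.969},
            {1893810.127, 758024.511, 224032.805, 58112.296}
        };
        double[][] after = {
            {1447416.349, 495844.721, 138154.478, 35165.143},
            {2020170.211, 829644.325, 246066.542, 64021.363}
        };

        for (int avx = 0; avx < before.length; avx++) {
            System.out.println("AVX=" + (avx + 1));
            for (int i = 0; i < sizes.length; i++) {
                System.out.printf("  %5d bytes: %+.1f%%%n",
                        sizes[i], gainPercent(before[avx][i], after[avx][i]));
            }
        }
    }
}
```

With these inputs the gains come out at roughly 7% to 11%, depending on data size and AVX level.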
I'm working on a hybrid intrinsic approach to get the best of both worlds for those smaller single-part jobs.
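For readers wondering why a byte shuffle can stand in for the byte-aligned rotations discussed above: rotating a 32-bit word left by 8 or 16 bits only moves whole bytes, so a single byte permutation per 128-bit lane gives the same result as the shift/shift/or sequence. The sketch below is a scalar Java model of that idea, not the intrinsic code from the PR. The class and method names are hypothetical, the words are assumed to be stored little-endian as on x86_64, and the zeroing behavior of vpshufb (mask high bit set) is not modeled because these masks never use it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteRotateSketch {
    // Scalar model of a 16-byte shuffle: out[i] = lane[mask[i]].
    public static byte[] shuffleBytes(byte[] lane, int[] mask) {
        byte[] out = new byte[16];
        for (int i = 0; i < 16; i++) {
            out[i] = lane[mask[i] & 0x0F];
        }
        return out;
    }

    // Shuffle mask that rotates each little-endian 32-bit word left by
    // a byte-aligned amount (8, 16 or 24 bits).
    public static int[] rotlMask(int bits) {
        int shift = bits / 8;
        int[] mask = new int[16];
        for (int i = 0; i < 16; i++) {
            int wordBase = i & ~3;   // first byte of the word holding byte i
            int pos = i & 3;         // byte position inside that word
            mask[i] = wordBase + ((pos - shift) & 3);
        }
        return mask;
    }

    // Rotate four 32-bit words left using only the byte shuffle.
    public static int[] rotlWords(int[] words, int bits) {
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int w : words) {
            buf.putInt(w);
        }
        byte[] shuffled = shuffleBytes(buf.array(), rotlMask(bits));
        ByteBuffer res = ByteBuffer.wrap(shuffled).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) {
            out[i] = res.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] words = {0x11223344, 0x80000001, 0xdeadbeef, 0x00000000};
        for (int bits : new int[] {8, 16}) {
            int[] viaShuffle = rotlWords(words, bits);
            for (int i = 0; i < 4; i++) {
                // Integer.rotateLeft is the shift-and-or form of the same rotation.
                if (viaShuffle[i] != Integer.rotateLeft(words[i], bits)) {
                    throw new AssertionError("mismatch at word " + i + " for " + bits + " bits");
                }
            }
        }
        System.out.println("byte shuffle matches shift/or rotation for 8 and 16 bits");
    }
}
```

The ChaCha20 quarter round rotates by 16, 12, 8 and 7 bits; only the 16-bit and 8-bit rotations are byte-aligned, so the 12-bit and 7-bit ones cannot be expressed as a byte permutation this way.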
-------------
PR: https://git.openjdk.org/jdk/pull/7702
More information about the security-dev mailing list