RFR: 8247645: ChaCha20 intrinsics [v3]

Sandhya Viswanathan sviswanathan at openjdk.org
Thu Nov 10 20:27:37 UTC 2022


On Thu, 10 Nov 2022 20:12:30 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

>> Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   replace hi/lo word shuffles and left-right shift/or operations for vpshufd on byte-aligned rotations
>
> Using vpshufb (not vpshufd, as I typo'ed in my commit message) on AVX/AVX2 for 8-bit and 16-bit left rotations has given us some modest speed gains, roughly 7% to 11% across the data sizes below:
> Before (with intrinsics):
> 
> AVX=1
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s
> 
> AVX=2
> ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s
> 
> After (using vpshufb):
> 
> AVX=1
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt                  256    thrpt   40  1447416.349 ± 14054.478  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   495844.721 ±  1949.237  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   138154.478 ±   411.707  ops/s
> ChaCha20.encrypt                16384    thrpt   40    35165.143 ±   110.483  ops/s
> 
> AVX=2
> ChaCha20.encrypt                  256    thrpt   40  2020170.211 ± 10507.466  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   829644.325 ±  6452.931  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   246066.542 ±  1052.905  ops/s
> ChaCha20.encrypt                16384    thrpt   40    64021.363 ±   468.979  ops/s
> 
> This was done on the same system that the original benchmarks were done on.  None of these changes affect AVX512.
> 
> I'm working on a hybrid intrinsic approach to get the best of both worlds for those smaller single-part jobs.

@jnimeh Very nice work overall. I think it would be OK to get this PR integrated and do the hybrid approach as a follow-on PR. Your work shows a very good improvement over the baseline.
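
For readers following along, here is a minimal scalar sketch of why a byte shuffle can stand in for the shift/or sequence on byte-aligned rotations: rotating a 32-bit word left by 8 or 16 bits only moves whole bytes, so it can be written as a fixed byte permutation, which is the kind of operation vpshufb performs on every 32-bit element of a vector. The class name, method names, and index tables below are hypothetical illustrations and are not taken from the PR; the actual intrinsic emits assembly through the HotSpot macro assembler.

```java
// Scalar model of a byte shuffle applied to one 32-bit word.
// All names here are hypothetical and exist only for illustration.
class ByteRotateSketch {
    // Destination byte i takes source byte MASK[i]; byte 0 is the least significant.
    static final int[] ROTL8_MASK  = {3, 0, 1, 2};
    static final int[] ROTL16_MASK = {2, 3, 0, 1};

    // Permute the four bytes of x according to mask.
    static int shuffleBytes(int x, int[] mask) {
        int result = 0;
        for (int i = 0; i < 4; i++) {
            int srcByte = (x >>> (8 * mask[i])) & 0xFF;
            result |= srcByte << (8 * i);
        }
        return result;
    }

    static int rotl8(int x)  { return shuffleBytes(x, ROTL8_MASK); }
    static int rotl16(int x) { return shuffleBytes(x, ROTL16_MASK); }

    public static void main(String[] args) {
        int x = 0x11223344;
        // Compare the shuffle-based rotation with the shift/or form.
        System.out.println(rotl8(x) == Integer.rotateLeft(x, 8));
        System.out.println(rotl16(x) == Integer.rotateLeft(x, 16));
    }
}
```

ChaCha20's 7-bit and 12-bit rotations are not byte-aligned, so this trick does not apply to them and they would still need the shift/or form.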

-------------

PR: https://git.openjdk.org/jdk/pull/7702

