RFR: 8247645: ChaCha20 intrinsics

Jamil Nimeh jnimeh at openjdk.org
Mon Nov 7 08:54:08 UTC 2022


On Mon, 7 Nov 2022 08:04:15 GMT, Daniel Jeliński <djelinski at openjdk.org> wrote:

> Is it expected that AVX3 is 35% slower than AVX2 and 8% slower than AVX1?

Well, it isn't slower than AVX/AVX2 across the board.  For plain ChaCha20 it is slower for this particular benchmark at 256 bytes (and smaller I would assume), but that changes at data sizes above 256 bytes.  I haven't worked out the timings exactly, but this is what I think is happening:
The AVX512 intrinsic broadcasts into registers from memory using twice as many registers, each twice the width of AVX2's, and likewise writes 4x as much keystream data per invocation. I'm not certain by how much, but I believe a single run of the AVX512 intrinsic takes longer than a single run of the AVX/AVX2 intrinsic. When the job size is 256 bytes, both AVX2 and AVX512 run their intrinsics once, which may account for the speed difference; AVX has to run its intrinsic twice at 256 bytes, which is why its slowdown relative to AVX512 is smaller.

When you get to 1024 bytes, AVX has to run 8 times to generate enough keystream and AVX2 has to run 4 times, but AVX512 still only has to run once. So there AVX512 outperforms the other two, and continues to do so for any larger single-part encryption job this benchmark is doing. I haven't tried running other sizes yet to see where that crossover point is, but I suspect it is probably once a job gets above 512 bytes.
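To make that arithmetic concrete, here is a small back-of-the-envelope sketch (not JDK code): it just counts intrinsic invocations for a given job size, and the 128/256/1024-byte per-invocation keystream sizes are my reading of the numbers above rather than values taken from the actual intrinsic implementation.

    // Back-of-the-envelope sketch: how many times each intrinsic variant has
    // to run to cover a single-part job, assuming per-invocation keystream
    // sizes of 128 bytes (AVX), 256 bytes (AVX2), and 1024 bytes (AVX512).
    public class Cc20InvocationCount {
        static int invocations(int jobBytes, int keystreamBytesPerRun) {
            // ceiling division
            return (jobBytes + keystreamBytesPerRun - 1) / keystreamBytesPerRun;
        }

        public static void main(String[] args) {
            int[] jobSizes = {256, 512, 1024, 16384};
            System.out.printf("%8s %6s %6s %8s%n", "job", "AVX", "AVX2", "AVX512");
            for (int job : jobSizes) {
                System.out.printf("%8d %6d %6d %8d%n",
                        job,
                        invocations(job, 128),
                        invocations(job, 256),
                        invocations(job, 1024));
            }
        }
    }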

This particular benchmark is a single-part encryption or decryption.  The performance characteristics look different when you take a large buffer and submit multi-part updates.  In that case, at 16-byte updates AVX512 and AVX2 are nearly identical (AVX2 is 2% faster, AVX is already 23% slower); at 64 bytes AVX512 is faster than everything, and the gap widens as the update sizes grow.
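For reference, a minimal sketch (not the actual benchmark) of the two usage patterns being compared, using the plain ChaCha20 Cipher from javax.crypto; the key, nonce, buffer size, and chunk size are arbitrary demo values:

    import javax.crypto.Cipher;
    import javax.crypto.spec.ChaCha20ParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.util.Arrays;

    // Single-part doFinal() over the whole buffer vs. repeated small update() calls.
    public class Cc20Patterns {
        public static void main(String[] args) throws Exception {
            byte[] key = new byte[32];    // all-zero key, demo only
            byte[] nonce = new byte[12];  // all-zero nonce, demo only
            byte[] data = new byte[16384];

            SecretKeySpec keySpec = new SecretKeySpec(key, "ChaCha20");
            ChaCha20ParameterSpec params = new ChaCha20ParameterSpec(nonce, 1);

            // Single-part: one big job, so the intrinsic can cover the whole
            // buffer in as few invocations as the data size allows.
            Cipher single = Cipher.getInstance("ChaCha20");
            single.init(Cipher.ENCRYPT_MODE, keySpec, params);
            byte[] ct1 = single.doFinal(data);

            // Multi-part: the same buffer fed in small update() chunks, so each
            // call only needs a small amount of keystream at a time.
            Cipher multi = Cipher.getInstance("ChaCha20");
            multi.init(Cipher.ENCRYPT_MODE, keySpec, params);
            int chunk = 64;
            byte[] ct2 = new byte[data.length];
            int off = 0;
            for (int i = 0; i < data.length; i += chunk) {
                off += multi.update(data, i, chunk, ct2, off);
            }
            multi.doFinal(ct2, off);

            System.out.println("ciphertexts match: " + Arrays.equals(ct1, ct2));
        }
    }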

To be fair, I think the single-part jobs are more representative of what we would see in JSSE, but TLS application data job sizes are probably all over the map depending on what is being sent.

ChaCha20-Poly1305 - There AVX512 is slower than AVX2 across the board, and I am not sure why the throughput gains we see in plain ChaCha20 do not show up here even at the larger sizes.  There's a lot more work being done outside the cc20 intrinsic, especially without the pending AVX512 Poly1305 intrinsic, but I would've expected to see a crossover point at one of those benchmark sizes.
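One way to think about that dilution, with purely made-up numbers rather than measurements: if only some fraction of the total ChaCha20-Poly1305 time is spent inside the cc20 intrinsic, an improvement there shrinks accordingly in the overall result.

    // Amdahl-style illustration with hypothetical numbers: if only `fraction`
    // of the total AEAD time is inside the cc20 intrinsic, a speedup there
    // yields a much smaller overall gain.
    public class IntrinsicAmdahl {
        static double overallSpeedup(double fraction, double intrinsicSpeedup) {
            return 1.0 / ((1.0 - fraction) + fraction / intrinsicSpeedup);
        }

        public static void main(String[] args) {
            // Hypothetical: 40% of time in the cc20 intrinsic, 2x faster
            // keystream generation -> only about 1.25x overall.
            System.out.printf("overall speedup: %.2fx%n", overallSpeedup(0.40, 2.0));
        }
    }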

-------------

PR: https://git.openjdk.org/jdk/pull/7702

