RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64

Mon Feb 3 10:58:48 UTC 2025

On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

> This enhancement changes the ChaCha20 block function intrinsic on aarch64, moving from the block-parallel implementation to the quarter-round parallel implementation already used on x86_64. Assembly language profiling yielded an 11% improvement in throughput. When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement.

This looks very nice, and I'm tempted to just approve it as it is. My only concern is that the algorithm changes aren't really explained. I guess what you have done here is the _128-Bit Vectorization_ described in `https://eprint.iacr.org/2013/759.pdf`. Is that right?
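
For anyone following along, here is a rough sketch of what I understand "quarter-round parallel" to mean. It is plain Java, not the intrinsic from the PR. The class `ChaChaSketch` and its helper names are made up for illustration, and `int[4]` arrays stand in for 128-bit vector registers. Each row of the 4x4 state sits in one "register", so a single vector quarter round updates all four columns at once. Rotating the lanes of rows b, c and d then lines up the diagonals for the second half of the double round. The quarter round itself follows RFC 8439, section 2.1.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the code in the PR.
final class ChaChaSketch {
    // Scalar quarter round on state words s[a], s[b], s[c], s[d] (RFC 8439, 2.1).
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Reference double round: four column rounds, then four diagonal rounds.
    static void scalarDoubleRound(int[] s) {
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    // "Vector" quarter round: lane i of rows a, b, c, d holds one quarter
    // round's inputs. Lanes are independent, so a per-lane loop gives the
    // same result as four-lane SIMD instructions would.
    static void vectorQuarterRound(int[] a, int[] b, int[] c, int[] d) {
        for (int i = 0; i < 4; i++) {
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 16);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 12);
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 8);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 7);
        }
    }

    // Rotate the four lanes of a row left by n positions.
    static int[] rotateLanes(int[] row, int n) {
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) {
            out[i] = row[(i + n) % 4];
        }
        return out;
    }

    // Double round with one row per "register".
    static void rowParallelDoubleRound(int[] s) {
        int[] a = Arrays.copyOfRange(s, 0, 4);
        int[] b = Arrays.copyOfRange(s, 4, 8);
        int[] c = Arrays.copyOfRange(s, 8, 12);
        int[] d = Arrays.copyOfRange(s, 12, 16);

        vectorQuarterRound(a, b, c, d);   // all four columns at once

        // Line up the diagonals: lane 0 becomes (s0, s5, s10, s15), and so on.
        b = rotateLanes(b, 1);
        c = rotateLanes(c, 2);
        d = rotateLanes(d, 3);
        vectorQuarterRound(a, b, c, d);   // all four diagonals at once

        // Undo the lane rotation so the state is back in row order.
        b = rotateLanes(b, 3);
        c = rotateLanes(c, 2);
        d = rotateLanes(d, 1);

        System.arraycopy(a, 0, s, 0, 4);
        System.arraycopy(b, 0, s, 4, 4);
        System.arraycopy(c, 0, s, 8, 4);
        System.arraycopy(d, 0, s, 12, 4);
    }

    public static void main(String[] args) {
        int[] state = new int[16];
        for (int i = 0; i < 16; i++) {
            state[i] = 0x9E3779B9 * (i + 1);  // arbitrary test pattern
        }
        int[] x = state.clone();
        int[] y = state.clone();
        scalarDoubleRound(x);
        rowParallelDoubleRound(y);
        System.out.println(Arrays.equals(x, y));
    }
}
```

By contrast, a block-parallel layout (as I understand it) keeps the same state word from several independent blocks in the lanes of one register, so it needs no lane rotation but occupies many more registers. If the patch follows the row layout above, a short comment to that effect in the stub generator would address my concern.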

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2630610061