RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64

Jamil Nimeh jnimeh at openjdk.org
Mon Feb 3 16:16:46 UTC 2025


On Mon, 3 Feb 2025 10:56:28 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> This enhancement changes the ChaCha20 block function intrinsic on aarch64, moving away from the block-parallel implementation to the quarter-round parallel implementation that was done on x86_64.  Assembly language profiling yielded an 11% improvement in throughput.  When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement.
>
> This looks very nice, and I'm tempted to just approve it as it is. My only concern is that the algorithm changes aren't really explained, but I guess what you have done here is the _128-Bit Vectorization_ in `https://eprint.iacr.org/2013/759.pdf`. Is that right?

Hi @theRealAph, thanks for taking a look at the changes.

Actually, I hadn't read that paper, believe it or not, though it certainly looks like what I'm doing.  When I was prototyping this a few years ago in x86_64 assembly, it occurred to me that loading the state directly into 4 consecutive vectors lines everything up in columnar order, so the first set of quarter rounds in the double round can be done right off the bat (which must have been intentional in the design of the cipher).  Then all that remains is to rotate the second/third/fourth vectors leftward by 1/2/3 lanes and do the quarter rounds again.  Longer vectors on AVX2 and AVX-512, or more vectors as on aarch64, let me process more blocks at a time.
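Roughly, the layout looks like this in C with NEON intrinsics -- just an illustrative sketch of the column/diagonal trick for a single block, with names of my own choosing; the actual stub is written with the macro assembler:

#include <arm_neon.h>
#include <stdint.h>

/* Rotate each 32-bit lane left by a constant amount. */
#define ROTL32(v, n) \
    vorrq_u32(vshlq_n_u32((v), (n)), vshrq_n_u32((v), 32 - (n)))

/* One ChaCha20 quarter round applied to all four columns at once. */
static inline void quarter_round(uint32x4_t *a, uint32x4_t *b,
                                 uint32x4_t *c, uint32x4_t *d) {
    *a = vaddq_u32(*a, *b); *d = veorq_u32(*d, *a); *d = ROTL32(*d, 16);
    *c = vaddq_u32(*c, *d); *b = veorq_u32(*b, *c); *b = ROTL32(*b, 12);
    *a = vaddq_u32(*a, *b); *d = veorq_u32(*d, *a); *d = ROTL32(*d, 8);
    *c = vaddq_u32(*c, *d); *b = veorq_u32(*b, *c); *b = ROTL32(*b, 7);
}

/* One double round over the state held as four row vectors a..d. */
static void double_round(uint32x4_t *a, uint32x4_t *b,
                         uint32x4_t *c, uint32x4_t *d) {
    quarter_round(a, b, c, d);     /* column round, no shuffling needed */
    *b = vextq_u32(*b, *b, 1);     /* diagonalize: rotate rows leftward */
    *c = vextq_u32(*c, *c, 2);     /* by 1/2/3 lanes so the same        */
    *d = vextq_u32(*d, *d, 3);     /* quarter round hits the diagonals  */
    quarter_round(a, b, c, d);
    *b = vextq_u32(*b, *b, 3);     /* rotate back to columnar layout */
    *c = vextq_u32(*c, *c, 2);
    *d = vextq_u32(*d, *d, 1);
}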

It was long enough ago that I don't recall exactly why, but when I initially did the quarter-round and block-parallel versions in aarch64 assembly, they seemed pretty comparable in speed.  At the time, I interpreted that to mean that the gains in block-parallel from not needing lane rotations were offset by having to gather/scatter each 32-bit state value from different vectors at 64-byte offsets.  When I went back to look at these two approaches, I think I was able to tweak the quarter-round parallel version to make its loads and stores a little friendlier (one ld1 and a handful of register-to-register movs vs. 4 ld4r ops, and 4 st1 ops vs. 16 st4 ops).
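In intrinsics-flavored C, the load contrast looks roughly like this (again just a sketch for illustration; the stub emits the instructions directly, and vld1q_u32_x4 needs a reasonably recent compiler):

#include <arm_neon.h>
#include <stdint.h>

/* Quarter-round parallel: the whole 16-word state arrives in one
 * contiguous load (a single ld1 {v0.4s-v3.4s}); state copies for
 * additional blocks are then plain register-to-register moves. */
uint32x4x4_t load_rows(const uint32_t state[16]) {
    return vld1q_u32_x4(state);
}

/* Block parallel: every state word is broadcast across its own
 * vector (one lane per block), so covering the state takes four
 * ld4r loads producing 16 vectors, and writing the results back
 * out takes interleaving st4 stores. */
void load_broadcast(const uint32_t state[16], uint32x4x4_t out[4]) {
    out[0] = vld4q_dup_u32(state);      /* words 0..3, each splatted */
    out[1] = vld4q_dup_u32(state + 4);  /* words 4..7   */
    out[2] = vld4q_dup_u32(state + 8);  /* words 8..11  */
    out[3] = vld4q_dup_u32(state + 12); /* words 12..15 */
}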

In terms of explaining the algorithm changes, I could add some comment text to the header of the stub function that better explains the general idea behind what is being done.  It would certainly help anyone maintaining it down the line (myself included).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2631451246

