RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v6]

Mon Apr 7 12:41:57 UTC 2025

On Sun, 23 Mar 2025 17:00:43 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
> 
>  - Merged master.
>  - Fixed bad assertion.
>  - Fixed mismerge.
>  - Merged master.
>  - A little cleanup
>  - Merged master
>  - removing trailing spaces
>  - kyber aarch64 intrinsics

@ferakocz Thanks for another very good piece of work which appears to me to be functioning correctly and performantly.

The PR suffers from the same problems as the original ML_DSA one i.e. The mapping of data to registers and the overall structure of the generated code and its relation to the related Java code/the original algorithms will be hard for a maintainer to identify. I have reworked your patch to use vector sequences in this [draft PR](https://github.com/openjdk/jdk/pull/24419) in very much the same way as was done for the ML_DSA PR. This has significantly abstracted and clarified the register mappings that are in use in each kyber generator and has also made the higher level structure of the generated code much easier to follow.

Note that my rework of the generation routines was applied to your original PR after rebasing it on master. Before updating the kyber routines I also generalized a few of the VSeq<N> methods that benefit from being shared by both kyber and dilithium, most notably the montmul routines, and I added a few extra helpers.

The reworked version passes the ML_KEM functional test and gives similar performance improvements for the ML_KEM micro benchmark. The generated code does differ in a few places from what your original patch generates but only superficially - most notable is that a few loads/stores that rely on continued post-increments in the original instead use a constant offset or an add/load pair in the reworked code. This makes a very minor difference to code size and does not seem to affect performance.

I would like you to rework your PR to incorporate these changes because I believe it will make a big difference to maintainability.  n.b. it may be easier to integrate my changes by diffing your branch and mine and applying the resulting change set rather than trying to merge the changes. Please let me know if you have problems with the integration and need help.

I still have some further review comments and would also like to see more commenting to explain what the code is doing. However, I think it will be easier to do that after this rework has been integrated into your PR.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23663#pullrequestreview-2746672860