RFR: 8350589: Investigate cleaner implementation of AArch64 ML-DSA intrinsic introduced in JDK-8348561

Thu Mar 13 12:36:57 UTC 2025

On Thu, 13 Mar 2025 12:32:29 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> @ferakocz I have modified your generator code to employ vector sequences and auxiliaries that handle iterative loads, stores and math/logic operations over vector sequences. It would be useful to have a review of the code from you and also for you to test it (see comments below re testing)
>> The rewrite has allowed much of the generator logic to be condensed into calls to simple auxiliaries which provides a better mid-level view of how the code is structured. It has also clarified the register use. I think this will be a lot easier for maintainers to understand.
>> A few further comments:
>> 1. I have added some asserts to the montmul operations to ensure that input and output register sequences are either disjoint or overlapping. There may be further opportunities to add asserts in a follow-up.
>> 2. One thing I noted (commented on in code) after switching to passing vector sequences rather than relying on fixed mappings is that some reloading of q and qinv inside loops is unnecessary as the code in the loop does not write the relevant vectors. I left the code as is so that I could check that the generated code is identical to the original but I will move the relevant load outside the loop before pushing.
>> 3. I compared before and after dissasemblies of the generated code and it is unchanged modulo routine `dilithiumDecomposePoly`. For that intrinsic your generator code wrote successive, intermediate results into the next unused set of 4 vectors, which are in most cases used subsequently to hold a non-temporary result needed by a later computation. My code always writes intermediate results into the last set of 4 vectors (declared as `VSeq<4> vtmp(20)`). As a result my generated code has the same structure but a slightly different register mapping to yours. I don't believe this affects performance but the change do make it clearer how the computed values are being used.
>> 4. As well as comparing disassemblies for the generated code I verified the patch by running test `jdk/sun/security/provider/acvp/ML_DSA_Test.java`. However, I noted a problem with relying on the test as currently implemented since it did not appear to capture some errors in my code. I re-ran the test under the debugger and found that only one of the intrinsics was being exercised (dilithiumAlmostNtt). I confirmed this by adding -XX:+PrintCompilation to the test command line. It seems that all the calls to other intrinsic candidates occurred from the interpreter and did not run often eno...
>
>> @ferakocz I have modified your generator code to employ vector sequences and auxiliaries that handle iterative loads, stores and math/logic operations over vector sequences. It would be useful to have a review of the code from you and also for you to test it (see comments below re testing) The rewrite has allowed much of the generator logic to be condensed into calls to simple auxiliaries which provides a better mid-level view of how the code is structured. It has also clarified the register use. I think this will be a lot easier for maintainers to understand. A few further comments:
>> 
>> 1. I have added some asserts to the montmul operations to ensure that input and output register sequences are either disjoint or overlapping. There may be further opportunities to add asserts in a follow-up.
>> 2. One thing I noted (commented on in code) after switching to passing vector sequences rather than relying on fixed mappings is that some reloading of q and qinv inside loops is unnecessary as the code in the loop does not write the relevant vectors. I left the code as is so that I could check that the generated code is identical to the original but I will move the relevant load outside the loop before pushing.
>> 3. I compared before and after dissasemblies of the generated code and it is unchanged modulo routine `dilithiumDecomposePoly`. For that intrinsic your generator code wrote successive, intermediate results into the next unused set of 4 vectors, which are in most cases used subsequently to hold a non-temporary result needed by a later computation. My code always writes intermediate results into the last set of 4 vectors (declared as `VSeq<4> vtmp(20)`). As a result my generated code has the same structure but a slightly different register mapping to yours. I don't believe this affects performance but the change do make it clearer how the computed values are being used.
>> 4. As well as comparing disassemblies for the generated code I verified the patch by running test `jdk/sun/security/provider/acvp/ML_DSA_Test.java`. However, I noted a problem with relying on the test as currently implemented since it did not appear to capture some errors in my code. I re-ran the test under the debugger and found that only one of the intrinsics was being exercised (dilithiumAlmostNtt). I confirmed this by adding -XX:+PrintCompilation to the test command line. It seems that all the calls to other intrinsic candidates occurred from the interpreter and did not run often enou...

> @ferakocz I see that the test problem is being addressed as part of the x86 ML_DSA PR.

Oh, so you have already found that :-) .

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24026#issuecomment-2721105194