RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

Mon Feb 24 16:21:59 UTC 2025

On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> Hi. Here is the test result of our CI.
>> 
>> ### copyright year
>> 
>> the following files should update the copyright year to 2025.
>> 
>> 
>> src/hotspot/cpu/aarch64/assembler_aarch64.hpp
>> src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp
>> src/hotspot/share/runtime/globals.hpp
>> src/java.base/share/classes/sun/security/provider/ML_DSA.java
>> src/java.base/share/classes/sun/security/provider/SHA3Parallel.java
>> test/micro/org/openjdk/bench/java/security/MLDSA.java
>> 
>> 
>> ### cross-build failure
>> 
>> Cross build for riscv64/s390/ppc64 failed.
>> 
>> Here shows the error msg for ppc64
>> 
>> 
>> === Output from failing command(s) repeated here ===
>> * For target support_interim-jmods_support__create_java.base.jmod_exec:
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769
>> #  assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0
>> #
>> # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3)
>> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
>> # Problematic frame:
>> # V  [libjvm.so+0x3b391c]  Instruction_aarch64::~Instruction_aarch64()+0xbc
>> #
>> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/
>> #
>> # An error report file with more information is saved as:
>> # /tmp/jdk-src/make/hs_err_pid72752.log
>>    ... (rest of output omitted)
>> 
>> * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs.
>> === End of repeated output ===
>> 
>> 
>> I suppose we should make the similar update at file `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` to other platforms
>
> @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build.
> Was this a build attempted on an aarch64 for the other architectures?

@ferakocz I have indicated a few places where I think you should add comments to clarify the relationship to the original Java code or just clarify what data is being used. I think the code is ok to go in as it is but I would really like to investigate a better structuring of the generator code. This can be done as a follow-up rather than delay getting this version committed.

There are two things I still see as problematic with the current code.

1) There are lots of places in your auxiliary generator methods and also in their client methods where you generate distinct sequences of calls to the assembler sharing essentially the same code shape i.e. the same instructions but with different vector register arguments. For example, in `dilithium_montmul32` you generate the multiply sequence to montgomery multiply 4x4s registers in v0..v3 by 4x4s registers in v16..v19 and then repeat exactly the same code in exactly the same sequence to multiply the 4x4s registers in v4..v7 by 4x4s registers in v20..v23.

Likewise, `dilithium_sub_add_montmul16` generates that same shape code but uses the montmul sequence with odd registers v1..v7 paired against the compact sequence v16..19. As another example, you generate various 4 or 8 long sequences of subv and addv operations at various points, including in some of the top level methods.

I appreciate that you have folded one of the montmult cases into the other by adding the `bool by_constant` parameter to `dilithium_montmul32`. However, I think it would be worth investigating an alternative that would allow more use more, systematic use of auxiliary methods.

2) Your current auxiliary generator methods rely on a fixed mapping of input, output and scratch registers to specific registers. This is part of why the reason why you cannot always call your auxiliaries (or smaller pieces of them) from other locations where the same code shape is generated -- the input and output mappings of data to registers expected by the auxiliary do not match the register sequences in which the relevant data are (transiently) located.

This same fact also means that the repeated code sections heavily depend on naming exactly the right register on each generator line. That makes it harder for a maintainer to recognize how, essentially, what is really just one common, abstract operation is, at each different occurrence, consuming, combining and updating several input sequences of related registers to generate one or more output sequences. That also means that it would be very easy to introduce an error if the code ever needed to be changed.

I would like to investigate an alternative approach where your auxiliary generator methods and their callers pass arguments that identify the vector register sequences to be consumed as inputs, used as temporaries and written as outputs. In cases where the routines operate on sequences of 4 or 8 successive vectors then, at the very least, that would involve specifying the first register for each input, temporary or output e.g. for the montmult32 multiply v0+ by v16+ using v24+ as temporaries and v30+ as constants and output the results to v16+. However, that leaves it implicit that the first two inputs involve 8 registers while the temporaries involves 4 and the constants 2. The more general requirement is not just to specify the vector sequence length (2, 4 or 8) but also allow the default stride of one (e.g. v0, v1, ...) to be varied to allow for skip sequences (e.g. v0, v2, ...) or constant sequences (v28, v28, ... as would be needed for multiply constant).

I have prototyped a simple vector sequence type `VRSeq<N>` that models an indexable sequence of FloatRegisters and allows many of your higher level routines to simply declare register sets they operate on and then pass them as arguments to a range of simply auxiliary generator functions that can be used in many places where you currently have a lot of inline calls to the assembler -- see attachment:
[vseq.zip](https://github.com/user-attachments/files/18946470/vseq.zip)

I'll raise a JIRA to cover recoding the current implementation using this type and post a follow-up PR that uses it to see how far it helps simplify the code. I believe it will make it easier for maintainers to understand the structure of the generated code and observe/verify the use of registers to store specific values. It should also allow assertions about the use of registers to be added to the code to ensure that values are not being overwritten (expect in circumstances where that is legitimate).

Meanwhile I'll approve this PR modulo the commenting I suggested.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2678977770