RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

Fri Feb 21 10:14:00 UTC 2025

On Tue, 18 Feb 2025 13:43:18 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Adding comments + some code reorganization
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4066:
> 
>> 4064:   }
>> 4065: 
>> 4066:   // Execute on round of keccak of two computations in parallel.
> 
> Suggestion:
> 
> It would be helpful to add comments that relate the register and instruction selection to the original Java source code. e.g. change the header as follows
> 
>     // Performs 2 keccak round transformations using vector parallelism
>     // 
>     // Two sets of 25 * 64-bit input states a0[lo:hi]...a24[lo:hi] are passed in
>     // the lower/upper halves of registers v0...v24 and the transformed states
>     // are returned in the same registers. Intermediate 64-bit pairs
>     // c0...c5 and d0...d5 are computed in registers v25...v30. v31 is
>     // loaded with the required pair of 64 bit rounding constants.
>     // During computation of the output states some intermediate results are
>     // shuffled around registers v0...v30. Comments on each line indicate
>     // how the values in registers correspond to variables ai, ci, di in
>     // the Java source code, likewise how the generated machine instructions
>     // correspond to Java source operations (n.b. rol means rotate left).
> 
> The annotate the generation steps as follows:
> 
>     __ eor3(v29, __ T16B, v4, v9, v14);       // c4 = a4 ^ a9 ^ a14
>     __ eor3(v26, __ T16B, v1, v6, v11);       // c1 = a1 ^ a16 ^ a11
>     __ eor3(v28, __ T16B, v3, v8, v13);       // c3 = a3 ^ a8 ^a13
>     __ eor3(v25, __ T16B, v0, v5, v10);       // c0 = a0 ^ a5 ^ a10
>     __ eor3(v27, __ T16B, v2, v7, v12);       // c2 = a2 ^ a7 ^ a12
>     __ eor3(v29, __ T16B, v29, v19, v24);     // c4 ^= a19 ^ a24
>     __ eor3(v26, __ T16B, v26, v16, v21);     // c1 ^= a16 ^ a21
>     __ eor3(v28, __ T16B, v28, v18, v23);     // c3 ^= a18 ^ a23
>     __ eor3(v25, __ T16B, v25, v15, v20);     // c0 ^= a15 ^ a20
>     __ eor3(v27, __ T16B, v27, v17, v22);     // c2 ^= a17 ^ a22
> 
>     __ rax1(v30, __ T2D, v29, v26);           // d0 = c4 ^ rol(c1, 1)
>     __ rax1(v26, __ T2D, v26, v28);           // d2 = c1 ^ rol(c3, 1)
>     __ rax1(v28, __ T2D, v28, v25);           // d4 = c3 ^ rol(c0, 1)
>     __ rax1(v25, __ T2D, v25, v27);           // d1 = c0 ^ rol(c2, 1)
>     __ rax1(v27, __ T2D, v27, v29);           // d3 = c2 ^ rol(c4, 1)
> 
>     __ eor(v0, __ T16B, v0, v30);             // a0 = a0 ^ d0
>     __ xar(v29, __ T2D, v1,  v25, (64 - 1));  // a10' = rol((a1^d1), 1)
>     __ xar(v1,  __ T2D, v6,  v25, (64 - 44)); // a1 = rol(a6^d1), 44)
>     __ xar(v6,  __ T2D, v9,  v28, (64 - 20)); // a6 = rol((a9^d4), 20)
>     __ xar(v...

Although this piece of code is not new, and I don't really think that this level of commenting is necessary, especially in code that is very unlikely to change, I added the comments.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1965220606