RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v3]

Mon Nov 24 20:53:01 UTC 2025

On Mon, 24 Nov 2025 15:35:12 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   next set of comments
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 88:
> 
>> 86: // +-----+-----+-----+-----+-----
>> 87: //
>> 88: // NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of
> 
> Typo: respectivelly -> respectively

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 248:
> 
>> 246: // We do Montgomery multiplications of two AVX registers in 4 steps:
>> 247: // 1. Do the multiplications of the corresponding even numbered slots into
>> 248: //    the odd numbered slots of a scratch2 register.
> 
> Typo:  scratch2 -> scratch

I think I meant "the scratch2" register here.. reworded, please double check if its clearer..

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 250:
> 
>> 248: //    the odd numbered slots of a scratch2 register.
>> 249: // 2. Swap the even and odd numbered slots of the original input registers.*
>> 250: // 3. Similar to step 1, but into output register.
> 
> Typo: into output register -> into an output register

used 'the' to be 'specific'.. (I think the lack of articles was causing the confusion.. "the scratch2 register is combined with the output register into scratch.. or something..) Also reworded step 4?

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 253:
> 
>> 251: // 4. Combine the outputs of step 1 and step 3 into the output of the Montgomery
>> 252: //    multiplication.
>> 253: // (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap
> 
> Typo: unnecessary '(*' at the beginning

This was my attempt to add a note to second step.. spelled out "Note"? or can just remove, since swapping only happens on second step..

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 282:
> 
>> 280:     const XMMRegister* scratch = scratch1 == input1 ? output: scratch1;
>> 281: 
>> 282:     // scratch = input1_even*intput2_even
> 
> Suggestion:  // scratch = input1_even * intput2_even

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 479:
> 
>> 477:     // level 0 - 128
>> 478:     // scratch1 = coeffs3 * zetas1
>> 479:     // coeffs3, coeffs1 = coeffs1±scratch1
> 
> Suggestion:     // coeffs3, coeffs1 = coeffs1 ± scratch1

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 524:
> 
>> 522:       // coeffs1_2 = coeffs1_2 + scratch1
>> 523:       loadXmms(Zetas3, zetas, level * 512, vector_len, _masm);
>> 524:       shuffle(Scratch1, Coeffs1_2, Coeffs2_2, distance * 32); //Coeffs2_2 freed
> 
> Suggestion:     // Coeffs2_2 freed

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 529:
> 
>> 527: 
>> 528:       loadXmms(Zetas3, zetas, 4*64 + level * 512, vector_len, _masm);
>> 529:       shuffle(Scratch1, Coeffs3_2, Coeffs4_2, distance * 32); //Coeffs4_2 freed
> 
> Suggestion: // Coeffs4_2 freed

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554:
> 
>> 552:     const XMMRegister Coeffs2_2[] = {xmm4, xmm5, xmm6, xmm7};
>> 553: 
>> 554:     // Since we cannot fit the entire payload into registers, we process
> 
> process input -> process the input

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 555:
> 
>> 553: 
>> 554:     // Since we cannot fit the entire payload into registers, we process
>> 555:     // input in two stages. First half, load 8 registers 32 integers each apart.
> 
> First half -> For the first half

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 557:
> 
>> 555:     // input in two stages. First half, load 8 registers 32 integers each apart.
>> 556:     // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
>> 557:     // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
> 
> Remaining -> For the remaining

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 558:
> 
>> 556:     // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
>> 557:     // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
>> 558:     // 2-, 1-integer appart)
> 
> appart -> apart

Thanks! Looks like I've always misspelled that word! :)

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 559:
> 
>> 557:     // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
>> 558:     // 2-, 1-integer appart)
>> 559:     // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
> 
> appart -> apart

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 560:
> 
>> 558:     // 2-, 1-integer appart)
>> 559:     // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
>> 560:     // Other levels, shuffles can be done by re-aranging register order
> 
> Other -> on the other
> re-aranging register order -> rearranging the register order

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 562:
> 
>> 560:     // Other levels, shuffles can be done by re-aranging register order
>> 561: 
>> 562:     // Four batches of 8 registers each, 128 bytes appart
> 
> appart -> apart

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 701:
> 
>> 699:   // In each of these iterations half of the coefficients are added to and
>> 700:   // subtracted from the other half of the coefficients then the result of
>> 701:   // the substration is (Montgomery) multiplied by the corresponding zetas.
> 
> substration -> subtraction (I know this was in my own comment :-( )

done (funny, thats exactly how I say "substraction" in my head too :D )

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 850:
> 
>> 848:     }
>> 849: 
>> 850:     // Four batches of 8 registers each, 128 bytes appart
> 
> appart -> apart

done

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557555908
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557577559
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557589525
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557582314
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557592866
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557595337
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557596482
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557596698
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557599194
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557606458
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557606672
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557608631
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557611103
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557611341
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557616181
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557620647
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2557621206