RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v3]

Mon Nov 24 16:42:40 UTC 2025

On Thu, 20 Nov 2025 22:55:07 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:

>> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline 
>>    - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>>    - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed https://github.com/vpaprotsk/jdk/pull/7)
>> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version 
>>   - `SignatureBench.MLDSA` is upto 5% faster, never slower
>> 
>> Note on intrinsic:
>> - The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
>> - Code was refactored to allow reuse of same assembler (as possible) for AVX512 and AVX2
>> 
>> Tests and benchmarks:
>> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
>> - Added benchmark to measure the performance of intrinsic itself
>> 
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
> 
>   next set of comments

Good work! I just found a few typos in the comments.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 88:

> 86: // +-----+-----+-----+-----+-----
> 87: //
> 88: // NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of

Typo: respectivelly -> respectively

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 248:

> 246: // We do Montgomery multiplications of two AVX registers in 4 steps:
> 247: // 1. Do the multiplications of the corresponding even numbered slots into
> 248: //    the odd numbered slots of a scratch2 register.

Typo:  scratch2 -> scratch

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 249:

> 247: // 1. Do the multiplications of the corresponding even numbered slots into
> 248: //    the odd numbered slots of a scratch2 register.
> 249: // 2. Swap the even and odd numbered slots of the original input registers.*

Typo: unnecessary '*' at the end

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 250:

> 248: //    the odd numbered slots of a scratch2 register.
> 249: // 2. Swap the even and odd numbered slots of the original input registers.*
> 250: // 3. Similar to step 1, but into output register.

Typo: into output register -> into an output register

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 253:

> 251: // 4. Combine the outputs of step 1 and step 3 into the output of the Montgomery
> 252: //    multiplication.
> 253: // (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap

Typo: unnecessary '(*' at the beginning

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 282:

> 280:     const XMMRegister* scratch = scratch1 == input1 ? output: scratch1;
> 281: 
> 282:     // scratch = input1_even*intput2_even

Suggestion:  // scratch = input1_even * intput2_even

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 479:

> 477:     // level 0 - 128
> 478:     // scratch1 = coeffs3 * zetas1
> 479:     // coeffs3, coeffs1 = coeffs1±scratch1

Suggestion:     // coeffs3, coeffs1 = coeffs1 ± scratch1

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 524:

> 522:       // coeffs1_2 = coeffs1_2 + scratch1
> 523:       loadXmms(Zetas3, zetas, level * 512, vector_len, _masm);
> 524:       shuffle(Scratch1, Coeffs1_2, Coeffs2_2, distance * 32); //Coeffs2_2 freed

Suggestion:     // Coeffs2_2 freed

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 529:

> 527: 
> 528:       loadXmms(Zetas3, zetas, 4*64 + level * 512, vector_len, _masm);
> 529:       shuffle(Scratch1, Coeffs3_2, Coeffs4_2, distance * 32); //Coeffs4_2 freed

Suggestion: // Coeffs4_2 freed

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554:

> 552:     const XMMRegister Coeffs2_2[] = {xmm4, xmm5, xmm6, xmm7};
> 553: 
> 554:     // Since we cannot fit the entire payload into registers, we process

process input -> process the input

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 555:

> 553: 
> 554:     // Since we cannot fit the entire payload into registers, we process
> 555:     // input in two stages. First half, load 8 registers 32 integers each apart.

First half -> For the first half

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 557:

> 555:     // input in two stages. First half, load 8 registers 32 integers each apart.
> 556:     // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
> 557:     // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,

Remaining -> For the remaining

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 558:

> 556:     // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
> 557:     // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
> 558:     // 2-, 1-integer appart)

appart -> apart

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 559:

> 557:     // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
> 558:     // 2-, 1-integer appart)
> 559:     // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers

appart -> apart

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 560:

> 558:     // 2-, 1-integer appart)
> 559:     // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
> 560:     // Other levels, shuffles can be done by re-aranging register order

Other -> on the other
re-aranging register order -> rearranging the register order

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 562:

> 560:     // Other levels, shuffles can be done by re-aranging register order
> 561: 
> 562:     // Four batches of 8 registers each, 128 bytes appart

appart -> apart

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 701:

> 699:   // In each of these iterations half of the coefficients are added to and
> 700:   // subtracted from the other half of the coefficients then the result of
> 701:   // the substration is (Montgomery) multiplied by the corresponding zetas.

substration -> subtraction (I know this was in my own comment :-( )

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 850:

> 848:     }
> 849: 
> 850:     // Four batches of 8 registers each, 128 bytes appart

appart -> apart

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28136#issuecomment-3571728756
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556771999
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556825899
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556836110
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556839540
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556845331
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556853907
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556865521
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556913637
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556915972
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556943987
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556925142
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556945036
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556949814
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556953155
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556942168
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556956323
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556978873
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556961642