RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v3]
Ferenc Rakoczi
duke at openjdk.org
Mon Nov 24 16:42:40 UTC 2025
On Thu, 20 Nov 2025 22:55:07 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:
>> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline
>> - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>> - Note: there are no AVX2-SHA3 intrinsics yet (being reviewed at https://github.com/vpaprotsk/jdk/pull/7)
>> - AVX512 intrinsic improvements are 1.24x-1.5x faster than the current version
>> - `SignatureBench.MLDSA` is up to 5% faster, never slower
>>
>> Note on intrinsic:
>> - The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
>> - Code was refactored to allow reuse of same assembler (as possible) for AVX512 and AVX2
>>
>> Tests and benchmarks:
>> - Added a fuzz test to ensure Java and the intrinsic produce exactly the same result
>> - Added benchmark to measure the performance of intrinsic itself
>>
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
>
> next set of comments
Good work! I just found a few typos in the comments.
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 88:
> 86: // +-----+-----+-----+-----+-----
> 87: //
> 88: // NOTE: size 0 and 1 are used for initial and final shuffles respectivelly of
Typo: respectivelly -> respectively
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 248:
> 246: // We do Montgomery multiplications of two AVX registers in 4 steps:
> 247: // 1. Do the multiplications of the corresponding even numbered slots into
> 248: // the odd numbered slots of a scratch2 register.
Typo: scratch2 -> scratch
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 249:
> 247: // 1. Do the multiplications of the corresponding even numbered slots into
> 248: // the odd numbered slots of a scratch2 register.
> 249: // 2. Swap the even and odd numbered slots of the original input registers.*
Typo: unnecessary '*' at the end
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 250:
> 248: // the odd numbered slots of a scratch2 register.
> 249: // 2. Swap the even and odd numbered slots of the original input registers.*
> 250: // 3. Similar to step 1, but into output register.
Typo: into output register -> into an output register
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 253:
> 251: // 4. Combine the outputs of step 1 and step 3 into the output of the Montgomery
> 252: // multiplication.
> 253: // (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need NOT swap
Typo: unnecessary '(*' at the beginning
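For readers following the four-step comment above, here is a scalar model of it in plain C++. It is only a sketch under stated assumptions: the helper names (montgomery_reduce, mul_even_into_odd, swap_even_odd, montmul_lanes) are hypothetical and are not the names used in the stub generator, a 256-bit register is modelled as an array of eight 32-bit slots, and the constants are the ML-DSA modulus q = 8380417 and q^-1 mod 2^32 = 58728449.

```cpp
#include <array>
#include <cstdint>

constexpr int32_t kQ = 8380417;      // ML-DSA modulus q
constexpr int32_t kQInv = 58728449;  // q^-1 mod 2^32

using Lanes = std::array<int32_t, 8>;  // models one 256-bit register (assumption)

// Returns r with r * 2^32 == a (mod q). The exact division by 2^32 stands in
// for keeping the high 32 bits of the 64-bit value.
inline int32_t montgomery_reduce(int64_t a) {
    const uint32_t m = static_cast<uint32_t>(a) * static_cast<uint32_t>(kQInv);
    const int32_t t = static_cast<int32_t>(m);
    return static_cast<int32_t>((a - static_cast<int64_t>(t) * kQ) / 4294967296LL);
}

// Multiply only the even-numbered slots; each reduced result lands in the
// odd-numbered slot next to it (hypothetical helper).
inline Lanes mul_even_into_odd(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (int i = 0; i < 8; i += 2) {
        r[i + 1] = montgomery_reduce(static_cast<int64_t>(a[i]) * b[i]);
    }
    return r;
}

// Exchange each even-numbered slot with its odd-numbered neighbour.
inline Lanes swap_even_odd(const Lanes& a) {
    Lanes r{};
    for (int i = 0; i < 8; i += 2) {
        r[i] = a[i + 1];
        r[i + 1] = a[i];
    }
    return r;
}

inline Lanes montmul_lanes(const Lanes& x, const Lanes& y) {
    const Lanes scratch = mul_even_into_odd(x, y);   // step 1
    const Lanes xs = swap_even_odd(x);               // step 2
    const Lanes ys = swap_even_odd(y);
    Lanes out = mul_even_into_odd(xs, ys);           // step 3
    for (int i = 0; i < 8; i += 2) {
        out[i] = scratch[i + 1];                     // step 4: combine
    }
    return out;
}
```

Multiplying by 2^32 mod q (4193792) should return the other operand modulo q, which makes a convenient sanity check for a model like this.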
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 282:
> 280: const XMMRegister* scratch = scratch1 == input1 ? output: scratch1;
> 281:
> 282: // scratch = input1_even*intput2_even
Suggestion: // scratch = input1_even * input2_even
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 479:
> 477: // level 0 - 128
> 478: // scratch1 = coeffs3 * zetas1
> 479: // coeffs3, coeffs1 = coeffs1±scratch1
Suggestion: // coeffs3, coeffs1 = coeffs1 ± scratch1
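The butterfly described by this comment (multiply the upper half by a zeta, then form the sum and difference with the lower half) can be sketched in scalar C++ as below. This is an illustration, not the stub's code: ntt_level and its parameters are hypothetical names, the zetas are assumed to be already in Montgomery form with one entry per block of 2 * len coefficients, and their ordering is left to the caller.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int32_t kQ = 8380417;      // ML-DSA modulus q
constexpr int32_t kQInv = 58728449;  // q^-1 mod 2^32
constexpr int kN = 256;              // coefficients per polynomial

// Returns r with r * 2^32 == a (mod q).
inline int32_t montgomery_reduce(int64_t a) {
    const uint32_t m = static_cast<uint32_t>(a) * static_cast<uint32_t>(kQInv);
    const int32_t t = static_cast<int32_t>(m);
    return static_cast<int32_t>((a - static_cast<int64_t>(t) * kQ) / 4294967296LL);
}

// One forward NTT level. `len` is the distance between butterfly partners
// (128 at level 0); `zetas` holds one twiddle per block of 2 * len entries.
inline void ntt_level(std::array<int32_t, kN>& coeffs, int len,
                      const std::vector<int32_t>& zetas) {
    std::size_t k = 0;
    for (int start = 0; start < kN; start += 2 * len) {
        const int32_t zeta = zetas[k++];
        for (int j = start; j < start + len; ++j) {
            // scratch = upper * zeta
            const int32_t scratch =
                montgomery_reduce(static_cast<int64_t>(zeta) * coeffs[j + len]);
            // upper, lower = lower -/+ scratch
            coeffs[j + len] = coeffs[j] - scratch;
            coeffs[j] = coeffs[j] + scratch;
        }
    }
}
```

The results are left unreduced, as in the usual lazy-reduction style, so a caller doing all eight levels would need to keep an eye on coefficient growth.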
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 524:
> 522: // coeffs1_2 = coeffs1_2 + scratch1
> 523: loadXmms(Zetas3, zetas, level * 512, vector_len, _masm);
> 524: shuffle(Scratch1, Coeffs1_2, Coeffs2_2, distance * 32); //Coeffs2_2 freed
Suggestion: // Coeffs2_2 freed
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 529:
> 527:
> 528: loadXmms(Zetas3, zetas, 4*64 + level * 512, vector_len, _masm);
> 529: shuffle(Scratch1, Coeffs3_2, Coeffs4_2, distance * 32); //Coeffs4_2 freed
Suggestion: // Coeffs4_2 freed
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554:
> 552: const XMMRegister Coeffs2_2[] = {xmm4, xmm5, xmm6, xmm7};
> 553:
> 554: // Since we cannot fit the entire payload into registers, we process
process input -> process the input
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 555:
> 553:
> 554: // Since we cannot fit the entire payload into registers, we process
> 555: // input in two stages. First half, load 8 registers 32 integers each apart.
First half -> For the first half
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 557:
> 555: // input in two stages. First half, load 8 registers 32 integers each apart.
> 556: // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
> 557: // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
Remaining -> For the remaining
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 558:
> 556: // With one load, we can process level 0-2 (128-, 64- and 32-integers apart)
> 557: // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
> 558: // 2-, 1-integer appart)
appart -> apart
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 559:
> 557: // Remaining levels, load 8 registers from consecutive memory (16-, 8-, 4-,
> 558: // 2-, 1-integer appart)
> 559: // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
appart -> apart
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 560:
> 558: // 2-, 1-integer appart)
> 559: // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within registers
> 560: // Other levels, shuffles can be done by re-aranging register order
Other -> On the other
re-aranging register order -> rearranging the register order
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 562:
> 560: // Other levels, shuffles can be done by re-aranging register order
> 561:
> 562: // Four batches of 8 registers each, 128 bytes appart
appart -> apart
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 701:
> 699: // In each of these iterations half of the coefficients are added to and
> 700: // subtracted from the other half of the coefficients then the result of
> 701: // the substration is (Montgomery) multiplied by the corresponding zetas.
substration -> subtraction (I know this was in my own comment :-( )
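To make the quoted description concrete, one level of the inverse transform can be sketched in scalar C++ as follows. Treat it as a model only: inverse_ntt_level is a hypothetical name, the difference is taken here as lower minus upper (the sign convention can equally be folded into the zetas), and the zetas are assumed to be in Montgomery form with one entry per block of 2 * len coefficients.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int32_t kQ = 8380417;      // ML-DSA modulus q
constexpr int32_t kQInv = 58728449;  // q^-1 mod 2^32
constexpr int kN = 256;              // coefficients per polynomial

// Returns r with r * 2^32 == a (mod q).
inline int32_t montgomery_reduce(int64_t a) {
    const uint32_t m = static_cast<uint32_t>(a) * static_cast<uint32_t>(kQInv);
    const int32_t t = static_cast<int32_t>(m);
    return static_cast<int32_t>((a - static_cast<int64_t>(t) * kQ) / 4294967296LL);
}

// One inverse NTT level: add and subtract the two halves, then Montgomery
// multiply only the difference by the corresponding zeta.
inline void inverse_ntt_level(std::array<int32_t, kN>& coeffs, int len,
                              const std::vector<int32_t>& zetas) {
    std::size_t k = 0;
    for (int start = 0; start < kN; start += 2 * len) {
        const int32_t zeta = zetas[k++];
        for (int j = start; j < start + len; ++j) {
            const int32_t lower = coeffs[j];
            const int32_t upper = coeffs[j + len];
            coeffs[j] = lower + upper;
            coeffs[j + len] =
                montgomery_reduce(static_cast<int64_t>(zeta) * (lower - upper));
        }
    }
}
```

Compared with the forward butterfly, the multiplication moves from before the add/subtract to after it, which is why only the subtraction result is multiplied.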
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 850:
> 848: }
> 849:
> 850: // Four batches of 8 registers each, 128 bytes appart
appart -> apart
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28136#issuecomment-3571728756
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556771999
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556825899
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556836110
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556839540
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556845331
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556853907
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556865521
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556913637
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556915972
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556943987
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556925142
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556945036
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556949814
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556953155
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556942168
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556956323
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556978873
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556961642