RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements
Sandhya Viswanathan
sviswanathan at openjdk.org
Fri Nov 14 19:51:14 UTC 2025
On Fri, 14 Nov 2025 17:51:24 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline
>> - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>> - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed https://github.com/vpaprotsk/jdk/pull/7)
>> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version
>> - `SignatureBench.MLDSA` is upto 5% faster, never slower
>>
>> Note on intrinsic:
>> - The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
>> - Code was refactored to allow reuse of same assembler (as possible) for AVX512 and AVX2
>>
>> Tests and benchmarks:
>> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
>> - Added benchmark to measure the performance of intrinsic itself
>>
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 83:
>
>> 81: // size 0 and 1 are used for initial and final shuffles respectivelly of
>> 82: // dilithiumAlmostInverseNtt and dilithiumAlmostNtt.
>> 83: // NOTE: For size 0 and 1, input1[] and input2[] are modified in-place
>
> what is the size-in-bits when size is 0 and 1? What is the difference between size 0 and size1? The overloading of size makes it confusing.
size 0 seems to be doing a different shuffle than what is described in the diagram.
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 137:
>
>> 135: for (int i = 0; i < regCnt; i++) {
>> 136: // 0b-1-2-3-1
>> 137: __ vshufps(output2[i], input1[i], input2[i], 0b11011101, vector_len);
>
> Did you mean this to be //0b-1-3-1-3?
or 3-1-3-1.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2528747938
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2528753295
More information about the security-dev
mailing list