RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v6]
Jatin Bhateja
jbhateja at openjdk.org
Tue Nov 25 03:04:01 UTC 2025
On Mon, 24 Nov 2025 22:01:17 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:
>> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline
>> - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>> - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed https://github.com/vpaprotsk/jdk/pull/7)
>> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version
>> - `SignatureBench.MLDSA` is upto 5% faster, never slower
>>
>> Note on intrinsic:
>> - The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
>> - Code was refactored to allow reuse of same assembler (as possible) for AVX512 and AVX2
>>
>> Tests and benchmarks:
>> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
>> - Added benchmark to measure the performance of intrinsic itself
>>
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
>> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
>
> spelling
Very nice work @vpaprotsk ,
Please also add in comments the links to original reference implimentation.
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 365:
> 363:
> 364: static void loadXmms(const XMMRegister destinationRegs[], Register source, int offset,
> 365: int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
Suggestion:
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 381:
> 379:
> 380: static void storeXmms(Register destination, int offset, const XMMRegister xmmRegs[],
> 381: int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
Suggestion:
int vector_len, MacroAssembler *_masm, int regCnt = -1, int memStep = -1) {
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 659:
> 657: // zetas (int[128*8]) = c_rarg1
> 658: static address generate_dilithiumAlmostInverseNtt_avx(StubGenerator *stubgen,
> 659: int vector_len,MacroAssembler *_masm) {
Fix indentation
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 718:
> 716:
> 717: // Constants for shuffle and montMul64
> 718: __ mov64(scratch, 0b1010101010101010);
64 bit constant suffix
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 901:
> 899: // poly2 (int[256]) = c_rarg2
> 900: static address generate_dilithiumNttMult_avx(StubGenerator *stubgen,
> 901: int vector_len, MacroAssembler *_masm) {
Fix indentation
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 939:
> 937: vector_len, scratch); // 2^64 mod q
> 938: if (vector_len == Assembler::AVX_512bit) {
> 939: __ mov64(scratch, 0b0101010101010101);
Add long constant suffix
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 985:
> 983: // constant (int) = c_rarg1
> 984: static address generate_dilithiumMontMulByConstant_avx(StubGenerator *stubgen,
> 985: int vector_len, MacroAssembler *_masm) {
Fix indentation
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1026:
> 1024: __ evpbroadcastd(constant, rConstant, Assembler::AVX_512bit); // constant multiplier
> 1025:
> 1026: __ mov64(scratch, 0b0101010101010101); //dw-mask
Constant suffix
-------------
PR Review: https://git.openjdk.org/jdk/pull/28136#pullrequestreview-3503056034
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558380867
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558381318
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558385868
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558390552
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558390067
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558370904
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558391135
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558371478
More information about the hotspot-dev
mailing list