RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements
Jatin Bhateja
jbhateja at openjdk.org
Mon Nov 17 07:16:08 UTC 2025
On Tue, 4 Nov 2025 16:38:49 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:
> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline
> - `SignatureBench.MLDSA` is 1.2x-2.2x faster
> - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed https://github.com/vpaprotsk/jdk/pull/7)
> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version
> - `SignatureBench.MLDSA` is upto 5% faster, never slower
>
> Note on intrinsic:
> - The emitted (existing) AVX512 assembler was not "significantly" changed; mostly more efficient instruction selection and tighter register allocation, which allowed removal of NTT loop and stack spill.
> - Code was refactored to allow reuse of same assembler (as possible) for AVX512 and AVX2
>
> Tests and benchmarks:
> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
> - Added benchmark to measure the performance of intrinsic itself
>
> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+UseDilithiumIntrinsics;FORK=1"
> make test TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:-UseDilithiumIntrinsics;FORK=1"
Minor initial comments
src/hotspot/cpu/x86/assembler_x86.cpp line 3867:
> 3865: (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 3866: (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
> 3867: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)
But your assertions are nowhere checking this.
src/hotspot/cpu/x86/assembler_x86.cpp line 3876:
> 3874: (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 3875: (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
> 3876: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)
But your assertions are nowhere checking this.
src/hotspot/cpu/x86/assembler_x86.cpp line 3882:
> 3880:
> 3881: void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
> 3882: assert(VM_Version::supports_evex(), "");
Suggestion:
assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), "");
src/hotspot/cpu/x86/assembler_x86.cpp line 3894:
> 3892:
> 3893: void Assembler::evmovshdup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
> 3894: assert(VM_Version::supports_evex(), "");
Same as above
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 397:
> 395: //
> 396: static address generate_dilithiumAlmostNtt_avx(StubGenerator *stubgen,
> 397: int vector_len, MacroAssembler *_masm) {
Indentation corretness
test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 2:
> 1: /*
> 2: * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.
Suggestion:
* Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.
test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 114:
> 112: rnd.setSeed(seed);
> 113: //Note: it might be useful to increase this number during development of new intrinsics
> 114: final int repeat = 10000000;
Instead of high repetition count can you try tuning the tiered compilation threshold.
test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 145:
> 143: coeffs1[j] = rnd.nextInt();
> 144: coeffs2[j] = rnd.nextInt();
> 145: }
You can uses generators for randome initialization of array
-------------
PR Review: https://git.openjdk.org/jdk/pull/28136#pullrequestreview-3471195396
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532894350
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532894989
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532900199
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532901821
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532910907
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532868326
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532875974
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532872372
More information about the security-dev
mailing list