RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v2]
Volodymyr Paprotski
vpaprotski at openjdk.org
Mon Nov 17 23:35:45 UTC 2025
On Mon, 17 Nov 2025 06:44:39 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Volodymyr Paprotski has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - whitespace
>> - address first comments
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 3867:
>
>> 3865: (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
>> 3866: (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
>> 3867: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
>
> When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31)
>
> But your assertions are nowhere checking this.
I believe those asserts are in `vex_prefix_and_encode` (https://github.com/openjdk/jdk/blob/6d3f7794ee6658d48eb2120c7bfe66ac412c6d14/src/hotspot/cpu/x86/assembler_x86.cpp#L13164) and `vex_prefix` (https://github.com/openjdk/jdk/blob/6d3f7794ee6658d48eb2120c7bfe66ac412c6d14/src/hotspot/cpu/x86/assembler_x86.cpp#L13047)
I also haven't found any other instruction that does this check so I could emulate the style.
> src/hotspot/cpu/x86/assembler_x86.cpp line 3882:
>
>> 3880:
>> 3881: void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
>> 3882: assert(VM_Version::supports_evex(), "");
>
> Suggestion:
>
> assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), "");
Took the patch, but also kept the supports_evex() assert
> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 114:
>
>> 112: rnd.setSeed(seed);
>> 113: //Note: it might be useful to increase this number during development of new intrinsics
>> 114: final int repeat = 10000000;
>
> Instead of high repetition count can you try tuning the tiered compilation threshold.
The purpose of the test is to test various (pseudo-random) values and compare the results to the java implementation of same code. A single run-though of the test doesn't always prove that there are no bugs.
A bit philosophical.. as is well known, when writing crypto, branches (conditional on secret) are disallowed; but e.g. carry propagation has the same 'conditional execution' effect. (Instead of "have you tested every branch direction" its "have you tested every carry") Besides a very careful range/overflow analysis (which I also did.. ntt functions skate very close to the int limit), exhaustive fuzz testing is the best method to find conditions that manual (range/overflow) analysis hasn't found; fuzz testing has very little math built in, so its also good at finding 'blind spots' I (and whomever has to review) might have not thought of..
> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 145:
>
>> 143: coeffs1[j] = rnd.nextInt();
>> 144: coeffs2[j] = rnd.nextInt();
>> 145: }
>
> You can uses generators for randome initialization of array
I think you meant this?
coeffs1 = rnd.ints(ML_DSA_N).toArray();
coeffs2 = rnd.ints(ML_DSA_N).toArray();
Didn't know about this, thanks. It does work..
But the original purpose (perhaps misguided, but its done) was to 'factor out' the allocations; the outer loop runs many million times (I've left it running for 6+hours during development) and so I wanted a 'somewhat efficient' test.
In hindsight, these (1k) arrays could probably be stack allocated, but I did not want to depend on an optimization when I could just write it without allocations in the mainline
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535460279
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535804056
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535373444
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535199249
More information about the security-dev
mailing list