RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v2]

Mon Nov 17 23:35:45 UTC 2025

On Mon, 17 Nov 2025 06:44:39 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - whitespace
>>  - address first comments
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 3867:
> 
>> 3865:             (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
>> 3866:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : false)), "");
>> 3867:   InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
> 
> When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31) 
> 
> But your assertions are nowhere checking this.

I believe those asserts are in `vex_prefix_and_encode` (https://github.com/openjdk/jdk/blob/6d3f7794ee6658d48eb2120c7bfe66ac412c6d14/src/hotspot/cpu/x86/assembler_x86.cpp#L13164) and `vex_prefix` (https://github.com/openjdk/jdk/blob/6d3f7794ee6658d48eb2120c7bfe66ac412c6d14/src/hotspot/cpu/x86/assembler_x86.cpp#L13047)

I also haven't found any other instruction that does this check so I could emulate the style.

> src/hotspot/cpu/x86/assembler_x86.cpp line 3882:
> 
>> 3880: 
>> 3881: void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
>> 3882:   assert(VM_Version::supports_evex(), "");
> 
> Suggestion:
> 
>   assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), "");

Took the patch, but also kept the supports_evex() assert

> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 114:
> 
>> 112:         rnd.setSeed(seed);
>> 113:         //Note: it might be useful to increase this number during development of new intrinsics
>> 114:         final int repeat = 10000000;
> 
> Instead of high repetition count can you try tuning the tiered compilation threshold.

The purpose of the test is to test various (pseudo-random) values and compare the results to the java implementation of same code. A single run-though of the test doesn't always prove that there are no bugs.

A bit philosophical.. as is well known, when writing crypto, branches (conditional on secret) are disallowed; but e.g. carry propagation has the same 'conditional execution' effect. (Instead of "have you tested every branch direction" its "have you tested every carry") Besides a very careful range/overflow analysis (which I also did.. ntt functions skate very close to the int limit), exhaustive fuzz testing is the best method to find conditions that manual (range/overflow) analysis hasn't found; fuzz testing has very little math built in, so its also good at finding 'blind spots' I (and whomever has to review) might have not thought of..

> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 145:
> 
>> 143:             coeffs1[j] = rnd.nextInt();
>> 144:             coeffs2[j] = rnd.nextInt();
>> 145:         }
> 
> You can uses generators for randome initialization of array

I think you meant this?

        coeffs1 = rnd.ints(ML_DSA_N).toArray();
        coeffs2 = rnd.ints(ML_DSA_N).toArray();

Didn't know about this, thanks. It does work..

But the original purpose (perhaps misguided, but its done) was to 'factor out' the allocations; the outer loop runs many million times (I've left it running for 6+hours during development) and so I wanted a 'somewhat efficient' test.

In hindsight, these (1k) arrays could probably be stack allocated, but I did not want to depend on an optimization when I could just write it without allocations in the mainline

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535460279
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535804056
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535373444
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2535199249