RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10]

Ferenc Rakoczi duke at openjdk.org
Mon Mar 31 14:28:24 UTC 2025


On Sat, 22 Mar 2025 16:36:08 GMT, Volodymyr Paprotski <vpaprotski at openjdk.org> wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Fix windows build
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 121:
> 
>> 119: static void montmulEven(int outputReg, int inputReg1,  int inputReg2,
>> 120:                         int scratchReg1, int scratchReg2,
>> 121:                         int parCnt, MacroAssembler *_masm) {
> 
> nitpick.. this could be made to look more like `montMul64()` by also taking in an array of registers.

I eliminated this function.

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 160:
> 
>> 158:   for (int i = 0; i < 4; i++) {
>> 159:     __ vpmuldq(xmm(scratchRegs[i]), xmm(inputRegs1[i]), xmm(inputRegs2[i]),
>> 160:                Assembler::AVX_512bit);
> 
> using an array of registers, instead of array of ints would read somewhat more compact and fewer 'indirections' . i.e.
> 
> static void montMul64(XMMRegister outputRegs*, XMMRegister inputRegs1*, XMMRegister inputRegs2*,
> ...
>     __ vpmuldq(scratchRegs[i], inputRegs1[i], inputRegs2[i], Assembler::AVX_512bit);

I think from the names it is easy enough to see that we are really passing register names here and it is also easy to check that the indexes of the registers in the named arrays are really what the names of those arrays suggest, so I would like to leave this alone.

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 645:
> 
>> 643: // poly1 (int[256]) = c_rarg1
>> 644: // poly2 (int[256]) = c_rarg2
>> 645: static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen,
> 
> This would be 'nice to have', something 'lost' with the refactor..
> 
> As I was reviewing this (original) function, I was thinking, "there is nothing here _that_ specific to AVX512, mostly columnar&independent operations... This function could be made 'vector-length-independent'..."
> - double the loop length:
> 
> int iter = vector_len==Assembler::AVX_512bit?4:8;
> __ movl(len, 4); -> __ movl(len, iter);
> 
> - halve the register arrays.. (or keep them the same but shuffle them to make SURE the first half are in xmm0-xmm15 range)
> 
>   XMMRegister POLY1[] = {xmm0, xmm1, xmm12, xmm13};
>   XMMRegister POLY2[] = {xmm4, xmm5, xmm16, xmm17};
>   XMMRegister SCRATCH1[] = {xmm2, xmm3, xmm14, xmm15}; <<< here
>   XMMRegister SCRATCH2[] = {xmm6, xmm7, xmm18, xmm19}; <<< and here
>   XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11};
> 
> - couple of other int constants (like the memory 'step' and such)
> - for assembler calls, like `evmovdqul` and `evpsubd`, need a few small new MacroAssembler helpers to instead generate VEX encoded versions (plenty of instructions already do that).
> - I think only the perm instruction was unique to evex (didnt really think of an alternative for AVX2.. but can be abstracted away with another helper)
> 
> Anyway; not suggesting its something you do here.. but it would be convenient to leave breadcrumbs/hooks for a future update so one of us can revisit this code and add AVX2 support. e.g. `parCnt` variable was very convenient before for exactly this, now its gone... it probably could be derived in each function from vector_len but..; Its now cleaner, but also harder to 'upgrade'?
> 
> Why AVX2? many of the newer (Atom/Ecore-based/EnableX86ECoreOpts) processors do not have AVX512 support, so its something I've been prioritizing recently
> 
> The alternative would be to write a completely separate AVX2 implementation, but that would be a shame, not to 'just' reuse this code.
>  
> "For fun", I had even gone and parametrized the mult function with the `vector_len` to see how it would look (almost identical... to the original version):
> 
> static void montmulEven2(XMMRegister* outputReg, XMMRegister* inputReg1,  XMMRegister* inputReg2, XMMRegister* scratchReg1, 
>   XMMRegister* scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, int vector_len, ...

I'd like to leave this for another PR.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150150
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150516
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021153931


More information about the hotspot-dev mailing list