RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7]

Volodymyr Paprotski vpaprotski at openjdk.org
Sat Mar 22 20:05:12 UTC 2025


On Thu, 20 Mar 2025 21:06:30 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58:
>> 
>>> 56: 
>>> 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = {
>>> 58:      // collect montmul results into the destination register
>> 
>> same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`).
>> 
>> Could be split into three constant arrays to make the compiler count for us
>
> Well, it is 64 bytes per  line (16 4-byte uint32_ts), not that hard :-) ...

Ha! I didn't realize it was 16 per line.. ran out of fingers while counting!!! :)

'works for me, as long as its a "premeditated" decision'

>> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980:
>> 
>>> 978: // Dilithium multiply polynomials in the NTT domain.
>>> 979: // Implements
>>> 980: // static int implDilithiumNttMult(
>> 
>> I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size.
>> 
>> Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side.
>
> These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size.

Works for me; just thought I would point it out, so its a 'premeditated' decision.

>> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012:
>> 
>>> 1010:   __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit);
>>> 1011: 
>>> 1012:   __ movl(len, 4);
>> 
>> Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead?
>
> I have found that unrolling these loops actually hurts performance (probably an I-cache effect.

Interesting; I keep on having to re-train my intuition, thanks for the data

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008806159
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008805574
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008805113


More information about the hotspot-dev mailing list