RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13]

Tue Apr 8 21:29:26 UTC 2025

On Sat, 5 Apr 2025 00:27:05 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Reacting to comment by Sandhya.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 345:
> 
>> 343: 
>> 344:   store4Xmms(coeffs, 0, xmm0_3, _masm);
>> 345:   store4Xmms(coeffs, 4 * XMMBYTES, xmm4_7, _masm);
> 
> This seems to be unnecessary store.

Thanks for catching that. Changed.

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 370:
> 
>> 368:   loadPerm(xmm16_19, perms, nttL4PermsIdx, _masm);
>> 369:   loadPerm(xmm12_15, perms, nttL4PermsIdx + 64, _masm);
>> 370:   load4Xmms(xmm24_27, zetas, 4 * 512, _masm); // for level 3
> 
> The comment // for level3 is not relevant here and could be removed.

Ooops. Deleted the comment.

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 802:
> 
>> 800:   __ evpbroadcastd(zero, scratch, Assembler::AVX_512bit); // 0
>> 801:   __ addl(scratch, 1);
>> 802:   __ evpbroadcastd(one, scratch, Assembler::AVX_512bit); // 1
> 
> A better way to initialize (0, 1, -1) vectors is:
> // load 0 into int vector
> vpxor(zero, zero, zero, Assembler::AVX_512bit);
> // load -1 into int vector
> vpternlogd(minusOne, 0xff, minusOne, minusOne, Assembler::AVX_512bit);
> // load 1 into int vector
> vpsubd(one, zero, minusOne, Assembler::AVX_512bit);
> 
> Where minusOne could be xmm31. 
> 
> A broadcast from r register to xmm register is more expensive.

Changed.

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 982:
> 
>> 980:   __ evporq(xmm19, k0, xmm19, xmm23, false, Assembler::AVX_512bit);
>> 981: 
>> 982:   __ evpsubd(xmm12, k0, zero, one, false, Assembler::AVX_512bit); // -1
> 
> The -1 initialization could be done outside the loop.

Not really. All registers are used.

> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1015:
> 
>> 1013:   __ addptr(lowPart, 4 * XMMBYTES);
>> 1014:   __ cmpl(len, 0);
>> 1015:   __ jcc(Assembler::notEqual, L_loop);
> 
> It looks to me that subl and cmpl could be merged:
>   __ addptr(highPart, 4 * XMMBYTES);
>   __ addptr(lowPart, 4 * XMMBYTES);
>   __ subl(len, 4 * XMMBYTES);
>   __ jcc(Assembler::notEqual, L_loop);

Changed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057184
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057342
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057700
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057565
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057463