RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13]
Ferenc Rakoczi
duke at openjdk.org
Tue Apr 8 21:29:26 UTC 2025
On Sat, 5 Apr 2025 00:27:05 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Reacting to comment by Sandhya.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 345:
>
>> 343:
>> 344: store4Xmms(coeffs, 0, xmm0_3, _masm);
>> 345: store4Xmms(coeffs, 4 * XMMBYTES, xmm4_7, _masm);
>
> This seems to be unnecessary store.
Thanks for catching that. Changed.
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 370:
>
>> 368: loadPerm(xmm16_19, perms, nttL4PermsIdx, _masm);
>> 369: loadPerm(xmm12_15, perms, nttL4PermsIdx + 64, _masm);
>> 370: load4Xmms(xmm24_27, zetas, 4 * 512, _masm); // for level 3
>
> The comment // for level3 is not relevant here and could be removed.
Ooops. Deleted the comment.
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 802:
>
>> 800: __ evpbroadcastd(zero, scratch, Assembler::AVX_512bit); // 0
>> 801: __ addl(scratch, 1);
>> 802: __ evpbroadcastd(one, scratch, Assembler::AVX_512bit); // 1
>
> A better way to initialize (0, 1, -1) vectors is:
> // load 0 into int vector
> vpxor(zero, zero, zero, Assembler::AVX_512bit);
> // load -1 into int vector
> vpternlogd(minusOne, 0xff, minusOne, minusOne, Assembler::AVX_512bit);
> // load 1 into int vector
> vpsubd(one, zero, minusOne, Assembler::AVX_512bit);
>
> Where minusOne could be xmm31.
>
> A broadcast from r register to xmm register is more expensive.
Changed.
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 982:
>
>> 980: __ evporq(xmm19, k0, xmm19, xmm23, false, Assembler::AVX_512bit);
>> 981:
>> 982: __ evpsubd(xmm12, k0, zero, one, false, Assembler::AVX_512bit); // -1
>
> The -1 initialization could be done outside the loop.
Not really. All registers are used.
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1015:
>
>> 1013: __ addptr(lowPart, 4 * XMMBYTES);
>> 1014: __ cmpl(len, 0);
>> 1015: __ jcc(Assembler::notEqual, L_loop);
>
> It looks to me that subl and cmpl could be merged:
> __ addptr(highPart, 4 * XMMBYTES);
> __ addptr(lowPart, 4 * XMMBYTES);
> __ subl(len, 4 * XMMBYTES);
> __ jcc(Assembler::notEqual, L_loop);
Changed.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057184
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057342
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057700
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057565
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057463
More information about the security-dev
mailing list