RFR: 8351412: Add AVX-512 intrinsics for ML-KEM [v6]
Ferenc Rakoczi
duke at openjdk.org
Tue May 20 11:56:52 UTC 2025
On Fri, 16 May 2025 00:28:18 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Response to review comment + loading constants with broadcast op.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 250:
>
>> 248: static void montmul(int outputRegs[], int inputRegs1[], int inputRegs2[],
>> 249: int scratchRegs1[], int scratchRegs2[], MacroAssembler *_masm) {
>> 250: for (int i = 0; i < 4; i++) {
>
> In the intrinsic for montMul we treat MONT_R_BITS as 16 and MONT_Q_INV_MOD_R as 0xF301, whereas in the Java code MONT_R_BITS is 20 and MONT_Q_INV_MOD_R is 0x8F301. Are these equivalent?
As used in this case, they are equivalent. z = montmul(a, b) lies between -q and q and is congruent to a * b * R^-1 mod q, provided that R is a power of two with R > 2 * q and that -R/2 * q <= a * b < R/2 * q. The Java code uses R = 2^20 and the intrinsic uses R = 2^16. In our computations b is always c * R mod q, so montmul() really computes a * c mod q.

In the Java code we do the computations on 32-bit numbers and use R = 2^20, because that way the a * b products stay within the required range throughout (the inverse NTT is where they can grow the most), so we don't need any Barrett reductions during that computation. For the intrinsics we use R = 2^16, because that lets us do twice as much work in parallel, at the cost of a Barrett reduction after levels 2 and 4 of the inverse NTT.
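To make the equivalence concrete, here is a minimal, self-contained sketch of a signed Montgomery reduction with both parameter sets. This is not the JDK code; the names montMul16/montMul20 and the check in main() are made up for illustration. It assumes q = 3329 (the ML-KEM modulus), with 0xF301 = 3329^-1 mod 2^16 and 0x8F301 = 3329^-1 mod 2^20 as in the constants quoted above:

public class MontMulSketch {
    static final int Q = 3329; // ML-KEM modulus

    // Montgomery multiplication with R = 2^16, using Q^-1 mod 2^16 = 0xF301.
    // Precondition: -2^15 * Q <= a * b < 2^15 * Q.
    // Returns a value in (-Q, Q) congruent to a * b * 2^-16 (mod Q).
    static int montMul16(int a, int b) {
        long c = (long) a * b;
        long t = (short) (c * 0xF301);      // t = c * Q^-1 mod 2^16, sign-extended
        return (int) ((c - t * Q) >> 16);   // c - t*Q is an exact multiple of 2^16
    }

    // The same reduction with R = 2^20, using Q^-1 mod 2^20 = 0x8F301.
    // Precondition: -2^19 * Q <= a * b < 2^19 * Q.
    // Returns a value in (-Q, Q) congruent to a * b * 2^-20 (mod Q).
    static int montMul20(int a, int b) {
        long c = (long) a * b;
        long t = (c * 0x8F301 << 44) >> 44; // t = c * Q^-1 mod 2^20, sign-extended
        return (int) ((c - t * Q) >> 20);   // c - t*Q is an exact multiple of 2^20
    }

    public static void main(String[] args) {
        // b is always supplied as c * R mod Q, so both variants end up
        // computing a * c mod Q, whichever R was used for the pre-multiplication.
        int a = 1234, c = 2500;
        int b16 = (int) ((long) c * (1 << 16) % Q);
        int b20 = (int) ((long) c * (1 << 20) % Q);
        int r16 = ((montMul16(a, b16) % Q) + Q) % Q;
        int r20 = ((montMul20(a, b20) % Q) + Q) % Q;
        System.out.println(r16 + " " + r20 + " " + (int) ((long) a * c % Q)); // all equal
    }
}

With b pre-multiplied by the matching R mod q, both variants return a * c mod q (up to the final normalization into [0, q)), which is why the 16-bit and 20-bit parameter sets are interchangeable here.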
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2097757145