RFR: 8351412: Add AVX-512 intrinsics for ML-KEM [v6]
Sandhya Viswanathan
sviswanathan at openjdk.org
Tue May 20 17:17:58 UTC 2025
On Tue, 20 May 2025 11:51:49 GMT, Ferenc Rakoczi <duke at openjdk.org> wrote:
>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 250:
>>
>>> 248: static void montmul(int outputRegs[], int inputRegs1[], int inputRegs2[],
>>> 249:                     int scratchRegs1[], int scratchRegs2[], MacroAssembler *_masm) {
>>> 250:   for (int i = 0; i < 4; i++) {
>>
>> In the intrinsic for montMul we treat MONT_R_BITS as 16 and MONT_Q_INV_MOD_R as 0xF301, whereas in the Java code MONT_R_BITS is 20 and MONT_Q_INV_MOD_R is 0x8F301. Are these equivalent?
>
> As used in this case, they are equivalent. For z = montmul(a, b), z will be between -q and q and congruent to a * b * R^-1 mod q, where R is a power of 2 with R > 2 * q and -R/2 * q <= a * b < R/2 * q. The Java code uses R = 2^20 and the intrinsic uses R = 2^16. In our computations, b is always c * R mod q, so montmul() really computes a * c mod q. The Java code does the computations on 32-bit numbers and uses R = 2^20 because the a * b products that occur during all computations then stay in the required range (the inverse NTT is where they can grow the most), so we don't have to do Barrett reductions during that computation. The intrinsics use R = 2^16 because that way we can do twice as much work in parallel, but we then have to do a Barrett reduction after levels 2 and 4 of the inverse NTT computation.
Thanks a lot for the explanation. It would be good to add it as a comment in the stubGenerator_x86_64_kyber.cpp.
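
For anyone else following the thread, here is a minimal scalar sketch (plain Java, with a hypothetical helper name, not the vectorized stub code) of the reduction described above; it illustrates why R = 2^16 with q^-1 mod R = 0xF301 and R = 2^20 with q^-1 mod R = 0x8F301 give congruent results when b is premultiplied by R mod q:

    // Sketch only: scalar Montgomery multiplication for q = 3329 (ML-KEM).
    // Assumes -R/2 * q <= a * b < R/2 * q, with R = 2^rBits and
    // qInvModR = q^-1 mod R (0xF301 for rBits = 16, 0x8F301 for rBits = 20).
    static int montMul(int a, int b, int rBits, int qInvModR) {
        final int q = 3329;
        long r = 1L << rBits;
        long t = (long) a * b;
        long m = ((t * qInvModR) % r + r) % r;  // t * q^-1 mod R, in [0, R)
        if (m >= r / 2) m -= r;                 // center to [-R/2, R/2)
        return (int) ((t - m * q) >> rBits);    // exact quotient, in (-q, q),
                                                // congruent to a * b * R^-1 mod q
    }

    // With b = c * R mod q, montMul(a, b) is congruent to a * c mod q for
    // either choice of R:
    //   int a = 1234, c = 2500, q = 3329;
    //   int z16 = montMul(a, (int) (((long) c << 16) % q), 16, 0xF301);
    //   int z20 = montMul(a, (int) (((long) c << 20) % q), 20, 0x8F301);
    //   // Math.floorMod(z16, q) == Math.floorMod(z20, q) == (a * c) % q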
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2098491524