RFR: 8350459: MontgomeryIntegerPolynomialP256 multiply intrinsic with AVX2 on x86_64 [v2]

Tue Mar 4 21:16:15 UTC 2025

On Thu, 27 Feb 2025 19:05:50 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   comments from Sandhya
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 397:
> 
>> 395:         __ xorq(acc2, acc2);
>> 396:         __ addq(acc1, tmp_rax);
>> 397:         __ adcq(acc2, tmp_rdx);
> 
> Why adcq here instead of addq? The vector code doesn't do that.

Its a difference in multiply instruction, how the 'high' and 'low' parts are handled.
i.e. Given that inputs a and b are 52 bits:
- mulq a * b = 40:64 bits in high/low
- vpmadd52{l,h}uq = 52:52

vpmadd52 (by design) leaves upper 12 bits for carry propagation, whereas with mulq, we have to do the propagation 'immediately;.

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 424:
> 
>> 422:       __ shrq(acc1, 52); // low 52 of acc1 ignored, is zero, because Montgomery
>> 423: 
>> 424:       // Acc2[0] += carry
> 
> This is more like shift in carry into lower bits of Acc2[0] so comment could be updated.

Hmm. Shift implies (to me?) overriding Acc2, which isn't the case entirely (there is a slight overlap). But.. what is happening here is also not 100% obvious. 

acc2 (i.e. Acc2[0]) is the upper 40+ bits of the multiply result and acc1 is here the 12+ bit carry (leftover); It is not correctly lined up (yet) because how vpmaddq vs mulq produce high/low parts. (its actually overlapping 13 and 41 bits, to get 54 bits, but we have 12 bits to spare in 64bit reg so no need to be exact)

I tried adjusting the 'heading' comment to the function to explain this flow better; hope its better?

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 441:
> 
>> 439:   __ subq(acc2, modulus);
>> 440:   __ vpsubq(Acc2, Acc1, Modulus, Assembler::AVX_256bit);
>> 441:   __ vmovdqu(Address(rsp, -32), Acc2); //Assembler::AVX_256bit
> 
> Need to first create space on stack and then store temp.

Done. Also added stack alignment so I can use aligned spill. Also using just one spill slot instead of two. Re-ran fuzzing tests

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 465:
> 
>> 463: 
>> 464:   // Now carry propagate the multiply result and (constant-time) select correct
>> 465:   // output digit
> 
> Carry propagate multiply result is done before subtracting modulus in the Java code.

That was intentional.. I believe this is faster..

While in vector 'domain', subtract is essentially one vector operation and one scalar operation.. first carry propagation is expensive chain, but second 'hides' within the first propagation.

Conversely, with Java ordering: (1) carry propagate Acc1, (2) scalar subtract Acc2, (3) carry propagate Acc2; the critical path is longer (and subtract isnt vectorized)

PS: "why didn't you do java the same way" :) I chose clarity for the java code; also java jitted code is in 'scalar domain', so there are plenty of partial sums for out-of-order execution to keep the pipeline fed. Its the vector-to-scalar crossover here that requires the special handling

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 467:
> 
>> 465:   // output digit
>> 466:   Register digit = acc1;
>> 467:   __ vmovdqu(Address(rsp, -64), Acc1); //Assembler::AVX_256bit
> 
> Need to first create space on stack and then store.

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 475:
> 
>> 473:     }
>> 474:     __ movq(carry, digit);
>> 475:     __ sarq(carry, 52);
> 
> This was unsigned or logical shift in Java code.

For Acc1, the limbs are all positive, around 54-bits. sarq is important for the Acc2 propagation (I added a comment there). I used sarq here mostly for symmetry since it really doesnt matter mathematically

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 556:
> 
>> 554: //  - constant time (i.e. no branches)
>> 555: //  - no-side channel (i.e. all memory must always be accessed, and in same order)
>> 556: void assign_avx(Register aBase, Register bBase, int offset, XMMRegister select, XMMRegister tmp, XMMRegister aTmp, int vector_len, MacroAssembler* _masm) {
> 
> Good to add the comment from assign_scalar here as well:
> // Original java:
>   // long dummyLimbs = maskValue & (a[i] ^ b[i]);
>   // a[i] = dummyLimbs ^ a[i];

done

> src/java.base/share/classes/sun/security/util/math/intpoly/MontgomeryIntegerPolynomialP256.java line 423:
> 
>> 421:         r[2] = ((c7 & mask) | (c2 & ~mask));
>> 422:         r[3] = ((c8 & mask) | (c3 & ~mask));
>> 423:         r[4] = ((c9 & mask) | (c4 & ~mask));
> 
> It would be good to add a comment here indicating that if the result (c9 - c5) had overflown by one modulus, result - modulus (c4-c0) would be positive else it would be negative.  i.e. Upper bits of c4 would be all zeroes on overflow otherwise upper bits of c4 would be all ones.  Thus on overflow, return value "r" should be set to result - modulus (c4 - c0) else it should be set to result (c9-c5).

done (I think.. please double-check if my version of the comment is helpful)

> test/jdk/com/sun/security/util/math/intpoly/MontgomeryPolynomialFuzzTest.java line 2:
> 
>> 1: /*
>> 2:  * Copyright (c) 2025, Intel Corporation. All rights reserved.
> 
> This should be Copyright (c) 2024, 2025, Intel Corporation. All rights reserved.

done, thanks

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1978375998
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1980186231
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1980187534
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1980204172
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1980194853
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1980209434
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1980226208
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1978307783
PR Review Comment: https://git.openjdk.org/jdk/pull/23719#discussion_r1978307045