RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Fri Jun 2 11:05:06 UTC 2023

On Fri, 2 Jun 2023 09:58:59 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> Yes, of course, you are right that  0<= U_2 < 6 at the point where that second multiply by 5 occurs (i.e. after the loop).
>> 
>> I believe it is safe to use the same optimization inside the loop for reasons given below. Of course it is a bit late to change this now and retest but if my reasoning is correct then we could consider updating this post release and, maybe, a backport.
>> 
>> The only thing that needs to be determined is what value could sit in U2 when we enter the loop. That's the only important case because we already agreed that at the loop back edge that 0 <= U2 < 6.
>> 
>> The incoming value for U2 at loop entry is derived by the following subsequence of the instruction stream
>> 
>>       __ adcs(S_1, U_1, S_1);
>>       __ adc(S_2, U_2, zr);         // A.1
>>       __ add(S_2, S_2, 1);          // A.2
>>      . . .
>>        wide_mul(U_1, U_1HI, S_0, R_1);  wide_madd(U_1, U_1HI, S_1, R_0);  wide_madd(U_1, U_1HI, S_2, RR_1); // B
>>      . . .
>>       __ andr(U_2, R_0, 3);         // C
>>       __ mul(U_2, S_2, U_2);        // D
>>       . . .
>> 
>>       __ adc(U_2, U_1HI, U_2);      // E
>> 
>> At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26)
>> So, at A.2 we know that 0 <= S2 <= 5
>> 
>> At B we know that 0 <= RR_1 <= (2^60 - 2^2)  = FFFFFFF_FFFFFFFC (top 4 and bottom 2 bits of RR_1 are clear)
>> So 0 <= U1_HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC
>> 
>> At C we know 0 <= U_2 <= 3
>> 
>> At D we know 0 <= U_2 <= 15
>> 
>> So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + 15 + 1
>> 
>> So, the highest possible value for U_2 at loop entry is 50000000_00000002.
>> 
>> Clearly we can shift this down by two and add without any danger of overflowing
>> 
>>  50000000_00000002 >> 2 + 50000000_00000002 = 64000000_00000002
>
> Ah, no scratch that. I have made a wrong assumption at B. The value of U1_HI is bounded by the sum of the 3 64 bit * 64 bit multiplies.

I think there is still a proof of validity to be salvaged though. We compute a 128 bit product:

  U1_HI:U1 = S_0 * R_1 + S_1 * R_0 + S_2 * RR_1

We know that R_0 and R_1 have four top bits clear and S2 <= 5. So, I think we can guarantee that the top word of the 128 bit product is small enough to not to overflow when we do the shift. Even if we assume S_0, S_1 and RR_1 have the maximum possible value we have

S_0 * R_1 <= (2^64 - 1) * (2^60 - 1)
S_1 * R_0 <= (2^64 - 1) * (2^60 - 1)
S_2 * RR_1  <= 5 * (2^64 - 1)

So, the top word U1_HI is bounded by at most (2 * (2^60 - 1)) + 5. That leaves more than enough room to guarantee that the shift and add will not overflow.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214230004