RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Andrew Dinn adinn at openjdk.org
Fri Jun 2 09:57:09 UTC 2023


On Thu, 1 Jun 2023 16:06:40 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135:
>> 
>>> 7133:       regs = (regs.remaining() + U_0HI + U_1HI).begin();
>>> 7134: 
>>> 7135:       // U_2:U_1:U_0 += (U_1HI >> 2)
>> 
>> This comment and the next one both need correcting. They mention U_0HI and U_1HI and, as the previous comment says, those registers are dead.
>> 
>> What actually happens here is best summarized as
>> 
>> // U_2:U_1:U_0 += (U2 >> 2) * 5
>> 
>> or, if we actually want to be clearer about the current encoding which does it in several steps
>> 
>> // rscratch1 = (U2 >> 2)
>> // U2 = U2[1:0]
>> // U_2:U_1:U_0 += rscratch1
>> // U_2:U_1:U_0 += (rscratch1 << 2)
>> 
>> i.e. any bits that are set from 130 upwards are masked off, treated as an integer in their own right, multiplied by 5 and the result added back in at the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].
>> 
>> I'm not sure whether this provides an opportunity for you to optimize this by doing the multiply by five earlier i.e. replace the code with this version
>> 
>>     // rscratch1 = (U2 >> 2) * 5
>>     __ lsr(rscratch1, U_2, 2);
>>     __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2);
>>     // U2 = U2[1:0]
>>     __ andr(U_2, U_2, (u8)3);
>>     // U2:U1:U0 += rscratch1
>>     __ adds(U_0, U_0, rscratch1);
>>     __ adcs(U_1, U_1, zr);
>>     __ adc(U_2, U_2, zr);
>> 
>> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 bits. Is that why you have implemented two add and carry steps? If so then why is it legitimate to do the multiply by 5 up front in the final reduction that follows the loop?
>
>> This comment and the next one both need correcting. They mention U_0HI and U_1HI and, as the previous comment says, those registers are dead.
>>
>> What actually happens here is best summarized as
>>
>> // U_2:U_1:U_0 += (U2 >> 2) * 5
>>
>> or, if we actually want to be clearer about the current encoding which does it in several steps
>>
>> // rscratch1 = (U2 >> 2)
>> // U2 = U2[1:0]
>> // U_2:U_1:U_0 += rscratch1
>> // U_2:U_1:U_0 += (rscratch1 << 2)
>>
>> i.e. any bits that are set from 130 upwards are masked off, treated as an integer in their own right, multiplied by 5 and the result added back in at the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].
> 
> OK.
> 
>> I'm not sure whether this provides an opportunity for you to optimize this by doing the multiply by five earlier i.e. replace the code with this version
> 
> I'm not sure either, which is why it's done in two separate steps. I think you may be right, but it's a bit late to be optimizing this version any further. That would require careful analysis and a redo of all the testing.
> 
>> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 bits. Is that why you have implemented two add and carry steps?
> 
> Indeed.
> 
>> If so then why is it legitimate to do the multiply by 5 up front in the final reduction that follows the loop?
> 
> I assume that you're referring to the multiply by 5 in
> 
> 
> // Further reduce modulo 2^130 - 5
>     __ lsr(rscratch1, U_2, 2);
>     __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 = U_2 * 5
> 
> 
> `U_2`, at this point, has only a few of its low bits set. This is because `U_2` was previously ANDed with 3 and was subsequently the target of `adc(U_2, U_2, zr)` twice. So I think that `U_2 <= 6`.

Yes, of course, you are right that 0 <= U_2 < 6 at the point where that second multiply by 5 occurs (i.e. after the loop).

I believe it is safe to use the same optimization inside the loop, for the reasons given below. Of course it is a bit late to change this now and retest, but if my reasoning is correct then we could consider updating this post-release and, maybe, in a backport.
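(Recapping the identity both the in-loop and the final reduction rely on, just so the argument below is self-contained: since 2^130 = (2^130 - 5) + 5, any value x = hi * 2^130 + lo satisfies x mod (2^130 - 5) = (5 * hi + lo) mod (2^130 - 5). That is why the bits from 130 upwards can be treated as an integer in their own right, multiplied by 5 and added back in at the bottom, exactly as described above.)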

The only thing that needs to be determined is what value could sit in U_2 when we enter the loop. That's the only case that matters, because we have already agreed that at the loop back edge 0 <= U_2 < 6.

The incoming value of U_2 at loop entry is derived by the following subsequence of the instruction stream:

      __ adcs(S_1, U_1, S_1);
      __ adc(S_2, U_2, zr);         // A.1
      __ add(S_2, S_2, 1);          // A.2
      . . .
      wide_mul(U_1, U_1HI, S_0, R_1);  wide_madd(U_1, U_1HI, S_1, R_0);  wide_madd(U_1, U_1HI, S_2, RR_1); // B
      . . .
      __ andr(U_2, R_0, 3);         // C
      __ mul(U_2, S_2, U_2);        // D
      . . .
      __ adc(U_2, U_1HI, U_2);      // E

At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26) and the adc can add at most a single carry, so 0 <= S_2 <= 4.
So, after the add at A.2, we know that 0 <= S_2 <= 5.

At B we know that 0 <= RR_1 <= (2^60 - 2^2) = FFFFFFF_FFFFFFFC (the top 4 and bottom 2 bits of RR_1 are clear).
So 0 <= U_1HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC.

At C we know 0 <= U_2 <= 3.

At D we know 0 <= U_2 <= 15 (i.e. at most 5 * 3 = F in hex).

So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + F + 1 = 4FFFFFFF_FFFFFFFC.

So, the highest possible value for U_2 at loop entry is 4FFFFFFF_FFFFFFFC, which is comfortably below 50000000_00000002.

Clearly, even using that round upper bound, we can shift it down by two and add without any danger of overflowing 64 bits:

 (50000000_00000002 >> 2) + 50000000_00000002 = 64000000_00000002
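
As a sanity check on that bound (just a throwaway host-side check, nothing that belongs in the stub generator), plugging the worst-case loop-entry value into 128-bit arithmetic confirms that the fused rscratch1 = (U_2 >> 2) * 5 cannot overflow a 64-bit register. The constant and the use of the GCC/Clang unsigned __int128 extension are mine, purely for illustration:

    // Throwaway check: with the worst-case loop-entry bound on U_2 derived
    // above, (U_2 >> 2) * 5 still fits comfortably in 64 bits.
    #include <cassert>
    #include <cstdint>

    int main() {
      const uint64_t u2_max = 0x5000000000000002ULL;              // bound from the analysis above
      unsigned __int128 r = (unsigned __int128)(u2_max >> 2) * 5; // rscratch1 = (U_2 >> 2) * 5
      assert(r <= (unsigned __int128)UINT64_MAX);                 // no 64-bit overflow
      return 0;
    }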

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214167215


