RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Andrew Haley aph at openjdk.org
Thu Jun 1 16:09:07 UTC 2023


On Thu, 1 Jun 2023 15:00:26 GMT, Andrew Haley <aph at openjdk.org> wrote:

> This comment and the next one both need correcting. They mention U_0HI and U_1HI and, as the previous comment says, those registers are dead.
>
> What actually happens here is best summarized as
>
> // U_2:U_1:U_0 += (U2 >> 2) * 5
>
> or, if we actually want to be clearer about the current encoding which does it in several steps
>
> // rscratch1 = (U2 >> 2) // U2 = U2[1:0] // U_2:U_1:U_0 += rscratch1 // U_2:U_1:U_0 += (rscratch1 << 2)
>
> i.e. any bits that are set from 130 upwards are masked off, treated as an integer in their own right, multiplied by 5 and the result added back in at the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].

OK.

> I'm not sure whether this provides an opportunity for you to optimize this by doing the multiply by five earlier i.e. replace the code with this version

I'm not sure either, which is why it's done in two separate steps. I think you may be right, but it's a bit late to be optimizing this version any further. That would require careful analysis and a redo of all the testing.

> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 bits. Is that why you have implemented two add and carry steps?

Indeed.

> If so then why is it legitimate to do the multiply by 5 up front in the final reduction that follows the loop?

I assume that you're referring to the multiply by 5 in


// Further reduce modulo 2^130 - 5                                                                                                                                                                                                                                                                                                                           
    __ lsr(rscratch1, U_2, 2);
    __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 = U_2 * 5                                                                                                                                                                                                                                                                       


`U_2`, at this point, has only a few lower set bits. This is because `U_2` was previously ANDed with 3, and subsequently was the target of adc(U_2, U_2, zr).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213382344



More information about the security-dev mailing list