RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]
Andrew Haley
aph at openjdk.org
Thu Jun 1 15:03:13 UTC 2023
On Thu, 1 Jun 2023 12:16:45 GMT, Andrew Dinn <adinn at openjdk.org> wrote:
> This comment and the next one both need correcting. They mention U_0HI and U_1HI and, as the previous comment says, those registers are dead.
>
> What actually happens here is best summarized as
>
> // U_2:U_1:U_0 += (U_2 >> 2) * 5
>
> or, if we want to be clearer about the current encoding, which does it in several steps:
>
> // rscratch1 = U_2 >> 2
> // U_2 = U_2[1:0]
> // U_2:U_1:U_0 += rscratch1
> // U_2:U_1:U_0 += (rscratch1 << 2)
>
> i.e. any bits that are set from 130 upwards are masked off, treated as an integer in their own right, multiplied by 5, and the result added back in at the bottom to update the 130-bit result U_2[1:0]:U_1[63:0]:U_0[63:0].
OK.
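For anyone reading along, here is a minimal scalar model of that reduction step in C++. The limb names mirror the registers, but the function and its carry handling are my own sketch, not lifted from the stub:

#include <cstdint>

// Partial reduction: fold the bits from 130 upwards back into the
// accumulator, using 2^130 == 5 (mod 2^130 - 5).
static void partial_reduce(uint64_t& u0, uint64_t& u1, uint64_t& u2) {
    uint64_t hi = u2 >> 2;        // rscratch1 = U_2 >> 2
    u2 &= 3;                      // U_2 = U_2[1:0]

    // U_2:U_1:U_0 += hi (first add-and-carry chain)
    u0 += hi;
    uint64_t c = (u0 < hi);
    u1 += c;
    u2 += (u1 < c);

    // U_2:U_1:U_0 += hi << 2 (second chain; the two together add hi * 5)
    uint64_t hi4 = hi << 2;       // hi has at most 62 bits, so this fits
    u0 += hi4;
    c = (u0 < hi4);
    u1 += c;
    u2 += (u1 < c);
}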
> I'm not sure whether this provides an opportunity for you to optimize this by doing the multiply by five earlier, i.e. replacing the code with this version
I'm not sure either, which is why it's done in two separate steps. I think you may be right, but it's a bit late to be optimizing this version any further. That would require careful analysis and a redo of all the testing.
> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 bits. Is that why you have implemented two add and carry steps?
Indeed.
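To make the hazard concrete, a standalone check (the 62-bit worst case below comes from the register width alone, not from an analysis of the accumulator's actual range):

#include <cassert>
#include <cstdint>

int main() {
    // With hi = U_2 >> 2 occupying up to 62 bits, hi * 5 needs up to
    // 65 bits and silently wraps in a 64-bit register ...
    uint64_t hi = (UINT64_C(1) << 62) - 1;  // widest value a 64-bit U_2 allows
    uint64_t wrapped = hi * 5;              // wraps modulo 2^64
    assert(wrapped < hi);                   // the product overflowed
    // ... whereas hi and hi << 2 each still fit in 64 bits, so adding
    // them in two separate carry chains is always safe.
    uint64_t lo = hi, lo4 = hi << 2;
    assert(lo4 > lo);                       // no wrap: 2^64 - 4 > 2^62 - 1
    return 0;
}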
> If so then why is it legitimate to do the multiply by 5 up front in the final reduction that follows the loop?
I assume that you're referring to the multiply by 5 in
// Further reduce modulo 2^130 - 5
__ lsr(rscratch1, U_2, 2);
__ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 = (U_2 >> 2) * 5
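If it helps to see why the up-front multiply is safe at this point, here is the same step in scalar form. The justification is my reading, not stated in the thread: after the loop the accumulator has already been partially reduced, so U_2 holds only a couple of bits and (U_2 >> 2) * 5 cannot overflow 64 bits.

#include <cstdint>

// Final reduction, as quoted above, in scalar form. Assumption: U_2
// holds only a few significant bits here, so multiplying by 5 fits.
static void final_reduce(uint64_t& u0, uint64_t& u1, uint64_t& u2) {
    uint64_t hi = u2 >> 2;        // lsr rscratch1, U_2, 2
    hi += hi << 2;                // add rscratch1, rscratch1, rscratch1, LSL #2
    u2 &= 3;                      // keep U_2[1:0]
    u0 += hi;                     // fold (U_2 >> 2) * 5 back in, with carries
    uint64_t c = (u0 < hi);
    u1 += c;
    u2 += (u1 < c);
}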
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213294906