RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Thu Jun 1 12:21:10 UTC 2023

On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> This provides a solid speedup of about 3-4x over the Java implementation.
>> 
>> I have a vectorized version of this which uses a bunch of tricks to speed it up, but it's complex and can still be improved. We're getting close to ramp down, so I'm submitting this simple intrinsic so that we can get it reviewed in time.
>> 
>> Benchmarks:
>> 
>> 
>> ThunderX (2, I think):
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  14078352.014 ± 4201407.966  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   5154958.794 ± 1717146.980  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   1416563.273 ± 1311809.454  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3     94059.570 ±    2913.021  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      1441.024 ±     164.443  ops/s
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  4516486.795 ± 419624.224  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3  1228542.774 ± 202815.694  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   316051.912 ±  23066.449  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3    20649.561 ±   1094.687  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      310.564 ±     31.053  ops/s
>> 
>> Apple M1:
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score        Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  33551968.946 ± 849843.905  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   9911637.214 ±  63417.224  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   2604370.740 ±  29208.265  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3    165183.633 ±   1975.998  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      2587.132 ±     40.240  ops/s
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score        Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  12373649.589 ± 184757.721  ops/s
>> Poly1305DigestBench.upd...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135:

> 7133:       regs = (regs.remaining() + U_0HI + U_1HI).begin();
> 7134: 
> 7135:       // U_2:U_1:U_0 += (U_1HI >> 2)

This comment and the next one both need correcting. They mention U_0HI and U_1HI and, as the previous comment says, those registers are dead.

What actually happens here is best summarized as

// U_2:U_1:U_0 += (U2 >> 2) * 5

or, if we actually want to be clearer about the current encoding which does it in several steps

// rscratch1 = (U2 >> 2)
// U2 = U2[1:0]
// U_2:U_1:U_0 += rscratch1
// U_2:U_1:U_0 += (rscratch1 << 2)

i.e. any bits that are set from 130 upwards are masked off, treated as an integer in their own right, multiplied by 5 and the result added back in at the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].

I'm not sure whether this provides an opportunity for you to optimize this by doing the multiply by five earlier i.e. replace the code with this version

    // rscratch1 = (U2 >> 2) * 5
    __ lsr(rscratch1, U_2, 2);
    __ add(rscratch1, rscratch1, scratch1, Assembler::LSL, 2);
    // U2 = U2[1:0]
    __ andr(U_2, U_2, (u8)3);
    // U2:U1:U0 += rscratch1
    __ adds(U_0, U_0, rscratch1);
    __ adcs(U_1, U_1, zr);
    __ adc(U_2, U_2, zr);

The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 bits. Is that why you have implemented two add and carry steps? If so then why is it legitimate to do the multiply by 5 up front in the final reduction that follows the loop?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213062966