RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]
Andrew Dinn
adinn at openjdk.org
Thu Jun 1 12:21:10 UTC 2023
On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This provides a solid speedup of about 3-4x over the Java implementation.
>>
>> I have a vectorized version of this which uses a bunch of tricks to speed it up, but it's complex and can still be improved. We're getting close to ramp down, so I'm submitting this simple intrinsic so that we can get it reviewed in time.
>>
>> Benchmarks:
>>
>>
>> ThunderX (2, I think):
>>
>> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
>> Poly1305DigestBench.updateBytes 64 thrpt 3 14078352.014 ± 4201407.966 ops/s
>> Poly1305DigestBench.updateBytes 256 thrpt 3 5154958.794 ± 1717146.980 ops/s
>> Poly1305DigestBench.updateBytes 1024 thrpt 3 1416563.273 ± 1311809.454 ops/s
>> Poly1305DigestBench.updateBytes 16384 thrpt 3 94059.570 ± 2913.021 ops/s
>> Poly1305DigestBench.updateBytes 1048576 thrpt 3 1441.024 ± 164.443 ops/s
>>
>> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
>> Poly1305DigestBench.updateBytes 64 thrpt 3 4516486.795 ± 419624.224 ops/s
>> Poly1305DigestBench.updateBytes 256 thrpt 3 1228542.774 ± 202815.694 ops/s
>> Poly1305DigestBench.updateBytes 1024 thrpt 3 316051.912 ± 23066.449 ops/s
>> Poly1305DigestBench.updateBytes 16384 thrpt 3 20649.561 ± 1094.687 ops/s
>> Poly1305DigestBench.updateBytes 1048576 thrpt 3 310.564 ± 31.053 ops/s
>>
>> Apple M1:
>>
>> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
>> Poly1305DigestBench.updateBytes 64 thrpt 3 33551968.946 ± 849843.905 ops/s
>> Poly1305DigestBench.updateBytes 256 thrpt 3 9911637.214 ± 63417.224 ops/s
>> Poly1305DigestBench.updateBytes 1024 thrpt 3 2604370.740 ± 29208.265 ops/s
>> Poly1305DigestBench.updateBytes 16384 thrpt 3 165183.633 ± 1975.998 ops/s
>> Poly1305DigestBench.updateBytes 1048576 thrpt 3 2587.132 ± 40.240 ops/s
>>
>> Benchmark (dataSize) (provider) Mode Cnt Score Error Units
>> Poly1305DigestBench.updateBytes 64 thrpt 3 12373649.589 ± 184757.721 ops/s
>> Poly1305DigestBench.upd...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> Review comments
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135:
> 7133: regs = (regs.remaining() + U_0HI + U_1HI).begin();
> 7134:
> 7135: // U_2:U_1:U_0 += (U_1HI >> 2)
This comment and the next one both need correcting. They mention U_0HI and U_1HI and, as the previous comment says, those registers are dead.
What actually happens here is best summarized as
// U_2:U_1:U_0 += (U2 >> 2) * 5
or, if we actually want to be clearer about the current encoding which does it in several steps
// rscratch1 = (U2 >> 2)
// U2 = U2[1:0]
// U_2:U_1:U_0 += rscratch1
// U_2:U_1:U_0 += (rscratch1 << 2)
i.e. any bits that are set from 130 upwards are masked off, treated as an integer in their own right, multiplied by 5 and the result added back in at the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].
I'm not sure whether this provides an opportunity for you to optimize this by doing the multiply by five earlier i.e. replace the code with this version
// rscratch1 = (U2 >> 2) * 5
__ lsr(rscratch1, U_2, 2);
__ add(rscratch1, rscratch1, scratch1, Assembler::LSL, 2);
// U2 = U2[1:0]
__ andr(U_2, U_2, (u8)3);
// U2:U1:U0 += rscratch1
__ adds(U_0, U_0, rscratch1);
__ adcs(U_1, U_1, zr);
__ adc(U_2, U_2, zr);
The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 bits. Is that why you have implemented two add and carry steps? If so then why is it legitimate to do the multiply by 5 up front in the final reduction that follows the loop?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213062966
More information about the hotspot-dev
mailing list