RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]
Claes Redestad
redestad at openjdk.org
Wed May 24 10:11:55 UTC 2023
On Wed, 24 May 2023 09:25:06 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This provides a solid speedup of about 3-4x over the Java implementation.
>>
>> I have a vectorized version of this which uses a bunch of tricks to speed it up, but it's complex and can still be improved. We're getting close to ramp down, so I'm submitting this simple intrinsic so that we can get it reviewed in time.
>>
>> Benchmarks:
>>
>>
>> ThunderX (2, I think):
>>
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  14078352.014 ± 4201407.966  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   5154958.794 ± 1717146.980  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   1416563.273 ± 1311809.454  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3     94059.570 ±    2913.021  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      1441.024 ±     164.443  ops/s
>>
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3   4516486.795 ±  419624.224  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   1228542.774 ±  202815.694  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3    316051.912 ±   23066.449  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3     20649.561 ±    1094.687  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3       310.564 ±      31.053  ops/s
>>
>> Apple M1:
>>
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  33551968.946 ±  849843.905  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   9911637.214 ±   63417.224  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   2604370.740 ±   29208.265  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3    165183.633 ±    1975.998  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      2587.132 ±      40.240  ops/s
>>
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  12373649.589 ±  184757.721  ops/s
>> Poly1305DigestBench.upd...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> Whitespace
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7097:
> 7095: // together partial products without any risk of needing to
> 7096: // propagate a carry out.
> 7097: wide_mul(U_0, U_0HI, S_0, R_0); wide_madd(U_0, U_0HI, S_1, RR_1); wide_madd(U_0, U_0HI, S_2, RR_0);
What does `r` correspond to here? The comment asserts that 'the top four bits of each 32-bit subword of "r" are zero'. If `r` means `R_0...R_2`, this looks broken: we pack 26-bit values into `R_0...R_2` above, which would seem to violate that invariant.
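
To make the question concrete, here is a minimal standalone sketch of the property being asked about. It is not the HotSpot stub code. It assumes that `r` is the Poly1305 key half clamped as in RFC 8439, and that the packing step simply concatenates five 26-bit limbs back into 64-bit words. The names `clamp_r`, `to_limbs26`, `pack26` and `top4_clear` are hypothetical helpers made up for this illustration.

```cpp
#include <array>
#include <cstdint>

// Clamp the 128-bit "r" (two little-endian 64-bit words) with the
// RFC 8439 mask 0x0ffffffc0ffffffc0ffffffc0fffffff.
void clamp_r(uint64_t& lo, uint64_t& hi) {
    lo &= 0x0ffffffc0fffffffULL;
    hi &= 0x0ffffffc0ffffffcULL;
}

// Split a 128-bit value into five 26-bit limbs, least significant first.
// The last limb only receives the top 24 bits of the 128-bit value.
std::array<uint64_t, 5> to_limbs26(uint64_t lo, uint64_t hi) {
    const uint64_t mask = (1ULL << 26) - 1;
    return {
        lo & mask,                          // bits   0..25
        (lo >> 26) & mask,                  // bits  26..51
        ((lo >> 52) | (hi << 12)) & mask,   // bits  52..77
        (hi >> 14) & mask,                  // bits  78..103
        hi >> 40,                           // bits 104..127
    };
}

// Assumed packing: concatenate the limbs back into two 64-bit words.
// Limb 2 straddles the word boundary, so it contributes to both words.
void pack26(const std::array<uint64_t, 5>& l, uint64_t& lo, uint64_t& hi) {
    lo = l[0] | (l[1] << 26) | (l[2] << 52);
    hi = (l[2] >> 12) | (l[3] << 14) | (l[4] << 40);
}

// True if the top four bits of both 32-bit subwords of w are zero.
bool top4_clear(uint64_t w) {
    return (w & 0xf0000000f0000000ULL) == 0;
}
```

Under those assumptions the packed words are bit-for-bit the clamped `r`, so `top4_clear` would hold for both words after `pack26`. Whether the packing in the patch matches this assumption is exactly what needs confirming.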
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1203838423