RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v2]
Andrew Haley
aph at openjdk.org
Sat Dec 2 21:52:39 UTC 2023
On Tue, 28 Nov 2023 13:12:32 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make good use of
>> the parallelism available.
>>
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products.
>>
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>>
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 64- or 32-bit
>> multiply-accumulate.
>>
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>>
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>>
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>>
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> use 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>>
>> The results are pretty good, ranging from a 2x to 3x speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> remove debug code
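The redundant base-2^26 representation described in the quoted text can be
illustrated with a small sketch. This is hypothetical Java written for this
note (BigInteger is used only for checking); it is not the actual intrinsic
code, which emits AArch64 instructions inside HotSpot:

```java
import java.math.BigInteger;

public class Poly1305LimbSketch {
    // Split a value below 2^130 into five 26-bit limbs (little-endian
    // digits, base 2^26), as used with a 32*32 -> 64-bit multiplier.
    static long[] toLimbs(BigInteger x) {
        long[] d = new long[5];
        BigInteger mask = BigInteger.ONE.shiftLeft(26).subtract(BigInteger.ONE);
        for (int i = 0; i < 5; i++) {
            d[i] = x.shiftRight(26 * i).and(mask).longValue();
        }
        return d;
    }

    // Reassemble the limbs; any accumulated excess in a limb simply
    // carries into the next digit's weight when summed here.
    static BigInteger fromLimbs(long[] d) {
        BigInteger x = BigInteger.ZERO;
        for (int i = 0; i < 5; i++) {
            x = x.add(BigInteger.valueOf(d[i]).shiftLeft(26 * i));
        }
        return x;
    }

    public static void main(String[] args) {
        BigInteger p = BigInteger.ONE.shiftLeft(130).subtract(BigInteger.valueOf(5));
        BigInteger x = p.subtract(BigInteger.valueOf(12345));
        // Each 26x26-bit partial product fits in 52 bits, leaving 12 bits
        // of 64-bit headroom to accumulate partial products carry-free.
        System.out.println(fromLimbs(toLimbs(x)).equals(x));
    }
}
```

The point of the headroom comment is the whole trick: sums of several 52-bit
partial products still fit in 64 bits, so carries need propagating only once
per block, not per partial product.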
Some Raspberry Pi 4 numbers, before and after. There are some regressions for
very small data sizes (below 1 kbyte), but even on a very-low-power processor
the vectorized version has a small advantage.
Before:
Benchmark                        (dataSize) (provider)  Mode  Cnt     Score    Error  Units
Poly1305DigestBench.updateBytes          64             avgt    3     0.174 ±  0.001  us/op
Poly1305DigestBench.updateBytes         256             avgt    3     0.536 ±  0.002  us/op
Poly1305DigestBench.updateBytes        1024             avgt    3     1.976 ±  0.004  us/op
Poly1305DigestBench.updateBytes       16384             avgt    3    30.820 ±  0.024  us/op
Poly1305DigestBench.updateBytes     1048576             avgt    3  2001.719 ± 49.343  us/op

After:
Benchmark                        (dataSize) (provider)  Mode  Cnt     Score    Error  Units
Poly1305DigestBench.updateBytes          64             avgt    3     0.226 ±  0.002  us/op
Poly1305DigestBench.updateBytes         256             avgt    3     0.765 ±  0.105  us/op
Poly1305DigestBench.updateBytes        1024             avgt    3     1.807 ±  0.111  us/op
Poly1305DigestBench.updateBytes       16384             avgt    3    22.679 ±  0.043  us/op
Poly1305DigestBench.updateBytes     1048576             avgt    3  1452.466 ± 14.891  us/op
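The generator-interleaving idea from the quoted description can be sketched
roughly as below. This is a hypothetical Java model written for this note:
the real generators live in HotSpot's AArch64 stub generator and emit machine
instructions, but the round-robin scheduling shape is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class InterleaveSketch {
    // A stand-in "generator": each call emits one group of (pretend)
    // instructions and reports whether it has more groups to emit.
    interface Generator {
        boolean emit(List<String> out);
    }

    // Build a generator that emits `groups` groups tagged with `unit`,
    // e.g. "core" or "asimd".
    static Generator gen(String unit, int groups) {
        int[] next = {0};  // mutable counter captured by the lambda
        return out -> {
            if (next[0] >= groups) return false;
            out.add(unit + "-group-" + next[0]++);
            return next[0] < groups;
        };
    }

    // Call the generators round-robin, so core and ASIMD instruction
    // groups end up alternating in the emitted stream and both execution
    // units are kept busy.
    static List<String> interleave(Generator... gens) {
        List<String> out = new ArrayList<>();
        boolean more = true;
        while (more) {
            more = false;
            for (Generator g : gens) {
                more |= g.emit(out);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(interleave(gen("core", 2), gen("asimd", 2)));
    }
}
```

With two generators of two groups each, the emitted order alternates
core-group-0, asimd-group-0, core-group-1, asimd-group-1; generators of
unequal length simply keep emitting after the shorter ones finish.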
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16812#issuecomment-1837262405
More information about the hotspot-dev
mailing list