RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v2]
Andrew Haley
aph at openjdk.org
Sat Dec 2 21:52:39 UTC 2023
On Tue, 28 Nov 2023 13:12:32 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make good use of
>> the parallelism available.
>>
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products.
>>
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>>
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 64- or 32-bit
>> multiply-accumulate.
>>
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>>
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>>
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>>
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> use 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>>
>> The results are pretty good, ranging from a 2x to 3x speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> remove debug code
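The redundant base-2^26 representation described in the quoted text can be
illustrated with a small sketch. This is hypothetical Java written for this
note (BigInteger is used only for checking); it is not the actual intrinsic
code, which emits AArch64 instructions inside HotSpot:

```java
import java.math.BigInteger;

public class Poly1305LimbSketch {
    // Split a value below 2^130 into five 26-bit limbs (little-endian
    // digits, base 2^26), as used with a 32*32 -> 64-bit multiplier.
    static long[] toLimbs(BigInteger x) {
        long[] d = new long[5];
        BigInteger mask = BigInteger.ONE.shiftLeft(26).subtract(BigInteger.ONE);
        for (int i = 0; i < 5; i++) {
            d[i] = x.shiftRight(26 * i).and(mask).longValue();
        }
        return d;
    }

    // Reassemble the limbs; any accumulated excess in a limb simply
    // carries into the next digit's weight when summed here.
    static BigInteger fromLimbs(long[] d) {
        BigInteger x = BigInteger.ZERO;
        for (int i = 0; i < 5; i++) {
            x = x.add(BigInteger.valueOf(d[i]).shiftLeft(26 * i));
        }
        return x;
    }

    public static void main(String[] args) {
        BigInteger p = BigInteger.ONE.shiftLeft(130).subtract(BigInteger.valueOf(5));
        BigInteger x = p.subtract(BigInteger.valueOf(12345));
        // Each 26x26-bit partial product fits in 52 bits, leaving 12 bits
        // of 64-bit headroom to accumulate partial products carry-free.
        System.out.println(fromLimbs(toLimbs(x)).equals(x));
    }
}
```

The point of the headroom comment is the whole trick: sums of several 52-bit
partial products still fit in 64 bits, so carries need propagating only once
per block, not per partial product.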
Some Raspberry Pi 4 numbers, before and after. There are some regressions for
very small data sizes (below 1 kbyte), but even on a very-low-power processor
the vectorized version has a small advantage.
Before:
Benchmark                        (dataSize) (provider)  Mode  Cnt     Score    Error  Units
Poly1305DigestBench.updateBytes          64             avgt    3     0.174 ±  0.001  us/op
Poly1305DigestBench.updateBytes         256             avgt    3     0.536 ±  0.002  us/op
Poly1305DigestBench.updateBytes        1024             avgt    3     1.976 ±  0.004  us/op
Poly1305DigestBench.updateBytes       16384             avgt    3    30.820 ±  0.024  us/op
Poly1305DigestBench.updateBytes     1048576             avgt    3  2001.719 ± 49.343  us/op

After:
Benchmark                        (dataSize) (provider)  Mode  Cnt     Score    Error  Units
Poly1305DigestBench.updateBytes          64             avgt    3     0.226 ±  0.002  us/op
Poly1305DigestBench.updateBytes         256             avgt    3     0.765 ±  0.105  us/op
Poly1305DigestBench.updateBytes        1024             avgt    3     1.807 ±  0.111  us/op
Poly1305DigestBench.updateBytes       16384             avgt    3    22.679 ±  0.043  us/op
Poly1305DigestBench.updateBytes     1048576             avgt    3  1452.466 ± 14.891  us/op
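The generator-interleaving idea from the quoted description can be sketched
roughly as below. This is a hypothetical Java model written for this note:
the real generators live in HotSpot's AArch64 stub generator and emit machine
instructions, but the round-robin scheduling shape is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class InterleaveSketch {
    // A stand-in "generator": each call emits one group of (pretend)
    // instructions and reports whether it has more groups to emit.
    interface Generator {
        boolean emit(List<String> out);
    }

    // Build a generator that emits `groups` groups tagged with `unit`,
    // e.g. "core" or "asimd".
    static Generator gen(String unit, int groups) {
        int[] next = {0};  // mutable counter captured by the lambda
        return out -> {
            if (next[0] >= groups) return false;
            out.add(unit + "-group-" + next[0]++);
            return next[0] < groups;
        };
    }

    // Call the generators round-robin, so core and ASIMD instruction
    // groups end up alternating in the emitted stream and both execution
    // units are kept busy.
    static List<String> interleave(Generator... gens) {
        List<String> out = new ArrayList<>();
        boolean more = true;
        while (more) {
            more = false;
            for (Generator g : gens) {
                more |= g.emit(out);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(interleave(gen("core", 2), gen("asimd", 2)));
    }
}
```

With two generators of two groups each, the emitted order alternates
core-group-0, asimd-group-0, core-group-1, asimd-group-1; generators of
unequal length simply keep emitting after the shorter ones finish.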
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16812#issuecomment-1837262405
More information about the hotspot-dev
mailing list