RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics
Andrew Haley
aph at openjdk.org
Tue Nov 28 10:10:26 UTC 2023
On Fri, 24 Nov 2023 17:12:25 GMT, Andrew Haley <aph at openjdk.org> wrote:
> Vectorizing Poly1305 is quite tricky. We already have a highly-
> efficient scalar Poly1305 implementation that runs on the core integer
> unit, but it's highly serialized, so it does not make good use of
> the parallelism available.
>
> The scalar implementation takes advantage of some particular features
> of the Poly1305 keys. In particular, certain bits of r, the secret
> key, are required to be 0. These make it possible to use a full
> 64-bit-wide multiply-accumulate operation without needing to process
> carries between partial products.
>
> While this works well for a serial implementation, a parallel
> implementation cannot do this because rather than multiplying by r,
> each step multiplies by some integer power of r, modulo
> 2^130-5.
>
> In order to avoid processing carries between partial products we use a
> redundant representation, in which each 130-bit integer is encoded
> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
> base 2^52, depending on whether we are using a 64- or 32-bit
> multiply-accumulate.
>
> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
> operation available to us, so we must use 32*32 -> 64-bit operations.
>
> In order to achieve maximum performance we'd like to get close to the
> processor's decode bandwidth, so that every clock cycle does something
> useful. In a typical high-end AArch64 implementation, the core integer
> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
> fast(ish) two-way 32-bit multiplier, which may be slower than the
> core integer unit's. It is not at all obvious whether it's best to use
> ASIMD or core instructions.
>
> Fortunately, if we have a wide-bandwidth instruction decode, we can do
> both at the same time, by feeding alternating instructions to the core
> and the ASIMD units. This also allows us to make good use of all of
> the available core and ASIMD registers, in parallel.
>
> To do this we use generators, which here are a kind of iterator that
> emits a group of instructions each time it is called. In this case we
> use 4 parallel generators, and by calling them alternately we interleave
> the ASIMD and the core instructions. We also take care to ensure that
> each generator finishes at about the same time, to maximize the
> distance between instructions which generate and consume data.
>
> The results are pretty good, ranging from a 2x to 3x speedup. It is
> possible that a pure in-order processor (Raspberry Pi?) might be at
> some disadvantage because more work is being done even though it is
> highly parallel, b...
Benchmarking Icestorm, the efficiency cores inside the M1, makes for interesting reading. The cores are much smaller and have far less potential parallelism, so the performance gain is far less, maxing out at about 1.6x, and even showing a very slight regression at small sizes. I think this gain is still worth having, even with the code-size expansion, but the decision is not so clear-cut.
Benchmark                        (dataSize)  Mode  Cnt    Score    Error  Units

Scalar (before):
Poly1305DigestBench.updateBytes          64  avgt   10    0.076 ±  0.001  us/op
Poly1305DigestBench.updateBytes         256  avgt   10    0.217 ±  0.002  us/op
Poly1305DigestBench.updateBytes        1024  avgt   10    0.786 ±  0.004  us/op
Poly1305DigestBench.updateBytes       16384  avgt   10   11.989 ±  0.003  us/op
Poly1305DigestBench.updateBytes     1048576  avgt   10  766.477 ±  0.888  us/op

Vectorized (after):
Poly1305DigestBench.updateBytes          64  avgt   10    0.097 ±  0.001  us/op
Poly1305DigestBench.updateBytes         256  avgt   10    0.294 ±  0.001  us/op
Poly1305DigestBench.updateBytes        1024  avgt   10    0.650 ±  0.002  us/op
Poly1305DigestBench.updateBytes       16384  avgt   10    7.754 ±  0.014  us/op
Poly1305DigestBench.updateBytes     1048576  avgt   10  485.716 ±  1.486  us/op
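As an aside, the generator-interleaving scheme described in the quoted text can be sketched very schematically in Java (hypothetical names; the real code drives the MacroAssembler, not strings). Each generator emits one group of instructions per call, and round-robin calls alternate the core and ASIMD streams:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Interleave {
    // Round-robin over several generators (modelled as iterators),
    // taking one emitted group from each in turn until all are
    // exhausted. With one core-integer stream and one ASIMD stream,
    // the output alternates between the two instruction kinds.
    public static List<String> interleave(List<Iterator<String>> gens) {
        List<String> out = new ArrayList<>();
        boolean any = true;
        while (any) {
            any = false;
            for (Iterator<String> g : gens) {
                if (g.hasNext()) { out.add(g.next()); any = true; }
            }
        }
        return out;
    }
}
```

Generators that finish early simply drop out of the rotation, which is why the real code balances them to finish at about the same time.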
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7527:
> 7525: __ bind(DONE);
> 7526: }
> 7527: __ poly1305_fully_reduce(S0, u0);
This call to `poly1305_fully_reduce` is probably unnecessary, because the caller invokes `IntegerPolynomial1305::finalCarryReduceLast`. However, this part of the contract is undocumented.
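For context, a full reduction takes an accumulator h known to be less than 2p, where p = 2^130 - 5, down to the canonical residue h mod p. A sketch of the arithmetic (not of `poly1305_fully_reduce` or `finalCarryReduceLast` themselves): adding 5 to h carries out of bit 130 exactly when h >= p, and in that case h - p equals (h + 5) mod 2^130.

```java
import java.math.BigInteger;

public class FullReduce {
    // p = 2^130 - 5, the Poly1305 prime.
    public static final BigInteger P =
        BigInteger.TWO.pow(130).subtract(BigInteger.valueOf(5));

    // Fully reduce h, assumed to satisfy 0 <= h < 2p.
    // If h + 5 carries past bit 130 then h >= p, and
    // h - p == (h + 5) - 2^130 == (h + 5) mod 2^130.
    // The real code would do this as a branch-free select on limbs.
    public static BigInteger fullyReduce(BigInteger h) {
        BigInteger g = h.add(BigInteger.valueOf(5));
        return g.testBit(130) ? g.mod(BigInteger.ONE.shiftLeft(130)) : h;
    }
}
```

If the caller is guaranteed to perform this step anyway, as the comment above suggests, the in-stub reduction is redundant work; documenting that contract would settle it.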
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16812#issuecomment-1827704006
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1406082911