RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics
Andrew Haley
aph at openjdk.org
Tue Nov 28 10:10:26 UTC 2023
On Fri, 24 Nov 2023 17:12:25 GMT, Andrew Haley <aph at openjdk.org> wrote:
> Vectorizing Poly1305 is quite tricky. We already have a highly-
> efficient scalar Poly1305 implementation that runs on the core integer
> unit, but it's highly serialized, so it does not make good use of
> the parallelism available.
>
> The scalar implementation takes advantage of some particular features
> of the Poly1305 keys. In particular, certain bits of r, the secret
> key, are required to be 0. These make it possible to use a full
> 64-bit-wide multiply-accumulate operation without needing to process
> carries between partial products.
>
> While this works well for a serial implementation, a parallel
> implementation cannot do this because rather than multiplying by r,
> each step multiplies by some integer power of r, modulo
> 2^130-5.
>
> In order to avoid processing carries between partial products we use a
> redundant representation, in which each 130-bit integer is encoded
> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
> base 2^52, depending on whether we are using a 64- or 32-bit
> multiply-accumulate.
>
> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
> operation available to us, so we must use 32*32 -> 64-bit operations.
>
> In order to achieve maximum performance we'd like to get close to the
> processor's decode bandwidth, so that every clock cycle does something
> useful. In a typical high-end AArch64 implementation, the core integer
> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
> fast(ish) two-way 32-bit multiplier, which may be slower than the
> core integer unit's. It is not at all obvious whether it's best to use
> ASIMD or core instructions.
>
> Fortunately, if we have a wide-bandwidth instruction decode, we can do
> both at the same time, by feeding alternating instructions to the core
> and the ASIMD units. This also allows us to make good use of all of
> the available core and ASIMD registers, in parallel.
>
> To do this we use generators, which here are a kind of iterator that
> emits a group of instructions each time it is called. In this case we
> use 4 parallel generators, and by calling them alternately we interleave
> the ASIMD and the core instructions. We also take care to ensure that
> each generator finishes at about the same time, to maximize the
> distance between instructions which generate and consume data.
>
> The results are pretty good, ranging from a 2x to 3x speedup. It is
> possible that a pure in-order processor (Raspberry Pi?) might be at
> some disadvantage because more work is being done even though it is
> highly parallel, b...
Benchmarking Icestorm, the efficiency cores inside the M1, makes for interesting reading. The cores are much smaller and have far less potential parallelism, so the performance gain is far less, maxing out at about 1.6x, and even showing a very slight regression at small sizes. I think this gain is still worth having, even with the code-size expansion, but the decision is not so clear-cut.
Benchmark                        (dataSize)  Mode  Cnt    Score    Error  Units

Scalar (before):
Poly1305DigestBench.updateBytes          64  avgt   10    0.076 ±  0.001  us/op
Poly1305DigestBench.updateBytes         256  avgt   10    0.217 ±  0.002  us/op
Poly1305DigestBench.updateBytes        1024  avgt   10    0.786 ±  0.004  us/op
Poly1305DigestBench.updateBytes       16384  avgt   10   11.989 ±  0.003  us/op
Poly1305DigestBench.updateBytes     1048576  avgt   10  766.477 ±  0.888  us/op

Vectorized (after):
Poly1305DigestBench.updateBytes          64  avgt   10    0.097 ±  0.001  us/op
Poly1305DigestBench.updateBytes         256  avgt   10    0.294 ±  0.001  us/op
Poly1305DigestBench.updateBytes        1024  avgt   10    0.650 ±  0.002  us/op
Poly1305DigestBench.updateBytes       16384  avgt   10    7.754 ±  0.014  us/op
Poly1305DigestBench.updateBytes     1048576  avgt   10  485.716 ±  1.486  us/op
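As an aside, the generator-interleaving scheme described in the quoted text can be sketched very schematically in Java (hypothetical names; the real code drives the MacroAssembler, not strings). Each generator emits one group of instructions per call, and round-robin calls alternate the core and ASIMD streams:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Interleave {
    // Round-robin over several generators (modelled as iterators),
    // taking one emitted group from each in turn until all are
    // exhausted. With one core-integer stream and one ASIMD stream,
    // the output alternates between the two instruction kinds.
    public static List<String> interleave(List<Iterator<String>> gens) {
        List<String> out = new ArrayList<>();
        boolean any = true;
        while (any) {
            any = false;
            for (Iterator<String> g : gens) {
                if (g.hasNext()) { out.add(g.next()); any = true; }
            }
        }
        return out;
    }
}
```

Generators that finish early simply drop out of the rotation, which is why the real code balances them to finish at about the same time.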
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7527:
> 7525: __ bind(DONE);
> 7526: }
> 7527: __ poly1305_fully_reduce(S0, u0);
This call to `poly1305_fully_reduce` is probably unnecessary, because the caller invokes `IntegerPolynomial1305::finalCarryReduceLast`. However, this part of the contract is undocumented.
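For context, a full reduction takes an accumulator h known to be less than 2p, where p = 2^130 - 5, down to the canonical residue h mod p. A sketch of the arithmetic (not of `poly1305_fully_reduce` or `finalCarryReduceLast` themselves): adding 5 to h carries out of bit 130 exactly when h >= p, and in that case h - p equals (h + 5) mod 2^130.

```java
import java.math.BigInteger;

public class FullReduce {
    // p = 2^130 - 5, the Poly1305 prime.
    public static final BigInteger P =
        BigInteger.TWO.pow(130).subtract(BigInteger.valueOf(5));

    // Fully reduce h, assumed to satisfy 0 <= h < 2p.
    // If h + 5 carries past bit 130 then h >= p, and
    // h - p == (h + 5) - 2^130 == (h + 5) mod 2^130.
    // The real code would do this as a branch-free select on limbs.
    public static BigInteger fullyReduce(BigInteger h) {
        BigInteger g = h.add(BigInteger.valueOf(5));
        return g.testBit(130) ? g.mod(BigInteger.ONE.shiftLeft(130)) : h;
    }
}
```

If the caller is guaranteed to perform this step anyway, as the comment above suggests, the in-stub reduction is redundant work; documenting that contract would settle it.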
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16812#issuecomment-1827704006
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1406082911