RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v2]
Andrew Haley
aph at openjdk.org
Tue Nov 28 13:12:32 UTC 2023
> Vectorizing Poly1305 is quite tricky. We already have a highly-
> efficient scalar Poly1305 implementation that runs on the core integer
> unit, but it's highly serialized, so it does not make good use of
> the parallelism available.
>
> The scalar implementation takes advantage of some particular features
> of the Poly1305 keys. In particular, certain bits of r, the secret
> key, are required to be 0. These make it possible to use a full
> 64-bit-wide multiply-accumulate operation without needing to process
> carries between partial products.
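[For readers unfamiliar with the key-bit property referred to above: RFC 8439 requires 22 specific bits of r, the secret multiplier half of the key, to be cleared. A minimal BigInteger sketch of the clamp (illustrative names, not code from the patch):

```java
import java.math.BigInteger;

public class Poly1305Clamp {
    // RFC 8439 clamps 22 bits of the 128-bit key half r: the top four
    // bits of bytes 3, 7, 11, 15 and the bottom two bits of bytes
    // 4, 8, 12 are forced to zero. Those guaranteed zeros are what let
    // the scalar code accumulate 64-bit partial products carry-free.
    static final BigInteger CLAMP_MASK =
        new BigInteger("0ffffffc0ffffffc0ffffffc0fffffff", 16);

    static BigInteger clamp(BigInteger r) {
        return r.and(CLAMP_MASK);
    }

    public static void main(String[] args) {
        BigInteger r = new BigInteger("ffffffffffffffffffffffffffffffff", 16);
        System.out.println(clamp(r).toString(16));
        // prints ffffffc0ffffffc0ffffffc0fffffff
    }
}
```
]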
>
> While this works well for a serial implementation, a parallel
> implementation cannot do this because rather than multiplying by r,
> each step multiplies by some integer power of r, modulo
> 2^130-5.
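[Concretely: splitting the Horner evaluation into lanes replaces multiply-by-r with multiply-by-r^2 in each lane, and the product r^2 mod 2^130-5 no longer has the convenient zero bits of the clamped r. A BigInteger sketch of the two-lane algebra (illustrative only; handles only an even number of blocks, and method names are not from the patch):

```java
import java.math.BigInteger;
import java.util.List;

public class Poly1305Lanes {
    static final BigInteger P =
        BigInteger.TWO.pow(130).subtract(BigInteger.valueOf(5));

    // Serial Horner form: h = ((h + m_i) * r) mod p, one block at a time.
    static BigInteger serial(List<BigInteger> blocks, BigInteger r) {
        BigInteger h = BigInteger.ZERO;
        for (BigInteger m : blocks)
            h = h.add(m).multiply(r).mod(P);
        return h;
    }

    // Two independent lanes, each stepping by r^2; the lanes only meet
    // in the final fold, so their multiplies can proceed in parallel.
    // Assumes blocks.size() is even.
    static BigInteger twoLane(List<BigInteger> blocks, BigInteger r) {
        BigInteger r2 = r.multiply(r).mod(P);
        BigInteger h0 = BigInteger.ZERO, h1 = BigInteger.ZERO;
        for (int i = 0; i < blocks.size(); i += 2) {
            h0 = h0.multiply(r2).add(blocks.get(i)).mod(P);
            h1 = h1.multiply(r2).add(blocks.get(i + 1)).mod(P);
        }
        // Lane 0 holds the even-indexed (older) blocks, so it carries one
        // more factor of r than lane 1 when the lanes are folded together.
        return h0.multiply(r2).add(h1.multiply(r)).mod(P);
    }
}
```
]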
>
> In order to avoid processing carries between partial products we use a
> redundant representation, in which each 130-bit integer is encoded
> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
> base 2^52, depending on whether we are using a 64- or 32-bit
> multiply-accumulate.
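[The base-2^26 split can be sketched with BigInteger; helper names are purely illustrative, since the patch works on registers rather than arrays:

```java
import java.math.BigInteger;

public class Poly1305Limbs {
    static final BigInteger MASK26 = BigInteger.valueOf((1L << 26) - 1);

    // Encode a value as five base-2^26 digits (5 * 26 = 130 bits).
    // Each digit occupies only the low 26 bits of a wider lane,
    // leaving headroom so sums of partial products do not carry
    // between digits.
    static long[] toLimbs26(BigInteger x) {
        long[] d = new long[5];
        for (int i = 0; i < 5; i++) {
            d[i] = x.and(MASK26).longValue();
            x = x.shiftRight(26);
        }
        return d;
    }

    static BigInteger fromLimbs26(long[] d) {
        BigInteger x = BigInteger.ZERO;
        for (int i = 4; i >= 0; i--)
            x = x.shiftLeft(26).add(BigInteger.valueOf(d[i]));
        return x;
    }
}
```
]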
>
> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
> operation available to us, so we must use 32*32 -> 64-bit operations.
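[To illustrate why the 32*32 -> 64-bit form works: with 26-bit digits every partial product fits comfortably in a 64-bit accumulator, and the identity 2^130 = 5 (mod 2^130-5) folds the high-order digit products back into the low ones. A hedged sketch of the limb arithmetic, not the stub's register allocation:

```java
public class Poly1305Mul26 {
    static final long MASK26 = (1L << 26) - 1;

    // Schoolbook product of two 5-digit base-2^26 values, reduced
    // modulo 2^130-5. Each product of 26-bit digits is at most ~2^55,
    // so the five-term sums below fit in 64 bits with no carry
    // handling between partial products.
    static long[] mulMod(long[] a, long[] b) {
        // 2^130 = 5 (mod p), so terms of weight >= 2^130 wrap, times 5.
        long s1 = 5 * b[1], s2 = 5 * b[2], s3 = 5 * b[3], s4 = 5 * b[4];
        long[] d = new long[5];
        d[0] = a[0]*b[0] + a[1]*s4   + a[2]*s3   + a[3]*s2   + a[4]*s1;
        d[1] = a[0]*b[1] + a[1]*b[0] + a[2]*s4   + a[3]*s3   + a[4]*s2;
        d[2] = a[0]*b[2] + a[1]*b[1] + a[2]*b[0] + a[3]*s4   + a[4]*s3;
        d[3] = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0] + a[4]*s4;
        d[4] = a[0]*b[4] + a[1]*b[3] + a[2]*b[2] + a[3]*b[1] + a[4]*b[0];
        // One pass of carry propagation; the top carry wraps around x5.
        // The result stays in redundant (not fully reduced) form.
        long carry = 0;
        for (int i = 0; i < 5; i++) {
            d[i] += carry;
            carry = d[i] >>> 26;
            d[i] &= MASK26;
        }
        d[0] += 5 * carry;
        return d;
    }
}
```
]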
>
> In order to achieve maximum performance we'd like to get close to the
> processor's decode bandwidth, so that every clock cycle does something
> useful. In a typical high-end AArch64 implementation, the core integer
> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
> fast(ish) two-way 32-bit multiplier, which may be slower than the
> core integer unit's. It is not at all obvious whether it's best to use
> ASIMD or core instructions.
>
> Fortunately, if we have a wide-bandwidth instruction decode, we can do
> both at the same time, by feeding alternating instructions to the core
> and the ASIMD units. This also allows us to make good use of all of
> the available core and ASIMD registers, in parallel.
>
> To do this we use generators, which here are a kind of iterator that
> emits a group of instructions each time it is called. In this case we
> use four parallel generators, and by calling them alternately we interleave
> the ASIMD and the core instructions. We also take care to ensure that
> each generator finishes at about the same time, to maximize the
> distance between instructions which generate and consume data.
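[The interleaving scheme described above can be sketched as a round-robin drain over several iterators; here strings stand in for emitted instructions, and all names are purely illustrative rather than taken from the stub generator:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InterleaveSketch {
    // Round-robin over several "generators" (iterators of emitted
    // instructions), taking one item from each live generator per
    // pass until all are exhausted. The real stub generator
    // interleaves ASIMD and core-integer instruction streams this way.
    static List<String> interleave(List<Iterator<String>> gens) {
        List<String> out = new ArrayList<>();
        boolean emitted = true;
        while (emitted) {
            emitted = false;
            for (Iterator<String> g : gens) {
                if (g.hasNext()) {
                    out.add(g.next());
                    emitted = true;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> core = List.of("mul", "umulh", "adds");
        List<String> simd = List.of("umull", "umlal", "umlal2");
        System.out.println(interleave(List.of(core.iterator(), simd.iterator())));
        // prints [mul, umull, umulh, umlal, adds, umlal2]
    }
}
```
]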
>
> The results are pretty good, ranging from a 2x to 3x speedup. It is
> possible that a pure in-order processor (Raspberry Pi?) might be at
> some disadvantage because more work is being done even though it is
> highly parallel, b...
Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
remove debug code
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/16812/files
- new: https://git.openjdk.org/jdk/pull/16812/files/37f46caa..f2e9c8e5
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=16812&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=16812&range=00-01
Stats: 80 lines in 1 file changed: 0 ins; 80 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/16812.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16812/head:pull/16812
PR: https://git.openjdk.org/jdk/pull/16812
More information about the hotspot-dev
mailing list