RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v5]
Andrew Dinn
adinn at openjdk.org
Tue Jan 9 14:22:32 UTC 2024
On Mon, 4 Dec 2023 17:33:00 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make good use of
>> the parallelism available.
>>
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products.
>>
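[For reference, the key property being relied on here is Poly1305's standard "clamping" of r, as specified in RFC 8439. A minimal sketch, with the 128-bit key held as two 64-bit halves:

```cpp
#include <cassert>
#include <cstdint>

// Poly1305 key clamping (RFC 8439): the top 4 bits of each 32-bit
// lane of r, and the low 2 bits of the upper three lanes, are forced
// to zero. These guaranteed-zero bits are what let the scalar code
// sum 64-bit partial products without propagating carries.
static void poly1305_clamp(uint64_t &r0, uint64_t &r1) {
    r0 &= 0x0ffffffc0fffffffULL;  // low half: lanes 0 and 1
    r1 &= 0x0ffffffc0ffffffcULL;  // high half: lanes 2 and 3
}
```
]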
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>>
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 32- or 64-bit
>> multiply-accumulate.
>>
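[To make the base-2^26 representation concrete, here is a sketch (not the PR's code) of splitting a value into 26-bit limbs, shown for a 128-bit input held in two 64-bit words; the real 130-bit accumulator carries two extra bits in the top limb:

```cpp
#include <cassert>
#include <cstdint>

// Split a 128-bit value (lo, hi) into 5 limbs of 26 bits each.
// Every limb sits in a 32-bit lane with headroom above bit 26, so
// 32x32->64 multiply-accumulates can sum several partial products
// before any carry propagation is needed.
static void split_base26(uint64_t lo, uint64_t hi, uint32_t limb[5]) {
    limb[0] = (uint32_t)( lo         & 0x3ffffff);            // bits 0..25
    limb[1] = (uint32_t)((lo >> 26)  & 0x3ffffff);            // bits 26..51
    limb[2] = (uint32_t)(((lo >> 52) | (hi << 12)) & 0x3ffffff); // bits 52..77
    limb[3] = (uint32_t)((hi >> 14)  & 0x3ffffff);            // bits 78..103
    limb[4] = (uint32_t)( hi >> 40);                          // bits 104..127
}
```
]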
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>>
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>>
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>>
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> use 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>>
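[The generator scheme can be sketched in a few lines (hypothetical names, not the actual MacroAssembler API): each generator is a list of closures that each emit one instruction group, and calling the generators round-robin interleaves the streams:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A generator is a sequence of closures; each closure emits one group
// of instructions (here modelled as strings) when invoked.
using Generator = std::vector<std::function<void(std::vector<std::string>&)>>;

// Emit step i of every generator, then step i+1, and so on, until all
// generators are exhausted. Adjacent emitted instructions then come
// from independent streams and can issue to different execution units.
static void interleave(std::vector<Generator> &gens,
                       std::vector<std::string> &out) {
    for (size_t i = 0, more = 1; more; i++) {
        more = 0;
        for (Generator &g : gens)
            if (i < g.size()) { g[i](out); more = 1; }
    }
}
```

With a core-integer generator and an ASIMD generator of similar lengths, the emitted code alternates between the two units; balancing the generators' lengths maximizes the distance between an instruction that produces a value and the one that consumes it.]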
>> The results are pretty good, ranging from a 2x to 3x speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with two additional commits since the last revision:
>
> - Whitespace
> - Whitespace
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 66:
> 64: }
> 65:
> 66: template<typename T>
This probably also needs a comment.
/*
* Appends a closure to the generator that can be executed
* later to append one or more instructions to the target code
* buffer. Normally each closure generates a single instruction,
* allowing multiple generators that interleave parallel instruction
* streams to obtain the maximum opportunities for pipeline
* parallelism. In cases where an individual instruction stream
* uses a shared scratch register to hold a temporary that will be
* consumed by a later instruction, the closure must generate the
* full instruction sequence between the writer and last reader of
* the temporary as a block, ensuring that instructions from
* parallel streams which write or read the same scratch register
* cannot interfere with each other.
*/
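A toy illustration of the last point (invented names, not the PR's code): the closure covers the scratch register's whole live range, so the interleaver has no point at which it could inject another stream's use of the same register:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One step of a generator: a closure that appends instructions
// (modelled as strings) to the emitted code.
using Step = std::function<void(std::vector<std::string>&)>;

// One closure = one indivisible block: scratch register t0 is written
// and consumed with no interleaving point in between. Splitting the
// write and the read across two closures would let a parallel stream's
// use of t0 be scheduled between them, clobbering the temporary.
static Step mul_acc_block(const std::string &a, const std::string &b) {
    return [=](std::vector<std::string> &code) {
        code.push_back("umulh t0, " + a + ", " + b);  // write scratch t0
        code.push_back("adds  acc, acc, t0");         // last read of t0
    };
}
```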
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1446147364
More information about the hotspot-dev
mailing list