RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v5]
Andrew Dinn
adinn at openjdk.org
Tue Jan 9 14:22:32 UTC 2024
On Mon, 4 Dec 2023 17:33:00 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make good use of
>> the parallelism available.
>>
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products.
>>
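[For reference, the key property being relied on here is Poly1305's standard "clamping" of r, as specified in RFC 8439. A minimal sketch, with the 128-bit key held as two 64-bit halves:

```cpp
#include <cassert>
#include <cstdint>

// Poly1305 key clamping (RFC 8439): the top 4 bits of each 32-bit
// lane of r, and the low 2 bits of the upper three lanes, are forced
// to zero. These guaranteed-zero bits are what let the scalar code
// sum 64-bit partial products without propagating carries.
static void poly1305_clamp(uint64_t &r0, uint64_t &r1) {
    r0 &= 0x0ffffffc0fffffffULL;  // low half: lanes 0 and 1
    r1 &= 0x0ffffffc0ffffffcULL;  // high half: lanes 2 and 3
}
```
]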
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>>
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 32- or 64-bit
>> multiply-accumulate.
>>
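[To make the base-2^26 representation concrete, here is a sketch (not the PR's code) of splitting a value into 26-bit limbs, shown for a 128-bit input held in two 64-bit words; the real 130-bit accumulator carries two extra bits in the top limb:

```cpp
#include <cassert>
#include <cstdint>

// Split a 128-bit value (lo, hi) into 5 limbs of 26 bits each.
// Every limb sits in a 32-bit lane with headroom above bit 26, so
// 32x32->64 multiply-accumulates can sum several partial products
// before any carry propagation is needed.
static void split_base26(uint64_t lo, uint64_t hi, uint32_t limb[5]) {
    limb[0] = (uint32_t)( lo         & 0x3ffffff);            // bits 0..25
    limb[1] = (uint32_t)((lo >> 26)  & 0x3ffffff);            // bits 26..51
    limb[2] = (uint32_t)(((lo >> 52) | (hi << 12)) & 0x3ffffff); // bits 52..77
    limb[3] = (uint32_t)((hi >> 14)  & 0x3ffffff);            // bits 78..103
    limb[4] = (uint32_t)( hi >> 40);                          // bits 104..127
}
```
]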
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>>
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>>
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>>
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> use 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>>
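[The generator scheme can be sketched in a few lines (hypothetical names, not the actual MacroAssembler API): each generator is a list of closures that each emit one instruction group, and calling the generators round-robin interleaves the streams:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A generator is a sequence of closures; each closure emits one group
// of instructions (here modelled as strings) when invoked.
using Generator = std::vector<std::function<void(std::vector<std::string>&)>>;

// Emit step i of every generator, then step i+1, and so on, until all
// generators are exhausted. Adjacent emitted instructions then come
// from independent streams and can issue to different execution units.
static void interleave(std::vector<Generator> &gens,
                       std::vector<std::string> &out) {
    for (size_t i = 0, more = 1; more; i++) {
        more = 0;
        for (Generator &g : gens)
            if (i < g.size()) { g[i](out); more = 1; }
    }
}
```

With a core-integer generator and an ASIMD generator of similar lengths, the emitted code alternates between the two units; balancing the generators' lengths maximizes the distance between an instruction that produces a value and the one that consumes it.]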
>> The results are pretty good, ranging from a 2x to 3x speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with two additional commits since the last revision:
>
> - Whitespace
> - Whitespace
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 66:
> 64: }
> 65:
> 66: template<typename T>
This probably also needs a comment.
/*
* Appends a closure to the generator that can be executed
* later to append one or more instructions to the target code
* buffer. Normally each closure generates a single instruction,
* allowing multiple generators that interleave parallel instruction
* streams to obtain the maximum opportunities for pipeline
* parallelism. In cases where an individual instruction stream
* uses a shared scratch register to hold a temporary that will be
* consumed by a later instruction, the closure must generate the
* full instruction sequence between the writer and last reader of
* the temporary as a block, ensuring that instructions from
* parallel streams which write or read the same scratch register
* cannot interfere with each other.
*/
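A toy illustration of the last point (invented names, not the PR's code): the closure covers the scratch register's whole live range, so the interleaver has no point at which it could inject another stream's use of the same register:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One step of a generator: a closure that appends instructions
// (modelled as strings) to the emitted code.
using Step = std::function<void(std::vector<std::string>&)>;

// One closure = one indivisible block: scratch register t0 is written
// and consumed with no interleaving point in between. Splitting the
// write and the read across two closures would let a parallel stream's
// use of t0 be scheduled between them, clobbering the temporary.
static Step mul_acc_block(const std::string &a, const std::string &b) {
    return [=](std::vector<std::string> &code) {
        code.push_back("umulh t0, " + a + ", " + b);  // write scratch t0
        code.push_back("adds  acc, acc, t0");         // last read of t0
    };
}
```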
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1446147364
More information about the hotspot-dev
mailing list