RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v2]

Tue Jan 9 12:12:00 UTC 2024

On Tue, 28 Nov 2023 13:12:32 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make make good use of
>> the parallelism available.
>> 
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products,
>> 
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>> 
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 64- or 32-bit
>> multiply-accumulate.
>> 
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>> 
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>> 
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>> 
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>> 
>> The results are pretty good, ranging from 2* - 3* speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   remove debug code

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 105:

> 103:     }
> 104:   };
> 105: 

Comment

    /*
     * Return an iterator which allows each of the generator blocks (lambdas)
     * appended to the generator to be individually invoked. Clients which
     * employ multiple generators can use this method to interleave
     * instructions belonging to independent instruction streams in the
     * target code buffer.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 109:

> 107:     return Iterator(this);
> 108:   }
> 109: 

Comment

    /*
     * Invoke all the generator blocks (lambdas) previously appended to
     * the generator in order of append, inserting the associated instructions
     * into the target code buffer.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 118:

> 116: 
> 117: class OopMap;
> 118: 

Comment

    /*
     * A RegPair is used to identify a pair of registers which hold the
     * lower and upper halves of a 128 bit value.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1712:

> 1710:   void lightweight_unlock(Register obj, Register hdr, Register t1, Register t2, Label& slow);
> 1711: 
> 1712:   // Poly1305

I think we need a general comment identifying the various different 130-bit data representations that are employed. The terms introduced here can be used later to describe arguments and intermediary data values employed in the implementation.

    /*
     * Poly1305
     * 
     * The Poly1305 implementation operates on 130-bit integer values.
     * These can be encoded in memory or in a sequence of registers
     * using several different representations:
     *
     * Memory layouts (single 130-bit word)
     *
     * mem_d5_26: In this representation 130 bits is stored as five
     * 26-bit 'limbs' in 5 successive memory DWORDs (type long[]).
     *
     * This can be considered to be a base 2^26 representation of
     * the 130 bit value with 5 digits.
     *
     * General Purpose Register Data Representations
     *
     * gpr_d5_26: This is an equivalent representation to memd_d5_26
     * with the minor variation that the data is stored in five
     * registers, R0, ... R4. The register sequence may be presented as
     * five explicit arguments or as a register array (Register[]) of
     * length five.
     *
     * As with the memory representation this can be considered to be a
     * base 2^26 representation of the 130 bit value with 5 digits.
     * However, during product cross-multiplication the registers R0,
     * ... R4 may transiently store a value large than 26 bits which
     * represents a 26-bit 'digit' in the low 26-bits and a 'carry' in
     * the higher 30 bits. Carry bits are eventually propagated up to
     * higher registers ('digits)) or, when performing modulo 2^130-5
     * reduction, back into to lower registers.
     *
     * gpr_d3_52: In this representation 130 bits is stored as 3 52-bit
     * 'limbs' in 3 registers, R0, ... R2. The register sequence may be
     * presented as 3 explicit arguments, or as a register array (type
     * Register[]) of length 3.
     *
     * This can be considered to be a base 2^52 representation of the
     * 130 bit value with 3 digits. Note that R2 normally only contains
     * a 26 bit 'digit'.
     *
     * gpr_2d3_52: In this representation 130 bits is stored as 3 52-bit
     * 'limbs' in the low elements of 3 register pairs (type RegPair,
     * field _lo). The register pairs are presented as an array (type
     * RegPair[3]), possibly embedded in a wrapper management class
     * (type RegPairs).
     *
     * This can be considered to be a base 2^52 representation of the
     * 130 bit value with the 3 digits stored in successive low
     * registers. However, during product cross-multiplication the
     * combined 128 bit derived from concatenating the lower and upper
     * registers may transiently store a 108 bit value with a 52-bit
     * 'digit' in the low 52-bits and a 56 bit 'carry' in the high 12
     * bits of the lower register and low 44 bits of the upper
     * register. Carry bits are eventually propagated through to higher
     * registers or, in the case of modulo 2^130-5 reduction, back into
     * to lower registers.
     *
     * Vector Register Single Value Representations
     *
     * vec_d5_26: This is an equivalent representation to memd_d5_26
     * with the minor variation that each 26 bit limb is stored in the
     * lower DWORDs of five vector registers, V0.d[0], ..., V4.d[0]. The
     * register sequence may be presented as five explicit arguments or
     * as a vector register array (type FloatRegister[]) of length five.
     *
     * As with the memory representation this can be considered to be a
     * base 2^26 representation of the 130 bit value with 5 digits.
     * However, during product cross-multiplication individual DWORDS
     * may transiently store a value larger than 26 bits which comprises
     * a 26-bit 'digit' in the low 26-bits and a 'carry' in the higher
     * 30 bits.
     *
     * vec_2s3_26: In this representation the five 26-bit limbs of a
     * 130-bit value are packed into the lower SWORD pairs of three
     * vector registers, V0.s[0], V0.s[1], V1.s[0], V1.s[1], V2.s[0].
     *
     * This can also be considered to be a base 2^26 representation of
     * the 130 bit value with 5 digits.
     *
     * vec4s2_26: In this representation the five 26-bit limbs of a
     * 130-bit value are packed into five SWORDs of two vector
     * registers, V0.s[0], V0.s[1], V0.s[1], V0.s[2], V1.s[0].
     *
     * This can also be considered to be a base 2^26 representation of
     * the 130 bit value with 5 digits.
     *
     * Vector Register Dual Value Representations
     *
     * Vector register algorithms frequently benefit from use of
     * instructions that employ 2 lane vector parallelism. This requires
     * input, intermediate or output data in the following formats that
     * double up elements defined in the single word value vector
     * register representations.
     *
     * vec_2d5_26 In this representation five limbs of a 130-bit value
     * are stored in the lower DWORDs of five vector registers, V0.d[0],
     * ..., V4.d[0] as per the vec_d5_26 representation. Five limbs of a
     * second 130-bit value are also stored in the upper DWORDs of the
     * same five vector registers, V0.d[1], ..., V4.d[1].
     *
     * The register sequence may be presented as five explicit arguments
     * or as a vector register array (type FloatRegister[]) of length
     * five.
     *
     * vec_4s3_26: In this representation the five 26-bit limbs of a
     * 130-bit value are packed into the lower SWORD pairs of three
     * vector registers, V0.s[0], V0.s[1], V1.s[0], V1.s[1],
     * V2.s[0]. Five limbs of a second 130-bit value are also stored in
     * the upper SWORD pairs of the same three vector registers,
     * V0.s[2], V0.s[3], V1.s[2], V1.s[3], V2.s[2].
     *
     * The register sequence may be presented as three explicit
     * arguments or as a vector register array (type FloatRegister[]) of
     * length three.
     *
     * vec_4s3_26_I: In this variation on vec_4s3_26 the 5 26-bit limbs
     * of a 130-bit value are packed into the even SWORD pairs of 3
     * vector registers, V0.s[0], V0.s[2], V1.s[0], V1.s[2], V2.s[0]. 5
     * limbs of a second 130-bit value are stored in the odd SWORD pairs
     * of the same 3 vector registers, V0.s[1], V0.s[3], V1.s[1],
     * V1.s[3], V2.s[1].
     *
     * The register sequence may be presented as three explicit
     * arguments or as a vector register array (type FloatRegister[]) of
     * length three.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1713:

> 1711: 
> 1712:   // Poly1305
> 1713: 

Comment

    /*
     * Loads five 26-bit limbs located at the address in src in
     * mem_d5_26 format into registers dest0, ... dest3 in gpr_d3_52
     * format.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1720:

> 1718:   void shifted_add128(const RegPair d, const RegPair s, unsigned int shift,
> 1719:                       Register scratch = rscratch1);
> 1720:   // Widening multiply s * r -> u

Comment

    /*
     * Appends instructions to the generator which implement a
     * widening 130-bit multiply u <-- s * r mod 2^130-5
     * 
     * The three elements of s encode a 130-bit value in gpr_d3_52 format.
     *
     * The three elements of r encode a 130-bit key in gpr_d3_52 format.
     *
     * The three elements of s are cross-multiplied with the three
     * elements of r and the results are accumulated as 128 bit
     * products in the three register pairs u in gpr_2d3_52 format.
     *
     * RR2, which must be pre-initialized to 5*(r[2] << 26), is used in
     * place of r for some of the cross-multiplies in order to enable
     * correct mod-130 reduction.
     *
     * scratch is a temporary register which is overwritten during
     * computation  of the product.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1724:

> 1722:                          const RegPair u[], const Register s[], const Register r[],
> 1723:                          Register RR2, RegSetIterator<Register> scratch);
> 1724:   // Multiply mod 2**130-5

Comment

    /*
     * Appends instructions to the current code buffer which implement a
     * 130-bit widening multiply u <-- (s * r) mod 2^130-5 by calling
     *
     *   poly1305_multiply(acc, u, r, s, RR2, scratch)
     *   acc.gen();
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1728:

> 1726:                                const RegPair u[], const Register s[],
> 1727:                                const Register r[],
> 1728:                                Register RR2, RegSetIterator<Register> scratch);

Comment

    /*
     * Appends instructions to the current code buffer which implement a
     * 130-bit multiply (s * r) mod 130 -> u by calling
     *   poly1305_multiply(acc, u, r, s, RR2, scratch)
     *   acc.gen();
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1736:

> 1734:     acc.gen();
> 1735:   }
> 1736: 

Comment

    /*
     * Appends instructions to the generator which implement a vector
     * parallel 2-way SIMD widening 130-bit multiply u <-- s * r mod
     * 2^130-5.
     * 
     * The three elements of s encode two 130-bit values in
     * vec_4s3_26_I format.
     *
     * The two elements of r encode a 130-bit key in vec_4d_26 format.
     *
     * The two elements of rr must be pre-initialized by the caller in
     * vec_4d_26 format format to (5 * r).
     *
     * Five SWORD elements in the even lanes of s are paired with five
     * corresponding SWORD elements in the odd lanes of s and
     * cross-multiplied in parallel with each of the five SWORD
     * elements of r. The pairs of DWORDs which result are
     * accumulated, respectively, in the lower and upper DWORD lanes
     * of u in vec_2d5_26 format.
     *
     * The SWORD elements of rr, which must be initialised to 5 * r,
     * are used in place of r for some of the cross-multiplies in
     * order to ensure correct reduction modulo 2^130 - 5.
     */

Also, change m[] to s[] in the method declaration.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1739:

> 1737:   void poly1305_multiply_vec(AsmGenerator &acc,
> 1738:                              const FloatRegister u[], const FloatRegister m[],
> 1739:                              const FloatRegister r[], const FloatRegister rr[]);

Comment

    /*
     * Appends instructions to the generator which implement a widening
     * 130-bit multiply u <-- s * r mod 2^130-5 by calling
     *   poly1305_multiply(acc, u, r, s, RR2, scratch)
     *   acc.gen();
     */

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409606380
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409606444
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409613796
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410549423
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410536242
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410583483
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410712922
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1412366611
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1412382710
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1412370579