RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v2]
Andrew Dinn
adinn at openjdk.org
Tue Jan 9 12:12:00 UTC 2024
On Tue, 28 Nov 2023 13:12:32 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make make good use of
>> the parallelism available.
>>
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products,
>>
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>>
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 64- or 32-bit
>> multiply-accumulate.
>>
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>>
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>>
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>>
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>>
>> The results are pretty good, ranging from 2* - 3* speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> remove debug code
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 105:
> 103: }
> 104: };
> 105:
Comment
/*
* Return an iterator which allows each of the generator blocks (lambdas)
* appended to the generator to be individually invoked. Clients which
* employ multiple generators can use this method to interleave
* instructions belonging to independent instruction streams in the
* target code buffer.
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 109:
> 107: return Iterator(this);
> 108: }
> 109:
Comment
/*
* Invoke all the generator blocks (lambdas) previously appended to
* the generator in order of append, inserting the associated instructions
* into the target code buffer.
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 118:
> 116:
> 117: class OopMap;
> 118:
Comment
/*
* A RegPair is used to identify a pair of registers which hold the
* lower and upper halves of a 128 bit value.
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1712:
> 1710: void lightweight_unlock(Register obj, Register hdr, Register t1, Register t2, Label& slow);
> 1711:
> 1712: // Poly1305
I think we need a general comment identifying the various different 130-bit data representations that are employed. The terms introduced here can be used later to describe arguments and intermediary data values employed in the implementation.
/*
* Poly1305
*
* The Poly1305 implementation operates on 130-bit integer values.
* These can be encoded in memory or in a sequence of registers
* using several different representations:
*
* Memory layouts (single 130-bit word)
*
* mem_d5_26: In this representation 130 bits is stored as five
* 26-bit 'limbs' in 5 successive memory DWORDs (type long[]).
*
* This can be considered to be a base 2^26 representation of
* the 130 bit value with 5 digits.
*
* General Purpose Register Data Representations
*
* gpr_d5_26: This is an equivalent representation to memd_d5_26
* with the minor variation that the data is stored in five
* registers, R0, ... R4. The register sequence may be presented as
* five explicit arguments or as a register array (Register[]) of
* length five.
*
* As with the memory representation this can be considered to be a
* base 2^26 representation of the 130 bit value with 5 digits.
* However, during product cross-multiplication the registers R0,
* ... R4 may transiently store a value large than 26 bits which
* represents a 26-bit 'digit' in the low 26-bits and a 'carry' in
* the higher 30 bits. Carry bits are eventually propagated up to
* higher registers ('digits)) or, when performing modulo 2^130-5
* reduction, back into to lower registers.
*
* gpr_d3_52: In this representation 130 bits is stored as 3 52-bit
* 'limbs' in 3 registers, R0, ... R2. The register sequence may be
* presented as 3 explicit arguments, or as a register array (type
* Register[]) of length 3.
*
* This can be considered to be a base 2^52 representation of the
* 130 bit value with 3 digits. Note that R2 normally only contains
* a 26 bit 'digit'.
*
* gpr_2d3_52: In this representation 130 bits is stored as 3 52-bit
* 'limbs' in the low elements of 3 register pairs (type RegPair,
* field _lo). The register pairs are presented as an array (type
* RegPair[3]), possibly embedded in a wrapper management class
* (type RegPairs).
*
* This can be considered to be a base 2^52 representation of the
* 130 bit value with the 3 digits stored in successive low
* registers. However, during product cross-multiplication the
* combined 128 bit derived from concatenating the lower and upper
* registers may transiently store a 108 bit value with a 52-bit
* 'digit' in the low 52-bits and a 56 bit 'carry' in the high 12
* bits of the lower register and low 44 bits of the upper
* register. Carry bits are eventually propagated through to higher
* registers or, in the case of modulo 2^130-5 reduction, back into
* to lower registers.
*
* Vector Register Single Value Representations
*
* vec_d5_26: This is an equivalent representation to memd_d5_26
* with the minor variation that each 26 bit limb is stored in the
* lower DWORDs of five vector registers, V0.d[0], ..., V4.d[0]. The
* register sequence may be presented as five explicit arguments or
* as a vector register array (type FloatRegister[]) of length five.
*
* As with the memory representation this can be considered to be a
* base 2^26 representation of the 130 bit value with 5 digits.
* However, during product cross-multiplication individual DWORDS
* may transiently store a value larger than 26 bits which comprises
* a 26-bit 'digit' in the low 26-bits and a 'carry' in the higher
* 30 bits.
*
* vec_2s3_26: In this representation the five 26-bit limbs of a
* 130-bit value are packed into the lower SWORD pairs of three
* vector registers, V0.s[0], V0.s[1], V1.s[0], V1.s[1], V2.s[0].
*
* This can also be considered to be a base 2^26 representation of
* the 130 bit value with 5 digits.
*
* vec4s2_26: In this representation the five 26-bit limbs of a
* 130-bit value are packed into five SWORDs of two vector
* registers, V0.s[0], V0.s[1], V0.s[1], V0.s[2], V1.s[0].
*
* This can also be considered to be a base 2^26 representation of
* the 130 bit value with 5 digits.
*
* Vector Register Dual Value Representations
*
* Vector register algorithms frequently benefit from use of
* instructions that employ 2 lane vector parallelism. This requires
* input, intermediate or output data in the following formats that
* double up elements defined in the single word value vector
* register representations.
*
* vec_2d5_26 In this representation five limbs of a 130-bit value
* are stored in the lower DWORDs of five vector registers, V0.d[0],
* ..., V4.d[0] as per the vec_d5_26 representation. Five limbs of a
* second 130-bit value are also stored in the upper DWORDs of the
* same five vector registers, V0.d[1], ..., V4.d[1].
*
* The register sequence may be presented as five explicit arguments
* or as a vector register array (type FloatRegister[]) of length
* five.
*
* vec_4s3_26: In this representation the five 26-bit limbs of a
* 130-bit value are packed into the lower SWORD pairs of three
* vector registers, V0.s[0], V0.s[1], V1.s[0], V1.s[1],
* V2.s[0]. Five limbs of a second 130-bit value are also stored in
* the upper SWORD pairs of the same three vector registers,
* V0.s[2], V0.s[3], V1.s[2], V1.s[3], V2.s[2].
*
* The register sequence may be presented as three explicit
* arguments or as a vector register array (type FloatRegister[]) of
* length three.
*
* vec_4s3_26_I: In this variation on vec_4s3_26 the 5 26-bit limbs
* of a 130-bit value are packed into the even SWORD pairs of 3
* vector registers, V0.s[0], V0.s[2], V1.s[0], V1.s[2], V2.s[0]. 5
* limbs of a second 130-bit value are stored in the odd SWORD pairs
* of the same 3 vector registers, V0.s[1], V0.s[3], V1.s[1],
* V1.s[3], V2.s[1].
*
* The register sequence may be presented as three explicit
* arguments or as a vector register array (type FloatRegister[]) of
* length three.
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1713:
> 1711:
> 1712: // Poly1305
> 1713:
Comment
/*
* Loads five 26-bit limbs located at the address in src in
* mem_d5_26 format into registers dest0, ... dest3 in gpr_d3_52
* format.
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1720:
> 1718: void shifted_add128(const RegPair d, const RegPair s, unsigned int shift,
> 1719: Register scratch = rscratch1);
> 1720: // Widening multiply s * r -> u
Comment
/*
* Appends instructions to the generator which implement a
* widening 130-bit multiply u <-- s * r mod 2^130-5
*
* The three elements of s encode a 130-bit value in gpr_d3_52 format.
*
* The three elements of r encode a 130-bit key in gpr_d3_52 format.
*
* The three elements of s are cross-multiplied with the three
* elements of r and the results are accumulated as 128 bit
* products in the three register pairs u in gpr_2d3_52 format.
*
* RR2, which must be pre-initialized to 5*(r[2] << 26), is used in
* place of r for some of the cross-multiplies in order to enable
* correct mod-130 reduction.
*
* scratch is a temporary register which is overwritten during
* computation of the product.
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1724:
> 1722: const RegPair u[], const Register s[], const Register r[],
> 1723: Register RR2, RegSetIterator<Register> scratch);
> 1724: // Multiply mod 2**130-5
Comment
/*
* Appends instructions to the current code buffer which implement a
* 130-bit widening multiply u <-- (s * r) mod 2^130-5 by calling
*
* poly1305_multiply(acc, u, r, s, RR2, scratch)
* acc.gen();
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1728:
> 1726: const RegPair u[], const Register s[],
> 1727: const Register r[],
> 1728: Register RR2, RegSetIterator<Register> scratch);
Comment
/*
* Appends instructions to the current code buffer which implement a
* 130-bit multiply (s * r) mod 130 -> u by calling
* poly1305_multiply(acc, u, r, s, RR2, scratch)
* acc.gen();
*/
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1736:
> 1734: acc.gen();
> 1735: }
> 1736:
Comment
/*
* Appends instructions to the generator which implement a vector
* parallel 2-way SIMD widening 130-bit multiply u <-- s * r mod
* 2^130-5.
*
* The three elements of s encode two 130-bit values in
* vec_4s3_26_I format.
*
* The two elements of r encode a 130-bit key in vec_4d_26 format.
*
* The two elements of rr must be pre-initialized by the caller in
* vec_4d_26 format format to (5 * r).
*
* Five SWORD elements in the even lanes of s are paired with five
* corresponding SWORD elements in the odd lanes of s and
* cross-multiplied in parallel with each of the five SWORD
* elements of r. The pairs of DWORDs which result are
* accumulated, respectively, in the lower and upper DWORD lanes
* of u in vec_2d5_26 format.
*
* The SWORD elements of rr, which must be initialised to 5 * r,
* are used in place of r for some of the cross-multiplies in
* order to ensure correct reduction modulo 2^130 - 5.
*/
Also, change m[] to s[] in the method declaration.
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1739:
> 1737: void poly1305_multiply_vec(AsmGenerator &acc,
> 1738: const FloatRegister u[], const FloatRegister m[],
> 1739: const FloatRegister r[], const FloatRegister rr[]);
Comment
/*
* Appends instructions to the generator which implement a widening
* 130-bit multiply u <-- s * r mod 2^130-5 by calling
* poly1305_multiply(acc, u, r, s, RR2, scratch)
* acc.gen();
*/
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409606380
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409606444
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409613796
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410549423
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410536242
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410583483
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1410712922
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1412366611
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1412382710
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1412370579
More information about the hotspot-dev
mailing list