[vector] synthetic doubling, multi-vectors

Paul Sandoz paul.sandoz at oracle.com
Wed Apr 28 21:56:02 UTC 2021


In the BLAS case I wonder if a superior strategy is to dynamically generate loop kernels for the required operation, given the preferred vector shape and an upper bound on the number of hardware vector registers available for use. We can almost certainly generate such kernel code dynamically, tailored to the hardware properties, using the Vector API (bonus points if we can crack a lambda for a fused operation :-) ).

From what I understand, optimized BLAS matrix-multiply loop kernels use a non-obvious unrolling strategy called unroll-and-jam.

As I understand it:
Given matrices A and B, an sgemm micro-kernel might use a 6x16 tile holding a sub-matrix of the intermediate result (occupying 12 ymm registers), which is operated on over 6 row values of A and 16 column values of B. The columns of B are split into 2x8; that’s two more ymm registers. Each scalar row value of A is broadcast, but we don’t need 6 more ymm registers for that: we can process two row values of A at a time, using just 2 more ymm registers.
(With two more ymm registers we could do 6x32 with B split as 4x8.)
Then we wrap that in a loop, stepping by 16 through B and by 6 through A up to some multiple, so that the 6x16 tile is accumulated within the loop.
When the loop finishes, the 6x16 tile is stored into C. This is where it’s possible to fuse in other operations.
The kernel needs to be wrapped in higher loops that perform data movement, so that data is resident in caches and appropriately packed. (Perhaps this part does not require dynamic generation? But we could certainly do that too.)
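
As a rough illustration, here is a minimal sketch of such a 6x16 micro-kernel written directly against the Vector API. The packing layouts of A (column-major, stride 6) and B (row-major, stride 16), the stride of C, and the class and method names are assumptions of this sketch, not a settled design:

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorSpecies;

  final class MicroKernel {
      // 256-bit species: 8 float lanes, i.e. one ymm register.
      static final VectorSpecies<Float> S = FloatVector.SPECIES_256;

      // Compute the 6x16 tile of A*B and store it into C.
      // 12 accumulators + 2 strips of B + the broadcast values of A
      // fit within the 16 ymm registers.
      static void kernel6x16(float[] a, float[] b, float[] c, int k) {
          FloatVector c00 = FloatVector.zero(S), c01 = FloatVector.zero(S);
          FloatVector c10 = FloatVector.zero(S), c11 = FloatVector.zero(S);
          FloatVector c20 = FloatVector.zero(S), c21 = FloatVector.zero(S);
          FloatVector c30 = FloatVector.zero(S), c31 = FloatVector.zero(S);
          FloatVector c40 = FloatVector.zero(S), c41 = FloatVector.zero(S);
          FloatVector c50 = FloatVector.zero(S), c51 = FloatVector.zero(S);
          for (int p = 0; p < k; p++) {
              // The 2x8 split of B's 16 columns.
              FloatVector b0 = FloatVector.fromArray(S, b, p * 16);
              FloatVector b1 = FloatVector.fromArray(S, b, p * 16 + 8);
              // Broadcast each scalar row value of A, two at a time,
              // and accumulate with fused multiply-add.
              FloatVector a0 = FloatVector.broadcast(S, a[p * 6]);
              c00 = a0.fma(b0, c00); c01 = a0.fma(b1, c01);
              FloatVector a1 = FloatVector.broadcast(S, a[p * 6 + 1]);
              c10 = a1.fma(b0, c10); c11 = a1.fma(b1, c11);
              FloatVector a2 = FloatVector.broadcast(S, a[p * 6 + 2]);
              c20 = a2.fma(b0, c20); c21 = a2.fma(b1, c21);
              FloatVector a3 = FloatVector.broadcast(S, a[p * 6 + 3]);
              c30 = a3.fma(b0, c30); c31 = a3.fma(b1, c31);
              FloatVector a4 = FloatVector.broadcast(S, a[p * 6 + 4]);
              c40 = a4.fma(b0, c40); c41 = a4.fma(b1, c41);
              FloatVector a5 = FloatVector.broadcast(S, a[p * 6 + 5]);
              c50 = a5.fma(b0, c50); c51 = a5.fma(b1, c51);
          }
          // Store the accumulated 6x16 tile into C (row stride 16
          // assumed); this is where a fused epilogue would go.
          c00.intoArray(c, 0);  c01.intoArray(c, 8);
          c10.intoArray(c, 16); c11.intoArray(c, 24);
          c20.intoArray(c, 32); c21.intoArray(c, 40);
          c30.intoArray(c, 48); c31.intoArray(c, 56);
          c40.intoArray(c, 64); c41.intoArray(c, 72);
          c50.intoArray(c, 80); c51.intoArray(c, 88);
      }
  }

A dynamic generator would emit a variant of this for the actual register budget and preferred shape, and could splice a fused epilogue in at the final stores.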

Paul.


> On Apr 28, 2021, at 12:18 PM, John Rose <john.r.rose at oracle.com> wrote:
> 
> Many vector APIs (not ours) make software doubling part
> of the routine, and also support other small multiples,
> up to four or five vectors of the same kind.  This is
> useful in many ways.  I might want to traverse a stream
> of 3-word structs:  In that case, I want to load 3 vectors
> at a time through my loop.  I may also wish (in that case)
> to transpose those three vectors into three different
> vectors, where each vector contains only one kind of
> struct element.  (Yes, that’s a true matrix transpose,
> using a canned permutation vector.) This means
> that I also want shuffles on these synthetic types.
> And if I am using ETYPE=byte, I may want doubles
> of those synthetic types, if also VLEN>128.
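> 
> As a concrete illustration of that transpose (a sketch only: the
> helper name and index constants are mine, not part of any API
> design), here is how the x components of four {x,y,z} int structs,
> loaded as three 128-bit vectors, can be gathered with canned
> shuffles; y and z use analogous constants:
> 
>   import jdk.incubator.vector.IntVector;
>   import jdk.incubator.vector.VectorMask;
>   import jdk.incubator.vector.VectorShuffle;
>   import jdk.incubator.vector.VectorSpecies;
> 
>   static final VectorSpecies<Integer> S = IntVector.SPECIES_128; // 4 lanes
> 
>   // v0 = {x0,y0,z0,x1}, v1 = {y1,z1,x2,y2}, v2 = {z2,x3,y3,z3}.
>   static IntVector xLanes(IntVector v0, IntVector v1, IntVector v2) {
>       // Two-vector rearrange: indexes 0..3 pick from v0; indexes
>       // 4..7 wrap to exceptional indexes that pick from v1.
>       VectorShuffle<Integer> s01 = VectorShuffle.fromValues(S, 0, 3, 6, 6);
>       IntVector t = v0.rearrange(s01, v1);        // {x0,x1,x2,x2}
>       // Pull x3 (lane 1 of v2) into the last lane.
>       VectorShuffle<Integer> s2 = VectorShuffle.fromValues(S, 1, 1, 1, 1);
>       VectorMask<Integer> last =
>           VectorMask.fromValues(S, false, false, false, true);
>       return t.blend(v2.rearrange(s2), last);     // {x0,x1,x2,x3}
>   }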
> 
> Another use case for synthetically doubled or tripled
> vectors is loop unrolling.  I can manually unroll a
> vectorized loop by using copy-and-paste to double
> or triple my loop body, but it would be much better
> to double or triple the vector type of the
> loop.  In fact, the decision to double or triple can then
> be made on-line, rather than statically as a manual
> decision.
> 
> Another use case for doubled vectors is vectorized
> reduction (or even scan).  For example, in a sum
> reduction, I’d prefer to keep a vector (not scalar)
> accumulator for my loop, loading each new vector
> from my data stream, and adding its lanes into the
> corresponding lanes of the accumulator.  Of course
> I want to unroll this loop, and that means getting
> 2, 3, or 4 vectors at a time, adding them lanewise,
> and adding the result into the accumulator.
> 
> (At the end, the accumulator is folded horizontally
> to get the final sum.  This scheme only works for
> associative operations.  It can be adjusted for
> floating point if certain error terms are tracked
> by additional logic, as in the compensated sum
> used by java.util.DoubleSummaryStatistics.)
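> 
> (To make that concrete: a minimal sketch of the unrolled
> vector-accumulator sum, assuming SPECIES_PREFERRED and ignoring
> the error-term tracking just mentioned.  With a doubled shape the
> unroll factor would not need to be hard-coded as two:)
> 
>   import jdk.incubator.vector.FloatVector;
>   import jdk.incubator.vector.VectorOperators;
>   import jdk.incubator.vector.VectorSpecies;
> 
>   static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;
> 
>   // Vector accumulators, manually unrolled 2x.  Note that this
>   // reassociates the additions, per the caveat above.
>   static float sum(float[] a) {
>       FloatVector acc0 = FloatVector.zero(S);
>       FloatVector acc1 = FloatVector.zero(S);
>       int step = S.length() * 2;
>       int i = 0;
>       for (; i <= a.length - step; i += step) {
>           acc0 = acc0.add(FloatVector.fromArray(S, a, i));
>           acc1 = acc1.add(FloatVector.fromArray(S, a, i + S.length()));
>       }
>       // Fold the accumulators lanewise, then horizontally.
>       float sum = acc0.add(acc1).reduceLanes(VectorOperators.ADD);
>       for (; i < a.length; i++) sum += a[i];  // scalar tail
>       return sum;
>   }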
> 
> For the unrolled loop to go as fast as possible,
> I want the JIT to choose how to fold up the
> lane values in each loop.  I think that works
> best if there is a vectorized reduce operation
> which reduces the lanes of a multi-vector
> into the lanes of the accumulator.  Perhaps
> there is API design to do there.  If so, I think
> it may pay off, specifically, in BLAS loops.
> 
> In any case, if we invest in synthetic shape
> doubling for any of the above reasons, that will
> reduce the incremental cost of the shape-doubled
> S_Max_BIT_X2 mentioned in the thread about
> shuffle representation.
> 
> For concreteness, we could add as many as 20
> new shapes.  For each of the existing shapes:
> S_{64,128,256,512,Max}_BIT
> …we could add multiples up to 5:
> S_{64,128,256,512,Max}_BIT_X{2,3,4,5}
> 
> The larger shapes will stress auxiliary types
> VectorMask and VectorShuffle.  For vector
> masks, the toLong accessor will fail on
> byte vector shapes larger than 512 bits,
> such as S_512_BIT_X2 or S_128_BIT_X5.
> Maybe we want VM::toLong(int) to
> cover such cases, or maybe it’s fine as is.
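> 
> (Such a chunked accessor can be emulated today.  A sketch only,
> with the partitioning convention being my assumption:)
> 
>   import jdk.incubator.vector.VectorMask;
> 
>   // Extract 64 mask bits at a time from a mask of more than 64
>   // lanes; part 0 covers lanes 0..63, part 1 covers 64..127, etc.
>   static long toLong(VectorMask<Byte> m, int part) {
>       long bits = 0;
>       int base = part * 64;
>       for (int i = 0; i < 64 && base + i < m.length(); i++) {
>           if (m.laneIsSet(base + i)) bits |= 1L << i;
>       }
>       return bits;
>   }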
> 
> For shuffles, the byte lane size will fail
> for VLEN>128, which means VSIZE>1024
> (for ESIZE=8).  This means S_512_BIT_X3,
> etc.
> 
> What’s worse is the complexity of shuffle
> logic when it must be synthesized by repeated
> application of shuffling the native size vectors.
> Consider a full, arbitrary shuffle of an N-way
> multi-vector {A,B,C…} into a destination {D,E,F…}.
> Each sub-result D must be merged from elements of
> each of A,B,C…  That means N*N total permutations,
> plus blends, using a straightforward algorithm.
> Refinements are possible, but it’s inherently
> expensive in the general case.
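> 
> (Spelled out, the straightforward algorithm looks like the sketch
> below, over arrays of native int vectors; in real code the
> shuffles and masks would be canned rather than rebuilt per call:)
> 
>   import jdk.incubator.vector.IntVector;
>   import jdk.incubator.vector.VectorMask;
>   import jdk.incubator.vector.VectorShuffle;
>   import jdk.incubator.vector.VectorSpecies;
> 
>   static final VectorSpecies<Integer> S = IntVector.SPECIES_PREFERRED;
> 
>   // idx[d*M + j] is the flat source lane (in 0..N*M-1) feeding
>   // lane j of destination sub-vector d.
>   static IntVector[] multiShuffle(IntVector[] src, int[] idx) {
>       int n = src.length, m = S.length();
>       IntVector[] dst = new IntVector[n];
>       for (int d = 0; d < n; d++) {
>           IntVector acc = IntVector.zero(S);
>           for (int s = 0; s < n; s++) {
>               // Which lanes of dst[d] come from src[s]?
>               int[] pick = new int[m];
>               boolean[] take = new boolean[m];
>               for (int j = 0; j < m; j++) {
>                   int flat = idx[d * m + j];
>                   take[j] = (flat / m == s);
>                   pick[j] = take[j] ? flat % m : 0;
>               }
>               // One permutation plus one blend per (d, s) pair:
>               // N*N of each in total.
>               IntVector perm =
>                   src[s].rearrange(VectorShuffle.fromArray(S, pick, 0));
>               acc = acc.blend(perm, VectorMask.fromArray(S, take, 0));
>           }
>           dst[d] = acc;
>       }
>       return dst;
>   }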
> 
> Probably storing to memory and reloading is better,
> in the general case.  The JIT can manage this without
> heap allocations.
> 
> In other cases, the JIT can notice patterns in the
> shuffle and reduce work; for example, maybe the
> sub-vector D only takes input from A, and so on,
> so there are fewer than N*N permutations.
> 
> The math behind shuffle complexity and scaling
> is that there are M! (VLEN factorial) distinct shuffles
> for M lanes.  For an N-way multi-vector of MN lanes,
> that scales more than exponentially.  The number of
> steps required to perform a shuffle scales like the
> log of the number of lanes (assuming each step
> can touch all the lanes).  So the work to shuffle
> M lanes (out of M! choices) is about M log M
> (from Stirling’s approximation for factorial).
> 
> The work to shuffle MN lanes is then about
> MN log MN = N(M log M) + M(N log N),
> at a minimum.  The M log M factor corresponds
> to a single hardware vector shuffle, and the
> N log N factor is a software-coded sorting
> network moving across sub-vectors; if this
> can be vectorized, then the M in M(N log N)
> goes to one, so we get N native shuffles (each
> M wide), plus N log N sorting network steps
> (each also M wide), if we are very clever.
> The simple N*N-shuffle algorithm is the
> vector analogue of a naive quadratic
> sort algorithm.  The practical reality will
> be somewhere in between.
> 
> Multi-vectors can also be viewed as miniature
> matrices.  (In fact, for BLAS kernels, that is
> *exactly* what they are.)  As such, it makes
> sense to contemplate partitioned operations
> on them, which perform reductions (and
> scans) or permutations along either major
> or minor axes.  It’s reasonable to ask an
> N-way multi-vector (of MN lanes) to sum
> all of its columns down to a native vector
> of M lanes, or to permute each of its rows
> *separately* using either a set of M-lane
> shuffles (N of them, one per row) or else
> a common shuffle.  Also it’s reasonable
> to transpose it as a matrix, especially if
> M=N.  At that point you can perform
> column-wise reductions or shuffles.
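> 
> (For instance, the column-wise sum is just a lanewise fold of the
> rows.  A sketch, with the multi-vector modeled as an array of
> native vectors and S as in the earlier sketch:)
> 
>   // Sum all N rows of an N-way multi-vector down to one native
>   // vector of M lanes: a column-wise (partitioned) reduction.
>   static IntVector sumColumns(IntVector[] rows) {
>       IntVector acc = IntVector.zero(S);
>       for (IntVector r : rows) acc = acc.add(r);
>       return acc;
>   }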
> 
> I think such steps are common in BLAS kernels.
> 
> When we get Valhalla specialized generics,
> we can use a nested notation for expressing
> these synthetic vectors, as Vector<Vector<float>>.
> 
> Then partitioned reduction becomes simple,
> using the existing API point (with a clever
> interpretation).  Partitioned shuffle turns into
> a lanewise shuffle.  Who knew shuffle was a
> member of VectorOperators, but there it is!
> 
> So, those are my thoughts at this point about
> multi-vectors and doubling schemes.  At this
> point it seems to be a research project, but I
> hope we can mine useful engineering out of
> this vein.
> 
> — John


