[vector] synthetic doubling, multi-vectors
John Rose
john.r.rose at oracle.com
Wed Apr 28 19:18:08 UTC 2021
Many vector APIs (not ours) make software doubling part
of the routine, and also support other small multiples,
up to four or five vectors of the same kind. This is
useful in many ways. I might want to traverse a stream
of 3-word structs: in that case, I want to load 3 vectors
at a time through my loop. I may also wish (in that case)
to transpose those three vectors into three different
vectors, where each vector contains only one kind of
struct element. (Yes, that’s a true matrix transpose,
using a canned permutation vector.) This means
that I also want shuffles on these synthetic types.
And if I am using ETYPE=byte, I may want doubles
of those synthetic types, if also VLEN>128.
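To make the 3-vectors-per-trip traversal concrete, here is a minimal sketch using the incubating Vector API. The class name, array name, and the choice of a 256-bit float species are illustrative only, and the 3x3 deinterleaving transpose is deliberately left as a comment:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class StructStream {
    static final VectorSpecies<Float> SP = FloatVector.SPECIES_256; // 8 float lanes

    public static void main(String[] args) {
        float[] xyz = new float[3 * 64];        // 64 packed {x,y,z} structs
        for (int i = 0; i < xyz.length; i++) xyz[i] = i;
        FloatVector acc = FloatVector.zero(SP); // only to make the loads observable
        for (int i = 0; i < xyz.length; i += 3 * SP.length()) {
            // three vectors per trip: 8 structs in interleaved {x,y,z} order
            FloatVector v0 = FloatVector.fromArray(SP, xyz, i);
            FloatVector v1 = FloatVector.fromArray(SP, xyz, i + SP.length());
            FloatVector v2 = FloatVector.fromArray(SP, xyz, i + 2 * SP.length());
            // Deinterleaving v0..v2 into an x-vector, y-vector, and z-vector
            // is a 3x3 transpose built from canned VectorShuffles (not shown).
            acc = acc.add(v0).add(v1).add(v2);
        }
        System.out.println(acc.reduceLanes(VectorOperators.ADD)); // sum of 0..191
    }
}
```

(Running this requires `--add-modules jdk.incubator.vector` on current JDKs.)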
Another use case for synthetically doubled or tripled
vectors is loop unrolling. I can manually unroll a
vectorized loop by using copy-and-paste to double
or triple my loop body, but it would be much better
to double or triple the vector type of the loop
instead. In fact, the decision to double or triple can then
be made on-line, rather than statically as a manual
decision.
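The copy-and-paste style of unrolling looks like the sketch below; a doubled vector type would collapse the two load/compute/store triples back into one. Names (`Unroll2`, `scale2`) and the 256-bit species are illustrative assumptions:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class Unroll2 {
    static final VectorSpecies<Float> SP = FloatVector.SPECIES_256; // 8 float lanes

    // 2x manually unrolled lanewise op: y[i] = 2 * x[i].
    // Assumes x.length is a multiple of 2 * SP.length(); no tail loop shown.
    static void scale2(float[] x, float[] y) {
        for (int i = 0; i < x.length; i += 2 * SP.length()) {
            FloatVector a = FloatVector.fromArray(SP, x, i);
            FloatVector b = FloatVector.fromArray(SP, x, i + SP.length());
            a.mul(2f).intoArray(y, i);
            b.mul(2f).intoArray(y, i + SP.length());
        }
    }

    public static void main(String[] args) {
        float[] x = new float[32], y = new float[32];
        for (int i = 0; i < x.length; i++) x[i] = i;
        scale2(x, y);
        System.out.println(y[31]); // 2 * 31 = 62.0
    }
}
```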
Another use case for doubled vectors is vectorized
reduction (or even scan). For example, in a sum
reduction, I’d prefer to keep a vector (not scalar)
accumulator for my loop, loading each new vector
from my data stream, and adding its lanes into the
corresponding lanes of the accumulator. Of course
I want to unroll this loop, and that means getting
2, 3, or 4 vectors at a time, adding them lanewise,
and adding the result into the accumulator.
(At the end, the accumulator is folded horizontally
to get the final sum. This scheme only works for
associative operations. It can be adjusted for
floating point if certain error terms are tracked
by additional logic, as in the compensating sum
of java.util.DoubleSummaryStatistics.)
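A sketch of that reduction pattern, written today without multi-vectors: a vector accumulator, a 2x-unrolled body that pairs the incoming vectors before adding into the accumulator, and a single horizontal fold at the end. Class and method names are illustrative; note the float sums here reassociate, which is fine only because the example treats addition as associative:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorSum {
    static final VectorSpecies<Float> SP = FloatVector.SPECIES_256; // 8 float lanes

    static float sum(float[] a) {
        FloatVector acc = FloatVector.zero(SP);  // vector, not scalar, accumulator
        int i = 0, step = 2 * SP.length();
        for (; i + step <= a.length; i += step) {
            FloatVector v0 = FloatVector.fromArray(SP, a, i);
            FloatVector v1 = FloatVector.fromArray(SP, a, i + SP.length());
            acc = acc.add(v0.add(v1));           // pair first, then accumulate
        }
        float s = acc.reduceLanes(VectorOperators.ADD); // fold horizontally once
        for (; i < a.length; i++) s += a[i];            // scalar tail
        return s;
    }

    public static void main(String[] args) {
        float[] a = new float[100];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(sum(a)); // 0 + 1 + ... + 99 = 4950.0
    }
}
```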
For the unrolled loop to go as fast as possible,
I want the JIT to choose how to fold up the
lane values in each loop. I think that works
best if there is a vectorized reduce operation
which reduces the lanes of a multi-vector
into the lanes of the accumulator. Perhaps
there is API design work to do there. If so, I think
it may pay off, specifically, in BLAS loops.
In any case, if we invest in synthetic shape
doubling for any of the above reasons, that will
reduce the incremental cost of the shape-doubled
S_Max_BIT_X2 mentioned in the thread about
shuffle representation.
For concreteness, we could add as many as 20
new shapes. For each of the existing shapes:
S_{64,128,256,512,Max}_BIT
…we could add multiples up to 5:
S_{64,128,256,512,Max}_BIT_X{2,3,4,5}
The larger shapes will stress auxiliary types
VectorMask and VectorShuffle. For vector
masks, the toLong accessor will fail on
byte vector shapes larger than 512 bits,
such as S_512_BIT_X2 or S_128_BIT_X5.
Maybe we want VM::toLong(int) to
cover such cases, or maybe it’s fine as is.
For shuffles, the byte lane size will fail
for VLEN>128, which means VSIZE>1024
(for ESIZE=8). This means S_512_BIT_X3,
etc.
What’s worse is the complexity of shuffle
logic when it must be synthesized by repeated
application of shuffling the native size vectors.
Consider a full, arbitrary shuffle of an N-way
multi-vector {A,B,C…} into a destination {D,E,F…}.
Each sub-result D must be merged from elements of
each of A,B,C… That means N*N total permutations,
plus blends, using a straightforward algorithm.
Refinements are possible, but it’s inherently
expensive in the general case.
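For N=2 the straightforward algorithm looks like this: each destination sub-vector is assembled from N rearranges (one per source) plus a blend, N*N rearranges in all. The helper name `shuffle2` and the permutation encoding (destination lane j takes source lane `perm[j]`, counted across both sub-vectors) are my own conventions, not API:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class MultiShuffle {
    static final VectorSpecies<Float> SP = FloatVector.SPECIES_256; // M = 8

    // Full shuffle of a 2-way multi-vector {a,b} into {d,e}: N*N = 4 native
    // rearranges plus N = 2 blends.  perm has 2*M entries in 0..2*M-1.
    static FloatVector[] shuffle2(FloatVector a, FloatVector b, int[] perm) {
        int m = SP.length();
        FloatVector[] dst = new FloatVector[2];
        for (int d = 0; d < 2; d++) {
            int[] idx = new int[m];
            boolean[] fromB = new boolean[m];
            for (int i = 0; i < m; i++) {
                int p = perm[d * m + i];
                fromB[i] = p >= m;  // which sub-vector supplies this lane
                idx[i] = p % m;     // lane index within that sub-vector
            }
            VectorShuffle<Float> s = VectorShuffle.fromValues(SP, idx);
            dst[d] = a.rearrange(s)
                      .blend(b.rearrange(s), VectorMask.fromValues(SP, fromB));
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] data = new float[16];
        for (int i = 0; i < 16; i++) data[i] = i;
        FloatVector a = FloatVector.fromArray(SP, data, 0);
        FloatVector b = FloatVector.fromArray(SP, data, 8);
        int[] reverse = new int[16];
        for (int j = 0; j < 16; j++) reverse[j] = 15 - j; // reverse all 16 lanes
        FloatVector[] de = shuffle2(a, b, reverse);
        float[] out = new float[16];
        de[0].intoArray(out, 0);
        de[1].intoArray(out, 8);
        System.out.println(Arrays.toString(out)); // 15.0 down to 0.0
    }
}
```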
Probably storing to memory and reloading is better,
in the general case. The JIT can manage this without
heap allocations.
In other cases, the JIT can notice patterns in the
shuffle and reduce work; for example, maybe the
sub-vector D only takes input from A, and so on,
so there are fewer than N*N permutations.
The math behind shuffle complexity and scaling
is that there are M! (VLEN factorial) distinct shuffles
for M lanes. For an N-way multi-vector of MN lanes,
that scales more than exponentially. The number of
steps required to perform a shuffle scales like the
log of the number of lanes (assuming each step
can touch all the lanes). So the work to shuffle
M lanes (out of M! choices) is about M log M
(from Stirling’s approximation for factorial).
The work to shuffle MN lanes is then about
MN log MN = N(M log M) + M(N log N),
at a minimum. The M log M factor corresponds
to a single hardware vector shuffle, and the
N log N factor is a software-coded sorting
network moving across sub-vectors; if this
can be vectorized, then the M in M(N log N)
goes to one, so we get N native shuffles (each
M wide), plus N log N sorting network steps
(each also M wide), if we are very clever.
The simple N*N-shuffle algorithm is the
vector analogue of a naive quadratic
sort algorithm. The practical reality will
be somewhere in between.
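Written out, the counting argument is just Stirling plus log algebra, using the same symbols as above (M lanes per native vector, N-way multi-vector):

```latex
% Distinct shuffles of K lanes: K!, so the control information is
%   \log(K!) \approx K \log K   (Stirling's approximation).
% For K = MN lanes:
\log\big((MN)!\big) \;\approx\; MN \log(MN)
  \;=\; MN\,(\log M + \log N)
  \;=\; N\,(M \log M) \;+\; M\,(N \log N).
```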
Multi-vectors can also be viewed as miniature
matrices. (In fact, for BLAS kernels, that is
*exactly* what they are.) As such, it makes
sense to contemplate partitioned operations
on them, which perform reductions (and
scans) or permutations along either major
or minor axes. It’s reasonable to ask an
N-way multi-vector (of MN lanes) to sum
all of its columns down to a native vector
of M lanes, or to permute each of its rows
*separately* using either a set of M-lane
shuffles (N of them, one per row) or else
a common shuffle. Also it’s reasonable
to transpose it as a matrix, especially if
M=N. At that point you can perform
column-wise reductions or shuffles.
I think such steps are common in BLAS kernels.
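The column-wise reduction is the cheap case even today: folding the N rows of a multi-vector down to one native M-lane vector is pure lanewise addition, with no horizontal operations at all. A minimal sketch (names and the 3x8 shape are illustrative):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class ColumnSum {
    static final VectorSpecies<Float> SP = FloatVector.SPECIES_256; // M = 8

    // Column-wise reduction: fold the N rows of a multi-vector down to one
    // native M-lane vector using only lanewise adds.
    static FloatVector sumColumns(FloatVector[] rows) {
        FloatVector acc = FloatVector.zero(SP);
        for (FloatVector row : rows) acc = acc.add(row);
        return acc;
    }

    public static void main(String[] args) {
        float[] data = new float[24];  // a 3x8 mini-matrix, row-major
        for (int i = 0; i < data.length; i++) data[i] = i;
        FloatVector[] rows = new FloatVector[3];
        for (int r = 0; r < 3; r++)
            rows[r] = FloatVector.fromArray(SP, data, r * SP.length());
        float[] out = new float[SP.length()];
        sumColumns(rows).intoArray(out, 0);
        System.out.println(Arrays.toString(out)); // column j: j + (j+8) + (j+16)
    }
}
```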
When we get Valhalla specialized generics,
we can use a nested notation for expressing
these synthetic vectors, as Vector<Vector<float>>.
Then partitioned reduction becomes simple,
using the existing API point (with a clever
interpretation). Partitioned shuffle turns into
a lanewise shuffle. Who knew shuffle was a
member of VectorOperations, but there it is!
So, those are my thoughts at this point about
multi-vectors and doubling schemes. At this
point it seems to be a research project, but I
hope we can mine useful engineering out of
this vein.
— John