[vectorIntrinsics] reinterpret vs. reshape vs. cast
John Rose
john.r.rose at oracle.com
Mon Jun 3 22:13:02 UTC 2019
On Jun 3, 2019, at 2:07 PM, Kharbas, Kishor <kishor.kharbas at intel.com> wrote:
>
> It’s an interesting proposal and if I understand it correctly you want the computation to proceed as [1] and not as [2]. To do that we limit the shape changing apis, the only one now would be explicit reinterpret() or toArray() followed by fromArray(),
I liked your first highly graphical reply. As the creative writing teachers
tell us, "show me, don't tell me". :-)
I think we'll get computations that are easier to reason about if the
default is to stay within one shape, and if the shape-shifting operations
are clear.
Some operations produce answers in two (or more) parts. For example,
casting int to double will (if one shape is used throughout) produce two
output vectors for every one input vector. More abstractly, "zip" and "shuffle"
do this also (two output vectors for a pair of inputs, all of one shape).
I think I have a good way to explain this, and thus make an API that will
demystify such operations (including cast and reshape).
Here's my thinking at present:
* Most vector operations produce outputs the same size as their inputs.
* Some special vector operations produce outputs larger or smaller than
their inputs.
* We must distinguish *logical outputs* from *physical outputs*. A cast
from byte to double produces a logical output which is eight times larger
than its input. This is true even if there is no actual vector format that is
so large. A cast from double down to byte produces a logical output that
is 1/8 the size of its input.
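As a back-of-the-envelope illustration (plain Java, not any proposed API; the helper name is made up), the factor between logical and physical output of a lanewise cast falls out of the two lane sizes:

```java
// Hypothetical helper, not part of any proposed API: the ratio M between
// the logical and physical output sizes of a lanewise cast performed
// within one shape.  byte->double gives M = 8 (logical output 8x larger,
// a "squeeze"); double->byte also gives M = 8, but in the other
// direction (logical output 1/8 the size, an "unsqueeze").
class SqueezeFactor {
    // Lane sizes in bytes, e.g. byte = 1, int = 4, double = 8.
    static int factor(int inLaneBytes, int outLaneBytes) {
        return outLaneBytes >= inLaneBytes
                ? outLaneBytes / inLaneBytes   // widening: squeeze factor
                : inLaneBytes / outLaneBytes;  // narrowing: unsqueeze factor
    }
}
```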
* (You could also define a separate "logical input", but that seems less
useful and less common. Perhaps a two-input shuffle works on a logical
input that is both vectors appended, etc.)
* The physical output of an operation is simply the shape of its result.
Since we don't have multiple-value returns, there's always just one result.
* Every operation that includes a resize can be characterized by the size
of its logical output versus the size of its physical output.
* When a logical output is larger than a physical output, we say the
output is "squeezed". (I'm avoiding the traditional terms "pack" or
"compress".) The "squeeze factor" M is the ratio of the sizes.
* When a logical output is smaller than a physical output, we say the
output is "unsqueezed". (I'm using the "un" version of the other term
here because it makes it clear that something will happen that's the
exact reversal of something else.)
* Every operation that includes a squeeze necessarily produces a partial
result. The partial result contains 1/M of the logical result.
* If you want all of the logical result, you must invoke the operation M
times, to get each partial result. Such methods have a parameter
"int part", the "part number", which ranges over [0..M-1]. (Many
vector ISAs feature such part numbers in various places. They can
be confusing to understand. We are trying to do better.)
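To make the part numbering concrete, here is a sketch that models a widening byte-to-int cast with plain arrays (the class and method names are invented; this models the semantics described above, not the API). With a 64-bit shape, a byte vector has 8 lanes but an int vector only 2, so M = 4 and the full logical result takes parts 0 through 3:

```java
// Sketch only: a widening byte->int cast within one 64-bit "shape",
// modeled with arrays instead of vectors.  M = 4, so each call yields
// one quarter of the logical result.
class WideningParts {
    static final int SHAPE_BYTES = 8;                        // 64-bit shape

    static int[] castByteToIntPart(byte[] in, int part) {
        int m = Integer.BYTES / Byte.BYTES;                  // squeeze factor M = 4
        if (part < 0 || part >= m)
            throw new ArrayIndexOutOfBoundsException(part);  // no wrapping or clipping
        int outLanes = SHAPE_BYTES / Integer.BYTES;          // 2 int lanes per shape
        int[] out = new int[outLanes];
        for (int i = 0; i < outLanes; i++)
            out[i] = in[part * outLanes + i];                // sign-extending widen
        return out;
    }
}
```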
* If you ask for a part number that doesn't exist, you get an AIOOB
(ArrayIndexOutOfBounds) exception. There's no wrapping or clipping or
other DWIM of parts.
* Part zero is always valid, even if there is no squeezing.
* An operation that unsqueezes has a smaller logical result, which
is 1/M the size of the physical result. Normally the small result is
placed into the physical register starting at lane 0 and filling unused
lanes with zeroes.
* An unsqueezing operation also takes an "int part" parameter, which
steers the logical result into one of M different zones of the output
vector. If part is zero, then the first zone (starting with lane 0) is
used. Each subsequent part occupies the next zone after the previous one,
with no overlap. The final part ends at the end of the vector.
* The part numbering of unsqueezing operations meshes with the
part numbering of squeezing operations, such that if you squeeze
a vector V through an operation O into (say) M=4 partial outputs W[0..3],
and if you unsqueeze those through a complementary operation P
into 4 partial inputs X[0..3], a bitwise or (or other merge) can then
assemble the final answer Y, with a lane structure parallel to V.
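The mesh between the two numberings can be checked with a small array model (hypothetical names; a sketch of the semantics only): squeeze a byte vector through a widening cast into four partial outputs, unsqueeze each one back into its zone, and OR the pieces together.

```java
// Sketch: squeeze/unsqueeze round trip over a 64-bit shape (8 byte
// lanes, 2 int lanes), showing that the two part numberings mesh.
class SqueezeRoundTrip {
    static final int SHAPE_BYTES = 8;

    // Squeezing op O: part `part` of a widening byte->int cast (M = 4).
    static int[] widenPart(byte[] v, int part) {
        int[] w = new int[SHAPE_BYTES / Integer.BYTES];
        for (int i = 0; i < w.length; i++)
            w[i] = v[part * w.length + i];
        return w;
    }

    // Complementary unsqueezing op P: narrow back to bytes, steering the
    // result into zone `part`; all other lanes stay zero.
    static byte[] narrowToZone(int[] w, int part) {
        byte[] x = new byte[SHAPE_BYTES];
        for (int i = 0; i < w.length; i++)
            x[part * w.length + i] = (byte) w[i];
        return x;
    }

    // A bitwise OR of the partial inputs X[0..3] rebuilds V lane for lane.
    static byte[] roundTrip(byte[] v) {
        byte[] y = new byte[SHAPE_BYTES];
        for (int part = 0; part < 4; part++) {
            byte[] x = narrowToZone(widenPart(v, part), part);
            for (int i = 0; i < SHAPE_BYTES; i++)
                y[i] |= x[i];
        }
        return y;
    }
}
```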
* In order to add an error check for methods which might perform
either squeezes or unsqueezes (data-dependently), the part parameters
of unsqueezes are non-positive, in the range [-M+1..0]. The selected
part is the absolute value of the passed parameter. The sign simply
encodes the user's acknowledgement that an unsqueeze is happening.
(We could also consider having pairs of methods, to encode the intention
symbolically.)
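A sketch of that error check (the method name is invented): the sign of the part parameter must agree with the direction of the resize, and the selected part is the absolute value.

```java
// Hypothetical argument check for a data-dependent resize: squeezes take
// parts in [0..M-1], unsqueezes take parts in [-M+1..0]; the selected
// part is the absolute value, and the sign records which direction the
// caller expects the resize to go.
class PartCheck {
    static int checkPart(boolean squeezing, int part, int m) {
        if (squeezing) {
            if (part < 0 || part >= m)
                throw new ArrayIndexOutOfBoundsException(part);
            return part;
        } else {
            if (part > 0 || part <= -m)
                throw new ArrayIndexOutOfBoundsException(part);
            return -part;   // selected part is |part|
        }
    }
}
```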
* Non-resizing operations reject non-zero part numbers in all cases.
* Operations that work simultaneously on multiple lane sizes
(scatter/gather indexes of type int/long vs. payloads of byte or double)
can perhaps be viewed as taking larger logical inputs, for the
operands with the larger lanes (int/long indexes or double payloads).
It probably makes sense to view them as taking correspondingly
larger logical outputs, making them into squeezing operations.
Methods which squeeze require a non-negative part parameter and
are the following:
* cast, from a smaller to a larger lane size
* reshape or reinterpret, from a larger to a smaller shape
* unpack operations, which expand smaller lanes to (zero- or sign-filled) larger lanes
* data driven expansion operations (APL expand)
* zip or shuffle instructions, which interleave lanes from two (or more) vectors (logical output is the whole zip)
* shift lanes (if viewed as producing a double-sized output)
* (maybe) scatter or gather, when the payload lane (byte/short/int) is smaller than the index lane (int/long)
* other two-input operations which consolidate results into one same-shape input (merge sort step?)
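For instance, zip fits the same scheme (array sketch, invented names): interleaving two 8-lane vectors yields a logical output two vectors wide, so M = 2 and each call asks for part 0 or part 1.

```java
// Sketch: zip of two 8-lane vectors.  The logical output is the whole
// interleaving (16 lanes, so M = 2); each call returns one physical part.
class ZipParts {
    static final int VLEN = 8;

    static int[] zipPart(int[] a, int[] b, int part) {
        if (part < 0 || part >= 2)
            throw new ArrayIndexOutOfBoundsException(part);
        int[] out = new int[VLEN];
        for (int i = 0; i < VLEN; i++) {
            int logicalLane = part * VLEN + i;   // lane within the whole zip
            int src = logicalLane / 2;           // source lane in a or b
            out[i] = (logicalLane % 2 == 0) ? a[src] : b[src];
        }
        return out;
    }
}
```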
Unsqueezing methods are:
* cast, from a larger to a smaller lane size
* reshape or reinterpret, from a smaller to a larger shape
* pack operations, which truncate larger lanes to smaller lanes
* data driven compression operations (APL compress)
* (maybe) unzip or unshuffle instructions, which extract interleaved lanes back out to two (or more) vectors
* (maybe) scatter or gather, when the payload lane (double) is larger than the index lane (int)
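As one concrete case from the second list, APL-style compress can be sketched the same way (invented names, array model): the kept lanes pack down from lane 0, the rest zero-fill, and the logical result is smaller than the physical vector.

```java
// Sketch: data-driven compression (APL "compress").  Lanes whose mask
// bit is set are packed toward lane 0; the remaining lanes are zero.
// The logical result is smaller than the physical output, so this is
// an unsqueezing operation in the classification above.
class CompressSketch {
    static final int VLEN = 8;

    static int[] compress(int[] v, boolean[] mask) {
        int[] out = new int[VLEN];
        int j = 0;
        for (int i = 0; i < VLEN; i++)
            if (mask[i])
                out[j++] = v[i];
        return out;
    }
}
```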
Nearly all squeezing methods have a common need for non-zero part
selection, to avoid loss of useful data and bandwidth. Non-zero part
numbers on unsqueezing methods might be less common.
Some of the squeezing operations (unpack, expand) have names which reverse
the connotation of squeezing. This is not a bug in the classification, but rather
a change of focus: When you *unpack* your furniture, your *room* might well
feel cramped. The furniture is loosely arranged, at the expense of using too
much space in the room that contains it. Likewise, the need to work with part
numbers happens exactly when there is a mismatch between logical output and
physical output.
That mismatch can arise in a variety of ways:
* Data can expand or contract lanewise (cast, pack/unpack)
* Containers can change size (reshape/reinterpret)
* Logical results can incorporate a pair of inputs (zip)
* Logical results consist of a lengthened, zero-padded input (shift lanes)
* (maybe) Mixed-size input lanes require extra-large logical inputs
(You can see why I reached for a new term "squeeze" instead of the traditional
terms. We need a special term or two to refer to the uncomfortable circumstances
that arise when our containers turn out to be the wrong size.)
I'll post a draft that incorporates this language sometime this week, hopefully
before Wednesday.
— John