[vectorIntrinsics] reinterpret vs. reshape vs. cast

John Rose john.r.rose at oracle.com
Thu May 30 17:48:06 UTC 2019


I have been thinking about the API points that
convert vectors between the various species.
The existing points are a good "stab" at the
basics, but they need reconsideration,
especially now that we have decent implementations
and know (more directly) what the underlying
"physics" of the Vector API are.

So, I want to get rid of reshape, merging it into
reinterpret.  The two methods are not different
enough (AFAICT) to warrant parallel implementations.

In a private version of the branch, I have rewritten
reshape as an alias of reinterpret, but without the
extra <F> variable:

    public abstract <F> Vector<F> reinterpret(VectorSpecies<F> s);

    /** Use reinterpret. */
    @Deprecated
    public final Vector<E> reshape(VectorSpecies<E> s) {
        s.check(elementType());  // verify same E
        return reinterpret(s);
    }

After various constant-folding operations, it still
comes out the same, as a call to VI.reinterpret.

The "check" call is an extra runtime check which
ensures that, in fact, the species has the element
type E as claimed by the static type system.
Because Java allows unchecked casts (and we
use them) I've sprinkled such "check" calls
wherever I think the static type system might
need a little run-time help.
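As a minimal sketch of that "check" pattern (the class and method names here are illustrative stand-ins, not the actual VectorSpecies implementation): a species records its element type, and check compares a claimed class against it at run time, backing up the static type system where unchecked casts were used.

```java
// Illustrative sketch only: a species-like object that can verify,
// at run time, that its element type matches what the static type
// system claims.  Not the real VectorSpecies implementation.
final class SpeciesSketch<E> {
    private final Class<E> elementType;

    SpeciesSketch(Class<E> elementType) { this.elementType = elementType; }

    Class<E> elementType() { return elementType; }

    // Throws if the claimed element class does not match the actual one,
    // catching lies introduced by unchecked casts.
    void check(Class<?> claimed) {
        if (claimed != elementType)
            throw new ClassCastException(claimed + " does not match " + elementType);
    }
}
```

So `new SpeciesSketch<>(Integer.class).check(Integer.class)` passes silently, while checking against `Byte.class` fails fast instead of letting a mistyped vector propagate.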

So much for reshape.  Now let's talk about the
hard problem at the center of all of this:  The
unpredictable resizing of vectors.  Most vector
workloads choose a key vector size and use
it for nearly all operations.  Hardware sometimes
prefers to work with a single size at a time,
and human beings prefer to reason about
constant information contents, rather than
having to re-derive a size for every vector
sub-expression.

This means I think the Vector API should
have a clearer policy of shape-invariance.
If you start with shape S_128_BIT and do
a bunch of vector operations, you should
end up with the same shape, unless you
intentionally select an operation that is
documented to change a shape.  This is
all the more important in portable, shape-agnostic
code, where you don't know the
size of the preferred shape.  Having that
suddenly change to a non-preferred shape
would be a headache.

Therefore, I want to make "shape-changing"
methods a special category, that is clearly
called out in the javadoc, and easy for the
user to recognize in the user's code.

What would non-reshaping methods be?
Well, anything lane-wise that preserves
the element type (ETYPE) is obviously shape
invariant.  Also shuffles, which are not
lane-wise but keep VLENGTH and ETYPE
constant, are shape invariant.  Operations
with masks and shuffles are pretty much
always shape invariant.

In-place reinterpret casts are shape invariant,
even as they completely redraw lane types
and lane boundaries.

Now we come to lane-wise value casts.  These
are sometimes indispensable but will change
the shape of the vector if (as is often the case)
the cast changes the bit-size of the ETYPE.
The implementation code for this (in the JIT)
shows how disruptive this is to implement.
And I think it's equally disruptive to users.
What happens if you need to convert from
byte to int, but your preferred byte species
cannot scale upward by 4x to an int species
of another supported shape?  The API requires
you to find out in advance whether the larger
shape exists to hold the int species for the
cast result.  If you can't find one, you need
to radically recast your computation.  This is
a portability anti-pattern.

Here's what I think is a better way, more portable,
easier to implement, and easier for users to manage:

Introduce a cast operation which is shape
invariant.  Something like this:

/** Converts this vector lane-wise to a new vector
  * of a lane-type F, keeping underlying shape constant.
  *  … */
<F> Vector<F> convert(Conversion<E,F> conv, int part);

Rather than rely on Class<F> to denote the conversion
implicitly, the user selects a specific conversion operation
from a suitable repertoire (TBD).  The conversion "knows"
its domain and range types, and therefore also "knows"
whether it will expand or contract the vector.  Byte to int
expands, while int to byte contracts, and so on.
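A small sketch of what "the conversion knows" could mean (the names here are illustrative, not the proposed API): given the bit-sizes of the domain and range lane types, the expansion or contraction factor falls out arithmetically.

```java
// Illustrative sketch: a conversion descriptor that "knows" its domain
// and range lane sizes, and therefore whether it expands or contracts
// a shape-invariant vector, and by what factor.  Not the proposed API.
record ConversionSketch(int domainBits, int rangeBits) {
    // Number of output vectors (parts) an expanding conversion produces.
    int expandParts() { return Math.max(1, rangeBits / domainBits); }

    // Factor by which a contracting conversion shrinks its payload.
    int contractFactor() { return Math.max(1, domainBits / rangeBits); }
}
```

For example, byte to int (8 to 32 bits) gives `expandParts() == 4`, while int to byte (32 to 8 bits) gives `contractFactor() == 4`; a same-size conversion gives 1 for both.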

(Several ISAs refer to this phenomenon as "unpack"
and "pack".  There's also "zip" and "unzip" in SVE
which have a related function, and AVX has two-vector
shuffles that can do the same.  Zip could be useful
for zero-filling or sign-filling expansion, while unzip
could be useful for extraction.  There are also mask
driven, variable motion, APL-like compress and expand
operations on the table which are at least related and
maybe useful as implementation tools.  There's lots
more to be said about implementation, but we can
defer that for later.)

When a conversion is expanding, in order to retain
shape invariance, it is necessary for the conversion
to produce 2 (or 4 or 8) output vectors.  We can
try to hide this fact in the name of simplicity for
the user, but it just makes the code shape-shifty,
which (IMO) hurts the user's ability to reason about
the rest of the code.

(If there are intrinsic or synthetic shapes that
can hold all of the bits, that's fine, but it's still a
shape-shifting operation, which I propose we avoid
in today's design.  When we add synthetic multi-vectors,
after Valhalla, we can re-introduce shape-shifting code
that is portable.  But we can't do that today if the number
of shapes is a dynamic property.  We need synthetic
multi-vector shapes to build out a shape-fluid user
experience.  Can't do that today.)

OK, so we have a byte-to-int cast that expands
from one input vector to four output vectors.  How
does the user keep them all straight?  I think a good
answer is to add the "part" parameter, noted above.

The part parameter is present in all lane-wise
shape-changing operations.  Zero is always a valid
argument.  For an operation which expands by a
factor of N, the valid range of part numbers is [0..N-1].
The meaning is simple:  It selects which "part"
of the output to return.  It is *not* a lane index
(and I don't want to generalize in that direction).
A user doing byte-to-int conversion will simply
know that there are four parts to deal with, and
work that into the algorithm.

(There are at least three ways to deal with parts:
Use part 0 only, which means the input vector
only uses 25% of its lanes.  You load 25% of the
input bytes at a time and then expand part 0
to a 100%-sized vector of the same shape.
Or, use a little 4-way loop over the parts,
disposing of each part separately.  Or,
unroll the little loop by hand, using 4
temps for parts 0..3.  Depends on the
application.)
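The "little loop over the parts" strategy can be sketched with plain arrays standing in for vectors (a byte-to-int conversion with expansion factor 4 is assumed; all names are illustrative, not the proposed API):

```java
// Illustrative model of a shape-invariant byte->int conversion:
// a "vector" of 16 byte lanes expands 4x, so part p (0..3) widens
// input byte lanes [p*4, p*4+4) into one full-shape int "vector".
static int[] convertPart(byte[] in, int part) {
    int lanes = in.length / 4;         // int lanes per output part
    int[] out = new int[lanes];
    for (int i = 0; i < lanes; i++)
        out[i] = in[part * lanes + i]; // widen one byte lane to int
    return out;
}

// The 4-way loop strategy: dispose of each part separately.
static int sumAllParts(byte[] in) {
    int sum = 0;
    for (int part = 0; part < 4; part++)
        for (int v : convertPart(in, part))
            sum += v;
    return sum;
}
```

The same `convertPart` serves the other two strategies as well: call it with part 0 only, or unroll the loop by hand into four temps.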

If the part number is out of range, you get
an array index exception.  The pseudo-code
of the reference implementation can pretend
that the conversion operation produces a tiny
array of 4 output vectors, and then the part
number is an index into that array.
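That pseudo-code model can be written down directly (illustrative only; a real implementation need not materialize the array):

```java
// Reference-implementation model: pretend the expanding byte->int
// conversion produces a tiny array of all 4 output "vectors" at once,
// then treat the part number as an index into that array.  An
// out-of-range part therefore raises an array index exception.
static int[] expandPart(byte[] in, int part) {
    int lanes = in.length / 4;               // int lanes per part
    int[][] parts = new int[4][lanes];       // all 4 parts, conceptually
    for (int p = 0; p < 4; p++)
        for (int i = 0; i < lanes; i++)
            parts[p][i] = in[p * lanes + i];
    return parts[part];  // part outside [0..3] -> ArrayIndexOutOfBoundsException
}
```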

If the conversion doesn't change shape, then
zero is the only part number you can ever use.
That could be defaulted, but it's not worth the
extra overloaded API point IMO.

Now, if the conversion contracts, we don't
strictly need a part number; we can use a
convention which is that the output bits are
placed at the beginning of the output vector.
But we also want a part number here.  Again,
zero means "just throw away everything except
the beginning of the computation."  But non-zero
part numbers mean "place the output into another
part of the output vector."  Why would we do this?
Because conversions sometimes come in pairs,
and we sometimes need to track lane values
through multiple conversions.  To do this sanely,
I propose that contracting operations have a part
number parameter which "steers" the lane values
in a way compatible with the inverse conversion,
with the same part number.  This means that
contracting with a non-zero part means that
the output is placed in a (zero-filled) vector
at lane VLENGTH*part/N, where N is the
contraction factor.  This means that immediately
following with an inverse, and the same part
number, will reproduce the original input.
That makes these methods much easier to
reason about.  (Maybe a contracting part
number should be either negative or zero,
to provide an extra error check.  There's no
burden on the user in adding a minus sign
to the method call, and it will better document
what's going on.)
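The steering rule and its round-trip property can be sketched with the same array model (an int-to-byte contraction with N = 4 is assumed; names are illustrative, not the proposed API):

```java
// Illustrative model of the contracting "part steering" rule:
// with contraction factor N = 4, part p places the narrowed lanes at
// offset VLENGTH*p/N in a zero-filled output of the same shape.
static byte[] contract(int[] in, int part) {
    int vlength = in.length * 4;          // byte lanes in the same shape
    byte[] out = new byte[vlength];       // zero-filled by default
    int offset = vlength * part / 4;      // VLENGTH * part / N
    for (int i = 0; i < in.length; i++)
        out[offset + i] = (byte) in[i];   // narrow each int lane to a byte
    return out;
}

// The inverse expanding conversion (byte->int), selecting the same part.
static int[] expand(byte[] in, int part) {
    int lanes = in.length / 4;
    int[] out = new int[lanes];
    for (int i = 0; i < lanes; i++)
        out[i] = in[part * lanes + i];
    return out;
}
```

With this convention, `expand(contract(x, p), p)` reproduces the original `x` for any in-range part number p, which is exactly the reasoning aid proposed above.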

I think this "part" idea has legs, and provides
a decent way to deal with multi-part results.

A beneficial side effect of keeping shape
invariance as a principle is that we can
concentrate the JIT code on using one
register type at a time, and do the
fancy footwork for operations like byte
to int conversion (with size changes)
in Java code where it belongs, rather than
in JIT code.  I hope to retire the existing
cast intrinsic, which is a highly complex
5-phase instruction selection problem,
replacing it with a suite of smaller more
flexible primitives, some to do shape
changing without conversion and others
to do conversion without shape changing.

— John

