[vector] representation of VectorShuffle

Wed Apr 28 18:34:09 UTC 2021

Here are a note on VectorShuffle design.  It is not intended
to influence the release we are presently working toward,
but I thought I would send it out for the record, so we can
be thinking about these issues.

Currently, VectorShuffle uses bytes internally to store its indexes.
Using the sign bit to represent invalid indexes, this supports vectors
with up to 128 lanes.  For vectors with more lanes (7 < ceil lg VLEN)
we need a wider representation.  Perhaps shorts (16 > ceil lg VLEN)
is enough; it would also be Java-like to use ints (32 >> ceil lg VLEN).

I suppose one clever move would be to right-size the internal
representation, using byte[] when possible and short[] otherwise.
This means we’d have two internal subtypes of VectorShuffle.
(It’s hard to imagine needing larger ones.)  That said, the
advantage of using byte[] over short[] is just a 2x bump on
a small datum, which is perhaps not worth the code complexity.
So, choice #1:  Keep byte, vs. use both byte & short, vs. move
to short only.  I think I like the last, but I’m not sure.

There is a subtle linkage between VectorShuffle and VectorShape.
Basically, we are encouraging users to find the best shape and
use it consistently, on the theory that fewer shapes can sometimes
make it easier for the JIT to optimize.  This is probably the main
reason for the odd API point VectorShuffle::toVector, which
packs the shuffle indexes into a vector of the kind as the vector
which the shuffle operates on.  The point here (which may be
misguided) is that if users are working with data-dependent
shuffles, it is useful to have fewer vector types floating around
in the code.  If VectorShuffle::toVector were to return a
Vector<Short> with the same VLEN as the target vector,
the resulting shape will be different from that of the
target vector (except when it too is a Vector<Short>),
and so the JIT will be register allocating two kinds of
vectors, and the CPU might even concurrently process
two sizes of data.  So we mandate that VS::toVector return
a vector of byte indexes if the target is Vector<Byte>, and
a vector of double indexes if the target is Vector<Double>,
and so on.

This choice is odd for several reasons.  If lg VLEN > 7,
then the indexes won’t even fit into the lanes of a
Vector<Byte>.  And for a Vector<Long>, the lanes are
ludicrously large—which is tolerable, though odd.
For a Vector<Double> or Vector<Float>, the lanes are
*in the wrong format* relative to the internals of
VectorShuffle, so code which creates data-dependent
shuffles of FP vectors may well have accidental
conversions between float and integral types.

There is probably a way to design the connection
between shuffles and vectors in a better way.  The
constraints I am looking at are, for a shuffle type
S that represents a Vector<F> of some VLEN:

A. Shuffle indexes are wide enough to represent 2*VLEN
values (a full set of both valid and invalid indexes).

B. The shuffle can convert efficiently to a related vector
type, and back again, with good optimization.  (And not
FP/int conversions.)  Note that different platforms may
use different basic representations.  Some platforms will
prefer byte-wise permutation maps (PSHUFB), while
others may prefer lane-wise permutation maps.

C. When representing shuffles in registers as data
dependent values, the JIT is not required to use
extra register types (it can work with target
vector registers, if that helps).

D. When representing constant or loop-invariant
shuffles (and expressions that yield shuffles, such
as s.wrap()), the JIT can choose appropriately between
constant bits in memory or rematerialization at
point of use.

These are complicated requirements.  Here’s one
way we could attempt to meet them all.  This is
not the only possible set of choices, but maybe it’s
good enough to help the conversation to continue.

A. Always use 16-bit indexes internally, both
natively in shuffles and in their vector representations.
Do this even if the user-visible index types appear to be
byte or long or int.  (If so, take the hit of converting
between internal and user-visible as needed; this
should be optimizable to a nop for round trips,
given internally normalized lane values.)

B. For any vector species VS, define a related species
VS.shuffleSpecies which is predictably and portably
derived from VS.  Here are two rules that might work:

1. If VS lanes are wide enough to represent all VS.VLEN
values plus an equal number of invalid ones, then
VS.SS has the same shape as VS.  (“Wide enough” is
defined as VS.SS.ESIZE > ceil lg VS.VLEN.)

Example:  Each species of short, int, or long vector
is its own shuffle species.  A species of byte vector
is its own shuffle species if and only if its VLEN
is 128 or less.

2. If VS lanes are not wide enough, then VS.SS has
a shape built by doubling the shape of VS until it
is large enough.  For now, this only happens if
VS.ETYPE is byte, and only one doubling, to short,
is needed.  If we ever do smaller lanes (say, bit
vectors) more doubling would be needed, and
likewise if we do huge vectors needing int indexes.

Example:  A species of byte vector has a doubled
shuffle species of type short, if and only if its VLEN
is 129 or more.

3. The element type of the VS.SS is always integral.
If SS.ETYPE is floating, then VS.SS.ETYPE is the
integral type of the same ESIZE.

Example:  A species of double or float vector has a
shuffle species of the same shape but with long
or int lanes, respectively.  This would go for
small floats also, if we ever support such types.

C. The shape doubling, when it occurs, is defined
to use software, creating register pairs, not using
other kinds of jumbo registers.  This is true even if
the hardware in fact supports both register
sizes at the same time.  The immediate implication
is that the Vector API needs new “synthetic” shapes
to double each kind of “native” shape.

D. The bits in a live register representing a shuffle
value should be as close as possible to any native
representation supported by the hardware.  This
means that the user-visible value of Shuffle::asVector,
which is a series of lane indexes, one per lane,
might be different from the internal bit pattern
of the shuffle itself, which might be a set of byte
indexes, with (perhaps) several sub-indexes per lane.

If there is such a difference, there should be intrinsic
optimizable conversions between the user-visible
vector representation and the internal representation.
The conversions should support nop round trips.

The in-memory representation of shuffle constants
should be the internal representation, not the user
visible vector, nor an array of Java shorts.  This is
probably important for fast constant loading.

I see a few potential costs in the above design, compared
with our present design.

First, it’s a little harder to predict the shape of a shuffle
vector, if the user wants to do vector-wise computation
of data dependent shuffles.  I hope this won’t be a serious
cost, especially compared with the present design, which
requires floating point index manipulation.

Second, we have to add a new vector shape to support
doubling of byte vectors with more than 128 lanes.
The only shape that is affected is S_Max_BIT, so we
need a new shape S_Max_BIT_X2.  (BTW, should that
be S_MAX_BIT?)  Presumably, the S_Max_BIT_X2
is big enough to manage its own shuffles:  It would have
to be more than 32,768 lanes wide to require re-doubling.

Third, building out support for synthetic doubled
vectors is a chore.  Code which shuffles a doubled
vector requires (up to) four single-sized shuffles.
Specifically, when shuffling A:B into C:D, you need
to merge the parts for C from both A and B, and
likewise for D.

Some or all of the above design may be an improvement
on what we have now with VectorShuffle.

As I said at the top, this is food for thought, not a call
to immediate action.

— John

P.S. I will pull a little more on the string of shape doubling,
in a separate message.  The chore might pay off.