[vector] Some thoughts on alternative vector loop shapes

Thu Aug 23 20:11:42 UTC 2018

On Aug 21, 2018, at 11:28 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
> The ugly part on API level is that all operations have to be explicitly masked. It would be nice to have a way to define a context mask to all operations. For example, have vector shapes which encapsulates both vector and a mask:
> 
>  for (int i = align; i < a.length; i += spec.length()) {
>      Mask<Float,S> m = spec.index(a.length - i);
>      FloatSpecies<S> mspec = spec.masked(m); // set context mask m
>      var va = mspec.fromArray(a, i);         // fromArray(a, i, m)
>      var vb = mspec.fromArray(b, i);         // fromArray(b, i, m)
>      var vc = va.mul(vb);                    // va.mul(vb, m)
>      vc.intoArray(c, i);                     // vc.intoArray(c, i, m)
>  }
> 
> Mask becomes part of vector shapes and participates in type checking, so not only type and size should match, but also the mask. Otherwise, an exception is thrown.

Yes, I think this is a very promising approach.  The masked-vector
will eventually be a value type which will have two components,
a mask and a vector.  And each of those is probably itself a value
type.  It will implement the unmasked portions of the vector API
by applying its internal mask; it will implement the masked portions
of the vector API by combining both internal and external masks.

When the JIT spins a loop for internally masked vectors, we can
"encourage" it to speculate that the mask is usually full (all ones),
and emit optimal unmasked vector operations along fast paths.
Such speculation would need a mask test to ensure that the
fast path is usable, but such tests would often fold or hoist.
Such "encouragement" would be a special hack in the JIT,
but it seems an appropriate one, since it would unlock many
of the kinds of loops we want to run.  When speculating, the
JIT would emit at least two copies of the main loop, one for
trivial full masking and one for "random" masking.  I suppose
the second copy could be handled by the cleanup post-loop
and/or the setup pre-loop, which (happy for us) already exist
in the JIT's loop transformer.  So the special mask speculation
hack seems an economical tweak on pre-existing optimizations,
rather than something new and difficult.

If two masked vectors are combined we have two choices for
when their internal masks differ:  Either throw an exception,
as a shape mismatch, or do logically combine the masks,
perhaps taking their intersection.  The former tactic (an
exception) seems draconian, but if is acceptable in the
user model, it is preferable, since the JIT can optimize
it better.

(Paradoxically, exceptions, which are very slow, can make
code run much faster, as long as they never actually happen.
The reason is that optimizers don't do well with fixup code
that DWIMs a partially broken program, because such
fixups lead to merge points inside inner loops, and merge
points, which are Phis in the IR, impede reordering and
folding of operations.  An unthrown exception, by contrast,
is simply an exit from a loop gated by a highly predictable
branch; this doesn't disturb the optimizer or the CPU from
doing the real work of the loop.)

Another way to approach all this is to avoid masks (either
external or internal).  We can create another implementation 
of vectors, which has a data-dependent *length* rather than
*mask*.

Such a vector could be (a) full-length, which is the usual case,
(b) a partial suffix for pre-loops, (c) a partial prefix for
post-loops, or (d) an arbitrary contiguous subsequence.
Case (d) covers the other cases, but is hard to strength
reduce and optimize, while the others may be useful in
particular boundary conditions (pre-loop, post-loop,
short-loop).  Probably case (d) can be dropped completely.

Note that some platforms will have an easier time implementing
partial-length vectors than masked vectors, especially cases
(b) and (c), which can be done with log N steering bits instead
of N mask bits (where N is the maximum number of lanes).

A concrete portable representation for a type (b) partial
vector of 8 ints would be a value type that looks like
this:

class IntVector256Suffix {
  IntVector128 v4;  // half
  IntVector64 v2;  // quarter
  int v1;  // eighth
  byte lanes;  // three bits in range [0..7]
}

This is a complicated class, but can be (eventually) optimized
as a value type into four separate registers.  The suffix vector
would have at most 7 lanes, processed in order v4, v2, v1.
The type (c) prefix version would look identical, but would
process its parts in the opposite order v1, v2, v4.  Note that
this log N format doesn't work as well for type (d), but we
can build fast loops out of the other vector types.

There is a serious problem with this approach which we are
cleverly avoiding in Vladimir's masking proposal:  Polymorphism.
A loop that uses prefix and suffix vectors would work on a sequence
(e.g., iterator or stream) of vectors of a single interface but
multiple implementations.  A maximally misaligned loop would
need to start with a type (b) vector and end with a type (c)
vector, with some normal type (a) vectors in the middle.

This could be expressed in Java, but the type profile on
the vector operations would be polluted, causing the JIT
to emit code for the pre-loop in the post-loop and vice
versa, as well as (what is more serious) code for all three
kinds of vectors in the main loop.  The type profile is a
great tool for optimizing (as is inlining), but it doesn't solve
the loop customization problem.  Still, if we had a way to
split the loop at the library level, and hand the pre-loop's
type profile only the type (b) vectors, and so on with the
main- and post-loop, then the above approach would
be highly portable and optimizable.

(To be honest, you can avoid polymorphism by adding a
full-sized vector v8 next to the log N components v4/2/1.
The length field could then take the value 8 to select v8
alone, which would cause the other fields to go unused.
Maybe that could be made to work.  Since it doubles
the register usage, Vladimir's mask-based approach
starts to look economical again, especially on platforms
which support masking.)

On platforms which have only one vector size and
no masking, the partial vector implementation sketched
above is pretty good, although the sub-vectors (half,
quarter, etc.) will boil down to simple power-of-two
sized arrays.  Some such platforms will prefer to simply
use a scalar counted loop for this case, but I have seen
times when the log-N-way approach has been preferred
over the scalar N-way approach.

Final note:  Vladimir's internally-masked vectors would
also work really well with streams.  A stream over a set
of internally masked vectors could parallelize naturally
over multiple CPUs to efficiently process large data
sets.  The spliterator for such a stream would know
how to divide up the working data set on cache-line
boundaries, and the per-CPU loops would almost always
work on type (a) full-mask vectors.

— John