[vector] Some thoughts on alternative vector loop shapes

Adam Pocock adam.pocock at oracle.com
Fri Aug 24 02:04:44 UTC 2018


If possible I’d like to keep the explicitly masked operations around. Masked operations allow me to treat each SIMD unit as an n-lane multiprocessor, which is very interesting. It means I can implement an SPMD model of execution, like CUDA or Intel’s ISPC (https://ispc.github.io/), though doing it by hand imposes a lot of overhead on the programmer.

That said, signalling to the programmer that the masked operations aren’t implemented in HW is a problem that isn’t well solved in the current API. I suspect some of the performance issues I’m having come from heavy use of masking in my code, which I’m running on AVX2 HW that doesn’t support masking natively.
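To make the SPMD point concrete, here is a rough scalar sketch (plain Java, no Vector API; all names invented) of how an if/else over lanes turns into two masked passes, the way ISPC or CUDA handle divergent control flow:

```java
// Hypothetical illustration only: emulating SPMD control flow over an
// n-lane "vector" with explicit masks. Each branch arm runs under a
// complementary mask instead of a real branch.
public class SpmdSketch {
    // For each active lane: r[i] = (a[i] < 0) ? -a[i] : a[i]
    static int[] absSpmd(int[] a, boolean[] active) {
        int[] r = new int[a.length];
        boolean[] thenMask = new boolean[a.length];
        boolean[] elseMask = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {   // evaluate the condition per lane
            thenMask[i] = active[i] && a[i] < 0;
            elseMask[i] = active[i] && a[i] >= 0;
        }
        for (int i = 0; i < a.length; i++)     // "then" side, masked
            if (thenMask[i]) r[i] = -a[i];
        for (int i = 0; i < a.length; i++)     // "else" side, masked
            if (elseMask[i]) r[i] = a[i];
        return r;
    }
}
```

Every lane pays for both arms of the branch, which is exactly why native mask support matters for this style.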

Adam

> On 23 Aug 2018, at 13:11, John Rose <john.r.rose at oracle.com> wrote:
> 
> On Aug 21, 2018, at 11:28 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
>> 
> >> The ugly part at the API level is that all operations have to be explicitly masked. It would be nice to have a way to define a context mask that applies to all operations. For example, have vector shapes which encapsulate both a vector and a mask:
>> 
>> for (int i = align; i < a.length; i += spec.length()) {
>>     Mask<Float,S> m = spec.index(a.length - i);
>>     FloatSpecies<S> mspec = spec.masked(m); // set context mask m
>>     var va = mspec.fromArray(a, i);         // fromArray(a, i, m)
>>     var vb = mspec.fromArray(b, i);         // fromArray(b, i, m)
>>     var vc = va.mul(vb);                    // va.mul(vb, m)
>>     vc.intoArray(c, i);                     // vc.intoArray(c, i, m)
>> }
>> 
> >> The mask becomes part of the vector shape and participates in type checking, so not only must the type and size match, but also the mask. Otherwise, an exception is thrown.
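To make the quoted proposal concrete, here is a plain-Java stand-in (class and method names invented, boolean arrays standing in for hardware masks) showing how a context mask could travel with the value and be re-applied implicitly:

```java
// Sketch only, not the real Vector API: the point is that the mask rides
// along with the vector and every operation applies it without being asked.
import java.util.Arrays;

public class MaskedSketch {
    static final class MVec {                  // "vector + mask" pair
        final float[] lanes;
        final boolean[] mask;
        MVec(float[] lanes, boolean[] mask) { this.lanes = lanes; this.mask = mask; }

        MVec mul(MVec other) {
            if (!Arrays.equals(mask, other.mask))   // masks participate in type checking
                throw new IllegalArgumentException("mask mismatch");
            float[] r = new float[lanes.length];
            for (int i = 0; i < lanes.length; i++)
                if (mask[i]) r[i] = lanes[i] * other.lanes[i];  // implicit masking
            return new MVec(r, mask);
        }
    }

    static MVec fromArray(float[] a, int offset, boolean[] mask) {
        float[] lanes = new float[mask.length];
        for (int i = 0; i < mask.length; i++)
            if (mask[i]) lanes[i] = a[offset + i];   // masked load
        return new MVec(lanes, mask);
    }

    static void intoArray(MVec v, float[] c, int offset) {
        for (int i = 0; i < v.mask.length; i++)
            if (v.mask[i]) c[offset + i] = v.lanes[i];  // masked store
    }
}
```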
> 
> Yes, I think this is a very promising approach.  The masked-vector
> will eventually be a value type which will have two components,
> a mask and a vector.  And each of those is probably itself a value
> type.  It will implement the unmasked portions of the vector API
> by applying its internal mask; it will implement the masked portions
> of the vector API by combining both internal and external masks.
> 
> When the JIT spins a loop for internally masked vectors, we can
> "encourage" it to speculate that the mask is usually full (all ones),
> and emit optimal unmasked vector operations along fast paths.
> Such speculation would need a mask test to ensure that the
> fast path is usable, but such tests would often fold or hoist.
> Such "encouragement" would be a special hack in the JIT,
> but it seems an appropriate one, since it would unlock many
> of the kinds of loops we want to run.  When speculating, the
> JIT would emit at least two copies of the main loop, one for
> trivial full masking and one for "random" masking.  I suppose
> the second copy could be handled by the cleanup post-loop
> and/or the setup pre-loop, which (happy for us) already exist
> in the JIT's loop transformer.  So the special mask speculation
> hack seems an economical tweak on pre-existing optimizations,
> rather than something new and difficult.
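As a hand-written analogue of that speculation (not real JIT output; names invented), the two loop copies might look like:

```java
// Sketch of the fast-path/slow-path split: test once whether the mask is
// full, run an unmasked loop when it is, fall back to per-lane masking
// otherwise. A JIT would emit this shape rather than the programmer.
public class SpeculationSketch {
    static boolean allTrue(boolean[] m) {
        for (boolean b : m) if (!b) return false;
        return true;
    }

    static void mulInto(float[] a, float[] b, float[] c, boolean[] mask) {
        if (allTrue(mask)) {
            for (int i = 0; i < a.length; i++)   // fast path: no per-lane test,
                c[i] = a[i] * b[i];              // maps to unmasked vector ops
        } else {
            for (int i = 0; i < a.length; i++)   // slow path: "random" masking
                if (mask[i]) c[i] = a[i] * b[i];
        }
    }
}
```

The `allTrue` check is the mask test mentioned above; in a loop over full-mask strips it folds or hoists away.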
> 
> If two masked vectors are combined we have two choices for
> when their internal masks differ:  Either throw an exception,
> as a shape mismatch, or logically combine the masks,
> perhaps taking their intersection.  The former tactic (an
> exception) seems draconian, but if it is acceptable in the
> user model, it is preferable, since the JIT can optimize
> it better.
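The two tactics might be sketched like this (boolean arrays standing in for masks; names invented):

```java
// Sketch of the two choices for combining vectors whose internal masks
// differ: throw on mismatch (JIT-friendly), or take the intersection.
import java.util.Arrays;

public class MaskCombineSketch {
    static boolean[] strictCombine(boolean[] m1, boolean[] m2) {
        if (!Arrays.equals(m1, m2))
            throw new IllegalArgumentException("shape mismatch: masks differ");
        return m1;                               // identical masks pass through
    }

    static boolean[] intersect(boolean[] m1, boolean[] m2) {
        boolean[] r = new boolean[m1.length];
        for (int i = 0; i < m1.length; i++)
            r[i] = m1[i] && m2[i];               // lane active only in both
        return r;
    }
}
```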
> 
> (Paradoxically, exceptions, which are very slow, can make
> code run much faster, as long as they never actually happen.
> The reason is that optimizers don't do well with fixup code
> that DWIMs a partially broken program, because such
> fixups lead to merge points inside inner loops, and merge
> points, which are Phis in the IR, impede reordering and
> folding of operations.  An unthrown exception, by contrast,
> is simply an exit from a loop gated by a highly predictable
> branch; this doesn't disturb the optimizer or the CPU from
> doing the real work of the loop.)
> 
> Another way to approach all this is to avoid masks (either
> external or internal).  We can create another implementation 
> of vectors, which has a data-dependent *length* rather than
> *mask*.
> 
> Such a vector could be (a) full-length, which is the usual case,
> (b) a partial suffix for pre-loops, (c) a partial prefix for
> post-loops, or (d) an arbitrary contiguous subsequence.
> Case (d) covers the other cases, but is hard to strength-reduce
> and optimize, while the others may be useful in
> particular boundary conditions (pre-loop, post-loop,
> short-loop).  Probably case (d) can be dropped completely.
> 
> Note that some platforms will have an easier time implementing
> partial-length vectors than masked vectors, especially cases
> (b) and (c), which can be done with log N steering bits instead
> of N mask bits (where N is the maximum number of lanes).
> 
> A concrete portable representation for a type (b) partial
> vector of 8 ints would be a value type that looks like
> this:
> 
> class IntVector256Suffix {
>     IntVector128 v4;  // half
>     IntVector64 v2;   // quarter
>     int v1;           // eighth
>     byte lanes;       // three bits in range [0..7]
> }
> 
> This is a complicated class, but can be (eventually) optimized
> as a value type into four separate registers.  The suffix vector
> would have at most 7 lanes, processed in order v4, v2, v1.
> The type (c) prefix version would look identical, but would
> process its parts in the opposite order v1, v2, v4.  Note that
> this log N format doesn't work as well for type (d), but we
> can build fast loops out of the other vector types.
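A scalar model of that suffix processing (invented names; int arrays in place of the sub-vectors) might look like:

```java
// Sketch of the log N idea above: a tail of up to 7 lanes is handled as
// half/quarter/single pieces selected by the bits of the lane count,
// rather than by an N-bit mask.
public class SuffixSketch {
    // Adds b[i] into a[i] for the first `lanes` elements (0..7), using
    // fixed-size chunks of 4, 2 and 1 in that order, as a type (b)
    // suffix vector would process v4, v2, v1.
    static void addSuffix(int[] a, int[] b, int lanes) {
        int off = 0;
        if ((lanes & 4) != 0) {                  // v4: "half" piece
            for (int i = 0; i < 4; i++) a[off + i] += b[off + i];
            off += 4;
        }
        if ((lanes & 2) != 0) {                  // v2: "quarter" piece
            for (int i = 0; i < 2; i++) a[off + i] += b[off + i];
            off += 2;
        }
        if ((lanes & 1) != 0) {                  // v1: single lane
            a[off] += b[off];
        }
    }
}
```

Only the three bits of `lanes` steer the work, which is the log N savings over an explicit per-lane mask.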
> 
> There is a serious problem with this approach which we are
> cleverly avoiding in Vladimir's masking proposal:  Polymorphism.
> A loop that uses prefix and suffix vectors would work on a sequence
> (e.g., iterator or stream) of vectors of a single interface but
> multiple implementations.  A maximally misaligned loop would
> need to start with a type (b) vector and end with a type (c)
> vector, with some normal type (a) vectors in the middle.
> 
> This could be expressed in Java, but the type profile on
> the vector operations would be polluted, causing the JIT
> to emit code for the pre-loop in the post-loop and vice
> versa, as well as (what is more serious) code for all three
> kinds of vectors in the main loop.  The type profile is a
> great tool for optimizing (as is inlining), but it doesn't solve
> the loop customization problem.  Still, if we had a way to
> split the loop at the library level, and hand the pre-loop's
> type profile only the type (b) vectors, and so on with the
> main- and post-loop, then the above approach would
> be highly portable and optimizable.
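To illustrate the loop split itself (a hand-written sketch with an assumed 8-lane width; not the proposed library mechanism), a misaligned range can be carved so the main loop only ever sees full strips:

```java
// Sketch of the pre-/main-/post-loop split: the main loop sees nothing
// but full-width (type (a)) strips; the ragged ends are handled apart.
public class LoopSplitSketch {
    static final int W = 8;                      // assumed vector width in lanes

    static long sum(int[] a, int from, int to) {
        long s = 0;
        int alignedStart = Math.min(to, (from + W - 1) & -W); // round up to multiple of W
        int alignedEnd = Math.max(alignedStart, to & -W);     // round down
        for (int i = from; i < alignedStart; i++)      // pre-loop: type (b) suffix work
            s += a[i];
        for (int i = alignedStart; i < alignedEnd; i += W)   // main loop: full vectors
            for (int j = 0; j < W; j++)
                s += a[i + j];
        for (int i = alignedEnd; i < to; i++)          // post-loop: type (c) prefix work
            s += a[i];
        return s;
    }
}
```

Splitting at the library level would hand each of these three loops a monomorphic type profile, which is the optimization the paragraph above is after.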
> 
> (To be honest, you can avoid polymorphism by adding a
> full-sized vector v8 next to the log N components v4/2/1.
> The length field could then take the value 8 to select v8
> alone, which would cause the other fields to go unused.
> Maybe that could be made to work.  Since it doubles
> the register usage, Vladimir's mask-based approach
> starts to look economical again, especially on platforms
> which support masking.)
> 
> On platforms which have only one vector size and
> no masking, the partial vector implementation sketched
> above is pretty good, although the sub-vectors (half,
> quarter, etc.) will boil down to simple power-of-two
> sized arrays.  Some such platforms will prefer to simply
> use a scalar counted loop for this case, but I have seen
> times when the log-N-way approach has been preferred
> over the scalar N-way approach.
> 
> Final note:  Vladimir's internally-masked vectors would
> also work really well with streams.  A stream over a set
> of internally masked vectors could parallelize naturally
> over multiple CPUs to efficiently process large data
> sets.  The spliterator for such a stream would know
> how to divide up the working data set on cache-line
> boundaries, and the per-CPU loops would almost always
> work on type (a) full-mask vectors.
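A rough sketch of that chunking with today's streams (invented names; a plain inner loop standing in for the full-mask vector loop):

```java
// Sketch of the stream idea: partition the data on fixed boundaries so
// every chunk but possibly the last is a full-mask, type (a) strip, then
// reduce the chunks in parallel.
import java.util.stream.IntStream;

public class ChunkedStreamSketch {
    static long parallelSum(int[] a, int chunk) {
        int nChunks = (a.length + chunk - 1) / chunk;
        return IntStream.range(0, nChunks).parallel()
                .mapToLong(c -> {
                    int from = c * chunk;
                    int to = Math.min(a.length, from + chunk); // last chunk may be partial
                    long s = 0;
                    for (int i = from; i < to; i++) s += a[i]; // full-width inner loop
                    return s;
                })
                .sum();
    }
}
```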
> 
> — John
