Vector API Latest Draft Spec

Fri Jul 14 20:20:02 UTC 2017

On Jul 14, 2017, at 12:51 PM, Graves, Ian L <ian.l.graves at intel.com> wrote:
> 
>> 2) I was pondering about masks and wondering whether Vector etc should
>> be parameterised by lane rather than bits. API-wise this is more appealing
>> when pushing/pulling from element sources as it's more obvious what the
>> quantities are. Masks are more easily usable across vector types. However
>> optimisation-wise this may become more tricky since a mask consisting of 8
>> lanes could be 8 bytes packed into 128 bits, or 8 ints packed into 512 bits (this
>> might work on AVX512 but there are likely other examples on AVX2 where
>> the register sizes don’t correspond). An alternative is to support a cast, which
>> would be optimal for cases where the lane and element bit size are the same
>> (namely transforming between Float/Int and Double/Long).
>> My inclination would be to explore the lane declaring route, as long as we are
>> confident the JIT can optimize. Note that generics help here but HotSpot still
>> presumably has to do checks when compiling since raw types can be used.
>> 
> 
> … When I experimented with Lane vs Bit-size, I ran into the most pain
> around casting.  Casting with a lane count parameter puts you into an
> undefined space.  When recasting to a Vector of a specific shape, you
> have the knowledge that the shape is unchanged by the cast.  This
> isn't the case with lane counts. Some casts will shorten or lengthen
> your vector's element count.  If you're parameterizing by lane number,
> you would have to loosen the constraint on the lane parameter in these
> operations or strengthen the assumptions about lane counts in the
> casting operations.

From the user point of view, bit-count is almost always uninteresting,
so it would seem to be a slam-dunk to reformulate the API to emphasize
lane count instead.  I went back and forth on this in my mind the first
time around, and realized that when I was coding with vectors (using
gcc and immintrin.h) I had to keep bit-size (aka register type) foremost
in mind, even when I wanted to deal only with lane count.  The distinction
was important in cases where I needed to convert from a vector of
bytes to a vector of ints (short reason:  the bytes were being shoveled
from memory into a vector of 32-bit accumulators), and sometimes
vice versa.  These same-lane-different-size conversions were extremely
tricky and expensive, and I knew I had to minimize them.  My take-away
from this experience is that changing register types is expensive
and should be avoided.  (That includes vector to scalar, and also
vector of size A to vector of size B.)  When I tried to express that
mindset in a Java API, it seemed best to have the register type
be tracked in the static type system, so that same-register-type
conversions would feel more natural than same-lane-count
conversions.
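
To make the byte-to-int case concrete, here is a minimal plain-Java sketch of
the lane arithmetic, assuming a 128-bit register. The class and method names
are hypothetical and only model the bookkeeping; no real vector instructions
are involved, and the input length is assumed to be a multiple of 16.

```java
// Hypothetical illustration only: models lane counts, not real SIMD code.
final class WidenSketch {
    static final int REGISTER_BITS = 128; // assumed register size (xmm-like)

    // Number of lanes a register of REGISTER_BITS holds for a given element size.
    static int lanes(int elementBits) {
        return REGISTER_BITS / elementBits;
    }

    // One 128-bit load yields 16 bytes, but a 128-bit accumulator holds only
    // 4 ints, so each load must be widened in lanes(8) / lanes(32) = 4 steps.
    static int widenStepsPerLoad() {
        return lanes(8) / lanes(32);
    }

    // Sums bytes into a 4-lane int accumulator, one 16-byte "load" at a time.
    // Assumes data.length is a multiple of 16; any remainder is ignored here.
    static int[] accumulate(byte[] data) {
        int[] acc = new int[lanes(32)];
        int bytesPerLoad = lanes(8);
        for (int base = 0; base + bytesPerLoad <= data.length; base += bytesPerLoad) {
            for (int part = 0; part < widenStepsPerLoad(); part++) {
                for (int lane = 0; lane < acc.length; lane++) {
                    acc[lane] += data[base + part * acc.length + lane];
                }
            }
        }
        return acc;
    }
}
```

Keeping the lane count at 16 instead would require a 512-bit accumulator
(16 lanes of 32 bits), which is exactly the change of register type that the
paragraph above argues should be avoided.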

That's what a Vector.Shape is:  A register type.  There is one
shape for each of xmm, ymm, and zmm registers.  There could
be other shapes too, including variable-sized shapes for
architectures that support those.  A "register type" can also
be synthesized from more than one hardware resource,
so there could also be a shape for "masked register" which
is (on AVX2) a pair of registers of the same size or (on AVX512)
a vector register plus a mask register.
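
As a rough sketch of what "shape in the static type system" could look like,
the Java below uses a type parameter for the shape. Every name here (Shape,
S128, S256, S512, Vec, reinterpret, resize) is made up for illustration and is
not taken from the draft spec.

```java
// Hypothetical types for illustration; not the draft API.
interface Shape {
    int bitSize();
}

final class S128 implements Shape { public int bitSize() { return 128; } } // xmm-like
final class S256 implements Shape { public int bitSize() { return 256; } } // ymm-like
final class S512 implements Shape { public int bitSize() { return 512; } } // zmm-like

// A vector whose register type (shape) is part of its static type.
final class Vec<S extends Shape> {
    final S shape;
    final int elementBits;

    Vec(S shape, int elementBits) {
        this.shape = shape;
        this.elementBits = elementBits;
    }

    // Lane count is derived from the shape, not declared.
    int length() {
        return shape.bitSize() / elementBits;
    }

    // Same-shape cast: the return type keeps S, the lane count may change.
    Vec<S> reinterpret(int newElementBits) {
        return new Vec<>(shape, newElementBits);
    }

    // Shape-changing conversion: the caller has to name the new shape, so the
    // change of register type is visible in the source and in the types.
    <T extends Shape> Vec<T> resize(T newShape) {
        return new Vec<>(newShape, elementBits);
    }
}
```

With this arrangement the cheap conversion is the one that type-checks without
mentioning a new shape, and the expensive one stands out at the call site.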

Another application of Shape:  I think we want to experiment with
a Shape which is (logically) a set of registers of each possible size
other than the maximum size, plus a scalar register, plus a set of bits,
one for each of the preceding register types.  What for?  For the post-loop
which deals with the N%S remaining array elements (length N, vector
size S).  Those post-loops are very architecture dependent, but they
boil down to either a set of (lg S)-1 conditional stores of decreasing
size, or a single masked store of size S.  The proposed Shape would
hide the mechanics of this operation, and simply code-generate to
the post-loop (which obviously shouldn't be a loop at all).
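
The two post-loop strategies can be sketched in scalar Java as follows. This
is only a model: addChunk stands in for one vector operation of the given
width, all names are hypothetical, the lane count is assumed to be a power of
two, and src is assumed to be at least as long as dst.

```java
// Hypothetical scalar model of the two post-loop strategies.
final class TailSketch {
    // Stand-in for one vector operation over 'size' consecutive elements.
    static void addChunk(int[] dst, int[] src, int from, int size) {
        for (int k = 0; k < size; k++) {
            dst[from + k] += src[from + k];
        }
    }

    // Strategy 1: full-width main loop, then conditional chunks of
    // decreasing power-of-two size for the n % lanes leftover elements.
    static void addAll(int[] dst, int[] src, int lanes) {
        int n = dst.length;
        int i = 0;
        for (; i + lanes <= n; i += lanes) {
            addChunk(dst, src, i, lanes);
        }
        int remaining = n - i; // equals n % lanes
        for (int size = lanes >> 1; size >= 1; size >>= 1) {
            if ((remaining & size) != 0) { // one conditional step per size
                addChunk(dst, src, i, size);
                i += size;
            }
        }
    }

    // Strategy 2: a single full-width operation under a mask that enables
    // only the first 'remaining' lanes. This just builds the mask.
    static boolean[] tailMask(int remaining, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            mask[lane] = lane < remaining;
        }
        return mask;
    }
}
```

Note that the tail in addAll is straight-line conditional code rather than a
loop over elements, which is the sense in which the post-loop should not be a
loop at all.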

(Having a complete story for pre- and post-loops will make it
easier to generify real loops as found in the wild.  Eventually,
handling general loops will become simple and robust enough
that we can transition the vector work into the implementation
of the JDK Stream API, which will give us a 2-4x performance
bump out of the box.  As everybody knows by now, getting full
power from your cores means fork-join at the macro-scale and
full vectorization at the micro-scale.)

There may be other applications of shape for registers with special
purpose operations (but I'm not sure about that).  Either Shape or Vector
should (as already said) support specially-provided dynamically
determined types with super-powers not mentioned in the standard
API.  VNNI is a good example.  So is crypto in its many forms.
Perhaps types of the form <T extends Vector & VNNIVectorMixin>
will appear, where some of the types are platform-specific.
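
For what it's worth, an intersection-typed bound like that can be written in
today's Java. The sketch below is purely illustrative: the Vector interface
here is a local stand-in, VNNIVectorMixin and its dotProduct method are
invented for the example, and the "VNNI" implementation is ordinary scalar
code that does not model the real instruction semantics.

```java
// Local stand-in for the API's vector type; hypothetical.
interface Vector {
    int length();
    byte lane(int i);
}

// Hypothetical platform-specific capability.
interface VNNIVectorMixin {
    int dotProduct(byte[] weights);
}

final class PlainByteVector implements Vector {
    private final byte[] lanes;
    PlainByteVector(byte[] lanes) { this.lanes = lanes.clone(); }
    public int length() { return lanes.length; }
    public byte lane(int i) { return lanes[i]; }
}

final class VnniByteVector implements Vector, VNNIVectorMixin {
    private final byte[] lanes;
    VnniByteVector(byte[] lanes) { this.lanes = lanes.clone(); }
    public int length() { return lanes.length; }
    public byte lane(int i) { return lanes[i]; }
    // Scalar placeholder for what would be a platform instruction.
    public int dotProduct(byte[] weights) {
        int sum = 0;
        for (int i = 0; i < lanes.length; i++) sum += lanes[i] * weights[i];
        return sum;
    }
}

final class Kernels {
    // Only accepts vectors that statically carry the extra capability.
    static <T extends Vector & VNNIVectorMixin> int fastDot(T v, byte[] weights) {
        return v.dotProduct(weights);
    }

    // Portable entry point: checks for the capability dynamically and
    // otherwise falls back to a lane-by-lane computation.
    static int dot(Vector v, byte[] weights) {
        if (v instanceof VNNIVectorMixin) {
            return ((VNNIVectorMixin) v).dotProduct(weights);
        }
        int sum = 0;
        for (int i = 0; i < v.length(); i++) sum += v.lane(i) * weights[i];
        return sum;
    }
}
```

Code that never mentions the mixin stays portable, while code that names it in
a bound opts in to the platform-specific type explicitly.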

This is exciting work for me because I think we can use Java's
robust OO modeling capabilities, plus value types when they come
on-line, to make Java the best general purpose language for
vector programming.  It's a pretty good aspiration, at least.

— John