Vector.shiftER, Vector.shiftEL not working as expected.

Wed Oct 11 00:57:52 UTC 2017

On Oct 10, 2017, at 12:06 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> IIUC correctly those instructions are for logical shift operations on elements and are not a lane-wise shift. I am guessing for the latter some form of permute would be used.

I think we need to be scrupulous to distinguish between the operations
which are inside-the-lane maps of scalar operations, and the operations
which move data across lanes.  The former are usually more efficient
than the latter, and it is always a muddle when terminology confuses
the two.

The in-lane operations are scalar operations "lifted" to work lane-wise,
and they should keep the scalar names:  add, sub, mul, and also shift,
rotate, etc.

The cross-lane operations need their own separate style of API point.
For example, names like "shuffle" and "permute" clearly apply to lane
structure, and cannot be confused with lifted elemental operations.

(Unfortunately, even the word "lane-wise" is tricky; I think of "lane-wise"
as "a lifted scalar operating within each lane", but I think you used it
above in the opposite sense.  Not sure how to pick clear terms here.)
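
To make the distinction concrete, here is a minimal sketch in plain Java.
It models a vector as an int[] and is not the Vector API itself; the class
and method names (LaneShiftSketch, shiftBitsInLane, moveLanesUp) are
hypothetical and chosen only for illustration.  The first method is a
lifted scalar shift; the second moves whole elements between lanes.

```java
import java.util.Arrays;

class LaneShiftSketch {
    // In-lane: the scalar << operator applied independently inside every lane.
    static int[] shiftBitsInLane(int[] lanes, int bits) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = lanes[i] << bits;
        }
        return out;
    }

    // Cross-lane: the element in lane i moves to lane i + distance.
    // Vacated low lanes are filled with zero (an assumption of this sketch).
    // Assumes distance >= 0.
    static int[] moveLanesUp(int[] lanes, int distance) {
        int[] out = new int[lanes.length];
        for (int i = 0; i + distance < lanes.length; i++) {
            out[i + distance] = lanes[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4};
        System.out.println(Arrays.toString(shiftBitsInLane(v, 1))); // [2, 4, 6, 8]
        System.out.println(Arrays.toString(moveLanesUp(v, 1)));     // [0, 1, 2, 3]
    }
}
```

Both could be called "shift", which is exactly the problem:  the results
have nothing in common.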

It is a happy accident that in a few cases bitwise operations can ignore
lane boundaries (xor, and, ior).  But in most cases confusing the two
sets of operations will just muddle our discussions.  For this reason
ambiguous phrases like "elemental shift" set my teeth on edge:  I
immediately feel lost as to whether we are talking in-lane or cross-lane
semantics.  Put another way, the term "elemental" makes it clear that we
are talking about elements, but doesn't help us understand whether
we are talking about the values inside the elements (in-lane work)
or "outside the lane" motion of the elements among themselves
(cross-lane work).  It won't always help to just use Intel mnemonics or
conventions, since we are trying to document a portable semantics.
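
The "happy accident" can be checked with a small sketch, assuming a 64-bit
long stands in for a vector of two 32-bit lanes.  The names
(LaneBoundarySketch, xorAsIntLanes, addAsIntLanes) are hypothetical.  XOR
computed lane by lane matches XOR on the whole word, while ADD does not,
because the carry out of the low lane is dropped at the lane boundary.

```java
class LaneBoundarySketch {
    // XOR performed separately in the low and high 32-bit lanes.
    static long xorAsIntLanes(long a, long b) {
        int lo = (int) a ^ (int) b;
        int hi = (int) (a >>> 32) ^ (int) (b >>> 32);
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    // ADD performed separately in each lane; no carry crosses the boundary.
    static long addAsIntLanes(long a, long b) {
        int lo = (int) a + (int) b;
        int hi = (int) (a >>> 32) + (int) (b >>> 32);
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long a = 0x00000001_FFFFFFFFL;
        long b = 0x00000000_00000001L;
        System.out.println(xorAsIntLanes(a, b) == (a ^ b)); // true
        System.out.println(addAsIntLanes(a, b) == (a + b)); // false
    }
}
```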

The load and store operations *almost* have the same happy accident
as xor, of not caring about lane structure, *except* for the order of elements.
For that, Java needs to impose a convention, even if it seems to conflict
with the way the hardware documents the numbering of lanes.
Lane zero has to mean the lowest-numbered array element, or we will
have endless troubles with portability.

This also means that it is risky, and probably counterproductive, for the
Java API to try to expose a notion of "left" and "right" lane directions across
the whole vector, even if the hardware documentation talks about such
things.  (Hardware documentation can do so because it commits to a
platform-specific byte order.  Java cannot; or if it does, the byte order
has to be customizable, as in NIO.)
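
As an illustration of the NIO style, here is a minimal sketch that loads
int lanes from a byte array with an explicit byte order.  The class and
method names (LaneOrderSketch, loadIntLanes) are hypothetical; ByteBuffer,
ByteOrder and IntBuffer are the standard java.nio types.  Lane zero always
comes from the lowest-numbered array elements; the byte order only changes
how the bytes inside each lane are interpreted.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class LaneOrderSketch {
    // Lane i is read from bytes[4*i .. 4*i+3], whatever the byte order.
    // Trailing bytes that do not fill a whole lane are ignored.
    static int[] loadIntLanes(byte[] bytes, ByteOrder order) {
        int[] lanes = new int[bytes.length / 4];
        ByteBuffer.wrap(bytes).order(order).asIntBuffer().get(lanes);
        return lanes;
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 0, 0, 0, 2, 0, 0, 0};
        // Prints [1, 2]
        System.out.println(Arrays.toString(
                loadIntLanes(bytes, ByteOrder.LITTLE_ENDIAN)));
        // Prints [16777216, 33554432]
        System.out.println(Arrays.toString(
                loadIntLanes(bytes, ByteOrder.BIG_ENDIAN)));
    }
}
```

In both cases the lane order follows the array order; nothing here needs a
notion of "left" or "right" across the whole vector.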

Note that shape abstraction makes whole-vector operations less useful
for many purposes.  If you don't know the size of your vector, cross-lane
operators like shuffle are pretty hard to use.  (Not impossible, of course.)
Keeping programmers away from the hard-to-use cross-lane operations
is another reason to give them names which cannot be confused with the
more commonly used in-lane operations.

To summarize:  Always make it clear whether operations are within
lanes or not (i.e., across the lanes of a whole vector); use natural
terms for lifted in-lane operations; use a separate vocabulary for
cross-lane (whole vector) operations.

— John

P.S.  A challenge:  Extend the API so that nearest-neighbor computations
are supported, within some limited distance, allowing stencils to be
programmed.   Do so without exposing vector sizes.  As in the case
of partial vector loop cleanups, this probably requires some sort of
vector-shape abstraction that "mixes in" contextual access to neighbors
which may be in a nearby vector.  Perhaps this is most naturally used
inside a Stream of two-vector context items, although we can't optimize
that very well yet.