VectorAPI VectorInsert Intrinsic

Fri Jun 8 19:53:01 UTC 2018

On Jun 8, 2018, at 5:33 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
> On 08/06/2018 02:50, Rukmannagari, Shravya wrote:
>> Hi Vladimir,
>> Thanks for reviewing the patch. I updated the patch to reflect the changes. Regarding the variable indexed element, we haven't given it much thought yet, but that seems to be the ultimate design goal for inserting; I can work on it in the next few days.
> 
> I'd like to hear other opinions, but IMO it's quite an important case to cover w.r.t. the predictable performance goal.
> 
> Regarding implementation considerations, I'd try to extract variable index variants into vector stubs and share them among all use sites.

+1

We are getting good coverage of intra-lane SIMD operations, but
we have more work to do on data movement operations.  The first
thing to aim at (which we are doing right now) is constant data
movements, such as moving a scalar to or from a fixed lane
(i.e., with a constant index), or doing simple fixed permutations
of lanes, such as shift or rotate (keeping lane contents
unchanged, but moving them within the overall vector).

I see three layers to the data movement problem:  1. Fixed
movements, 2. Variable data movements selected from a
small regular set, 3. General data movements selected
from a large set of possible permutations or steering
vectors.

Examples:  1. insert x into lane 0; 2. insert x into lane i;
3. remove masked lanes and compress remaining lanes.
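
To make the third example concrete, here is a minimal scalar sketch of
"remove masked lanes and compress". It is only a model: an int[] stands in
for a vector, the class and method names are hypothetical, and two details
are assumptions not stated above (a true mask bit means "remove this lane",
and vacated tail lanes are zero-filled).

```java
import java.util.Arrays;

// Scalar model only: an int[] stands in for a vector register.
final class CompressModel {
    // remove[k] == true means lane k is dropped (assumed mask polarity).
    static int[] compress(int[] v, boolean[] remove) {
        int[] r = new int[v.length];   // unused tail lanes stay zero (assumption)
        int out = 0;
        for (int k = 0; k < v.length; k++) {
            if (!remove[k]) {
                r[out++] = v[k];       // surviving lanes pack toward lane 0
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {10, 20, 30, 40};
        boolean[] remove = {false, true, false, true};
        System.out.println(Arrays.toString(compress(v, remove)));
    }
}
```

The number of distinct movements here grows as 2^lanes (one per mask value),
which is what separates this layer from the small regular set in layer 2.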

A multi-way branch is one way to do the second layer, that of simple
non-constant vector movements.  You need one code path per
distinct movement.
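
A rough scalar sketch of the multi-way branch, assuming a 4-lane vector
modeled as an int[]. The names (BranchInsert, insertLane0, ...) are
hypothetical; in a real implementation each case would be a constant-index
insert instruction or a shared vector stub rather than an array store.

```java
import java.util.Arrays;

// Scalar model only: an int[4] stands in for a 4-lane vector.
final class BranchInsert {
    // Each helper models one fixed (constant-index) insert code path.
    static int[] insertLane0(int[] v, int x) { int[] r = v.clone(); r[0] = x; return r; }
    static int[] insertLane1(int[] v, int x) { int[] r = v.clone(); r[1] = x; return r; }
    static int[] insertLane2(int[] v, int x) { int[] r = v.clone(); r[2] = x; return r; }
    static int[] insertLane3(int[] v, int x) { int[] r = v.clone(); r[3] = x; return r; }

    // Variable-index insert: dispatch to the matching constant-index path.
    static int[] insert(int[] v, int i, int x) {
        switch (i) {
            case 0: return insertLane0(v, x);
            case 1: return insertLane1(v, x);
            case 2: return insertLane2(v, x);
            case 3: return insertLane3(v, x);
            default: throw new IndexOutOfBoundsException("lane " + i);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(insert(new int[]{10, 20, 30, 40}, 2, 99)));
    }
}
```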

Another way is to use a data-dependent permutation to move either
the source or destination into position.  At worst, a PSHUF steering
vector, indexed from a static table by the non-constant lane number,
will move stuff around.  You need one table element per distinct
movement.

Proof of concept:  To insert in lane i (i non-constant), permute lane i into
lane zero, insert into lane zero (constant location), then un-permute lane
zero back to lane i.  The steering vectors are pre-provisioned constants
stored in a static array, indexed by i.  There are probably better ways to
do this, depending on what instructions are available.  For example,
if there is a data-dependent lane-rotate operation, that might be faster
than loading and applying a static steering vector.
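
The proof of concept can be sketched in scalar Java as follows. This is a
model under stated assumptions, not the intrinsic itself: vectors are int[]
arrays, permute() gathers lanes the way a byte-shuffle steering vector would
(result[k] = v[steer[k]]), and the steering vector for index i is assumed to
be the swap of lanes 0 and i. The swap is my choice for illustration; because
it is its own inverse, the same table entry serves for both the permute and
the un-permute. All names are hypothetical.

```java
import java.util.Arrays;

// Scalar model only: an int[4] stands in for a 4-lane vector.
final class SteeredInsert {
    static final int LANES = 4;

    // Pre-provisioned steering vectors, indexed by lane number i.
    // STEER[i] is the identity with entries 0 and i exchanged.
    static final int[][] STEER = buildSteeringTable();

    static int[][] buildSteeringTable() {
        int[][] t = new int[LANES][LANES];
        for (int i = 0; i < LANES; i++) {
            for (int k = 0; k < LANES; k++) {
                t[i][k] = k;
            }
            t[i][0] = i;
            t[i][i] = 0;
        }
        return t;
    }

    // Gather-style permutation: lane k of the result comes from lane steer[k].
    static int[] permute(int[] v, int[] steer) {
        int[] r = new int[v.length];
        for (int k = 0; k < v.length; k++) {
            r[k] = v[steer[k]];
        }
        return r;
    }

    static int[] insert(int[] v, int i, int x) {
        int[] steer = STEER[i];        // table load indexed by non-constant i
        int[] t = permute(v, steer);   // lane i moves into lane 0
        t[0] = x;                      // constant-index insert
        return permute(t, steer);      // swap is self-inverse: lane 0 returns to lane i
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(insert(new int[]{10, 20, 30, 40}, 3, 99)));
    }
}
```

Unlike the branch, this has a single code path; the cost moves into one
table load and two permutes.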

If the number of data-dependent movements is large (more than
tens or hundreds) then we have a new optimization problem, the
dynamic computation of steering vectors.  I'd like to tackle this some
day, preferably using Java code and metafactories (not JIT logic).

To code an indexed set of fixed shuffles, consider using List::of
to construct a static constant that contains the array of desired
shuffles.  If the List is a constant, the JIT can easily "see through"
the List structure to the underlying (stable) array, and load the
steering vector without more than one or two extra indirections.
This would be good for a start.  We could also teach the JIT to
"see" constant/stable arrays of vectors and convert them under
the hood into static constant arrays in the nmethod, which would
get the code quality to rival hand-assembled use of steering
vector tables.
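
A minimal sketch of the List::of idea, again using int[] index arrays as a
stand-in for real shuffle objects (with the Vector API these would be shuffle
values rather than raw arrays). The table below encodes lane rotations for a
4-lane vector; the class name, the choice of rotations, and the apply()
helper are hypothetical. Note that List.of makes the list unmodifiable but
not its int[] elements, so this only models the shape of the lookup, not the
constant-folding the JIT would perform.

```java
import java.util.Arrays;
import java.util.List;

// Scalar model only: an int[4] stands in for a 4-lane vector.
final class ShuffleTable {
    // Static constant list of fixed shuffles; entry i rotates left by i lanes.
    static final List<int[]> ROTATIONS = List.of(
        new int[]{0, 1, 2, 3},
        new int[]{1, 2, 3, 0},
        new int[]{2, 3, 0, 1},
        new int[]{3, 0, 1, 2});

    // Gather-style application: lane k of the result comes from lane steer[k].
    static int[] apply(int[] v, int[] steer) {
        int[] r = new int[v.length];
        for (int k = 0; k < v.length; k++) {
            r[k] = v[steer[k]];
        }
        return r;
    }

    // Variable rotation: one list lookup selects the steering vector.
    static int[] rotateLeft(int[] v, int i) {
        return apply(v, ROTATIONS.get(i));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rotateLeft(new int[]{10, 20, 30, 40}, 1)));
    }
}
```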

— John