[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"
John Rose
john.r.rose at oracle.com
Thu Sep 2 18:33:14 UTC 2021
On Sep 2, 2021, at 9:50 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Hi Joshua,
>
> I think we still have some exploring to do on the design, and for others to comment esp. with regards to C2 capabilities.
>
>
> Here is another design alternative between the spectrum of a partitioning/compressing a shuffle from mask [*] and a compress method:
>
> VectorMask<Integer> mask = ...;
> IntVector bv = av.rearrange(VectorOperator.COMPRESS, mask);
> VectorMask<Integer> prefixMask = prefix(mask.trueCount());
> bv.intoArray(array, offset, prefixMask);
> offset += mask.trueCount();
>
> We introduce a new kind of operator, Rearrange, and constants such as VectorOperator.COMPRESS that specify the behavior of the non-mask and mask-accepting rearrange methods. COMPRESS specifies that:
>
> 1) the non-mask rearrange is an identity operation; and
> 2) the mask accepting rearrange describes mask-based cross-lane movement:
Be careful with adding bulk operations to VectorOperators.
Those are supposed to be only lanewise (scalar). It’s not
clear to me what COMPRESS could mean as a lanewise
operation.
One very interesting potential interaction between lanewise
ops and masked operations is reduction of a subset of lanes.
This feels a little like compress, although the lateral data
motion is absent. I’m thinking that generalized shuffle,
lateral compression/expansion, and segmented scan/reduce
may be the interesting primitives here.
The FIRST_NONZERO operation is the closest thing we have
to lateral motion today: The idea is that non-zero lane values
propagate from higher to lower positions, skipping over zero
lane values. We might try to get from FIRST_NONZERO to
compress by defining an enhanced segmented reduction
operation which takes into account masking (to define
segments that are to be individually reduced), and then
take additional wins from using other VOs such as MAX,
ADD, etc. I’m throwing this out as brainstorming; what
I get from this line of thought is that segmented scans
naturally deliver results in expanded form, while
segmented reductions have two natural forms of
output, expanded and compressed. So we need a
compress primitive, and/or an expand/compress
mode bit for segmented reduction.
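To make that concrete, here is a scalar model (plain Java,
nothing from the real API; the method name and segment-mask
convention are made up) of a segmented reduction delivering
its output in compressed form:

    import java.util.function.IntBinaryOperator;

    // Scalar model, not a proposed API. A set bit in segStart
    // opens a new segment at that lane; each segment reduces
    // under op, and the per-segment results are packed
    // ("compressed") from lane 0 upward.
    static int[] segmentedReduceCompressed(IntBinaryOperator op,
                                           int[] lanes,
                                           boolean[] segStart) {
        int[] out = new int[lanes.length];
        int seg = -1;
        for (int i = 0; i < lanes.length; i++) {
            if (i == 0 || segStart[i])
                out[++seg] = lanes[i];   // open a new segment
            else
                out[seg] = op.applyAsInt(out[seg], lanes[i]);
        }
        return out;  // lanes 0..seg hold results; the rest stay 0
    }

The expanded form would instead leave each segment’s result
in place across that segment’s lanes, which is exactly where
the compress primitive (or the mode bit) comes in.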
(Matters of terminology: scan(op) produces the
vector of partial results that would be obtained
if a reduction(op) operation proceeded linearly
from lane 0 to lane VLENGTH-1, recording
an intermediate result in each lane, often
before the lane’s value is accumulated. The scan
can take a carry-in and carry-out lane value
to allow multiple vectors to compose.
Both reductions and scans can usefully be
segmented, again with carry-in and carry-out.
Segmented operations are useful for fast
group-by modes of operation, and also for
vectorized parsing, such as var-ints or
even text. Compress and expand are independent
primitives, both driven by masks; such masks
can as noted above be related to segmentation
masks.)
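And a matching scalar model of the inclusive segmented scan
just described (same made-up conventions; an exclusive scan
would record the accumulator before folding in each lane’s
value):

    import java.util.function.IntBinaryOperator;

    // Scalar model of an inclusive segmented scan. A set bit
    // in segStart begins a new segment; carryIn seeds lane 0
    // so multiple vectors compose. The final accumulator is
    // the natural carry-out for the next vector.
    static int[] segmentedScan(IntBinaryOperator op, int carryIn,
                               int[] lanes, boolean[] segStart) {
        int[] out = new int[lanes.length];
        int acc = carryIn;
        for (int i = 0; i < lanes.length; i++) {
            acc = segStart[i] ? lanes[i]
                              : op.applyAsInt(acc, lanes[i]);
            out[i] = acc;   // partial result recorded per lane
        }
        return out;
    }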
>
> It should be possible to create a shuffle from a Rearrange operator, with and without a mask, so the equivalent functionality can be applied to a shuffle-accepting rearrange, e.g., for COMPRESS:
>
> rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), mask.prefix())
> Or
> rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), zero())
> // Postfix of exceptional lanes in the shuffle, representing unset lanes
I’m not following this, because I’m having trouble imagining
ops like ADD and MAX here, as well as imagining COMPRESS
as a VO. (I must be missing something crucial here.)
>
> For this to be of benefit we would need to come up with other realistic Rearrange operators, even if we do not add them right now e.g. IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE.
I guess what I’m missing is that this is a whole new kind of op,
not a lanewise one. It probably should live on Shuffle, since
VectorOperators is documented to be about lanewise ops only.
> However, the design is a little awkward since the mask may or may not contribute to cross-lane movement, and so the operator needs to state the equivalence.
>
> In effect the Rearrange operator is a mechanism to refer to certain kinds of shuffle as a constant.
Yes, and you are exploring the space of plausible constants.
I think they (or at least some of them) could also be factorized
as *families* of constants, indexed by masks.
> Ideally I would still prefer if we could implicitly identify what would otherwise be rearrange operators based on the creation of shuffles with known content e.g. can C2 somehow tag a shuffle instance with an ID of COMPRESS with a dependency on the mask used for its creation?
Siting the compress (and expand!) functionality on shuffle instead
of vector is a nice story, but I think it’s better to site it on general
vectors, as distinct from shuffling, because of its relation to
segmented data-parallel operations (which are distinct from
permutations, since they never swap the order of selected
lanes).
So, I see three primitive notions: Segmentation, compress/expand,
and permutation. You can bridge to and from permutation simply
by working with index vectors like iota, and perhaps (as sugar) lifting
selected vector operations to shuffles.
>
> —
>
> FWIW another way to think about a partitioning/compression shuffle:
>
> SPECIES.iota().compress(m);
Yes, that’s the bridging I just mentioned. Some applications
would expand as well as compress. (I know it’s a one-way
street for SVE and maybe others but fear not; expand is
easier to implement than compress, using a plain
permutation.)
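To see why expand is the easier direction, here is a scalar
model (made-up names): the source index for each selected
lane is just the running popcount of the mask, so once that
index vector exists the job is a plain permutation.

    // Scalar model of expand-under-mask: each set mask lane
    // takes the next unconsumed source lane; unset lanes fill
    // with zero here. Note out[i] = src[popcount of mask
    // below i], i.e. a permutation by a computable index.
    static int[] expand(int[] src, boolean[] mask) {
        int[] out = new int[src.length];
        int j = 0;               // next source lane to consume
        for (int i = 0; i < mask.length; i++)
            out[i] = mask[i] ? src[j++] : 0;
        return out;
    }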
For example, if parsing a file of var-ints does
segmentation (based on (lane&0x80)==0x80 masking)
and subsequent compression to 32-bit integer lanes, the
reverse operation of un-parsing would use an expansion
of 32-bit integer lanes, to be reduced to bytes and stored
to the var-int file.
>
> Which is just specific way of shuffling a shuffle. We could actually track the kinds of shuffle as final fields of the VectorShuffle implementation.
Yes.
If a machine lacks a compress and/or expand operation,
we’d want Java paths that would compute appropriate
shuffles and then apply them, with constant folding
when appropriate. (Although masks are usually not
constant.)
Sketch of algorithm to derive a compression shuffle
from a mask:
Take VLENGTH=5 just for example and suppose we
want to compress <4:v, 3:w, 2:x, 1:y, 0:z> with mask
0b01001. The desired output is then <0,0,0,w,z>.
1. For each masked lane (in a fresh index vector)
compute the number of masked lanes at lower
indexes. For example, <2, 1, 1, 1, 0> from 0b01001.
This can be done in log time, and hardware can
sometimes help. The operation naturally yields
popcnt(m) (2 in the example) as a carry-out from
a log-scan. Note that the lowest lane always gets
zero. Variants of this algorithm might use a carry-in
to put something other than zero there. Note also
that the resulting index vector, used to shuffle
the input vector, performs an *expand* which is
the inverse of the compress we are seeking.
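In scalar form (the vector version would be the log-time
scan just mentioned):

    // Step 1, scalar model: idx[i] = number of set mask lanes
    // below lane i. Arrays are lane 0 first, so mask 0b01001
    // gives {0, 1, 1, 1, 2}, printed high-lane-first as
    // <2, 1, 1, 1, 0>. The final count is popcnt(m), the
    // natural carry-out.
    static int[] maskedPrefixCounts(boolean[] mask) {
        int[] idx = new int[mask.length];
        int count = 0;
        for (int i = 0; i < mask.length; i++) {
            idx[i] = count;
            if (mask[i]) count++;
        }
        return idx;
    }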
2. (Optional but useful.) Compute a complementary
second fresh index vector from the inverted mask,
added to the bit-count of the first mask. For example,
<2, 2, 1, 0, 0> from 0b10110 (~0b01001), and then
(adding the bitcount 2) <4, 4, 3, 2, 2>. Note that
this index vector would perform an expand, to
the left-hand (high) side of the vector, of the vector
we are trying to compress.
3. Under the mask, blend the index vector from step 1
with that of step 2, or with a constant vector of value
VLENGTH-1. For example, blending from 0b01001,
either <4, 1, 3, 2, 0> or (lacking step 2) <4, 1, 4, 4, 0>.
4. Create an in-memory buffer of the same size and
shape as the vector to be compressed. Use the blended
index to perform an indexed store from the input
vector to the memory buffer. This uses the hardware
memory store operation, on a temp buffer, to create
the desired compress. (It should be done in the VPU
not in the MU, of course, but sometimes this might
be the best way.)
5. Load the buffer back into a vector register, and
set the unselected lanes to whatever value is desired,
either zero, or (perhaps) a left-packed set of the
non-compressed values. (This is called the SAG
or “sheep and goats” operation, when applied to
a bit-vector, and it is very useful for other vectors
as well.)
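Here are steps 1 through 5 as one scalar model (step 2
included, so the result comes back in SAG form; zeroing the
suffix lanes instead is a one-line change):

    // Steps 1-5, scalar model. The blended index vector
    // (steps 1-3) drives an indexed "scatter" store into a
    // temp buffer (step 4); reading it back (step 5) yields
    // the compressed lanes at the bottom and the left-packed
    // uncompressed lanes above them (the SAG form).
    static int[] compressViaBuffer(int[] src, boolean[] mask) {
        int n = src.length, popcnt = 0;
        for (boolean m : mask) if (m) popcnt++;
        int[] idx = new int[n];
        int t = 0, f = 0;
        for (int i = 0; i < n; i++)
            idx[i] = mask[i] ? t++           // step 1, under mask
                             : popcnt + f++; // steps 2-3 blended
        int[] buf = new int[n];              // step 4: temp buffer
        for (int i = 0; i < n; i++)
            buf[idx[i]] = src[i];            // indexed store
        return buf;                          // step 5: load back
    }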
Step 2 may be desirable even if you are not doing a SAG,
if the indexed store hardware will go slower due
to collisions on the last lane (index 4 in the example;
note that only that index gets repeated).
Note that steps 1 and 2 can be done in superscalar parallel.
If step 2 is done with a from-left scan, there is no need to
add in the correction value, of popcnt(m).
I don’t know a great way to avoid a trip to a memory
buffer, if the hardware does not supply a compress
operation in the VPU. But here’s an almost-great
way:
4b. Having computed the non-colliding permutation
vector E (in steps 2 and 3, as if for SAG), use math to
compute the inverse permutation vector C. E is an
expansion as noted above while its inverse C is the
corresponding compression. In the running example,
E is <4, 1, 3, 2, 0> and its inverse C will be <4, 2, 1, 3, 0>.
5b. Use the standard shuffle (permutation) operation
to reorder the input vector. Adjust uncompressed
fields appropriately if SAG is not desired (zero them
using the mask or whatever).
4b. Math details: Sort the merged E so as to transform
it into the iota vector, while performing *exactly the
same* swaps and reorderings on an iota vector. As
E sorts to iota, the other copy of iota reorders to C.
Perhaps the easiest way to do this is to create an index
vector A with double-width lanes, and pack each lane
A[i] with (E[i] * VLENGTH + i). (The “+i” is “+iota[i]”.)
The double-width lanes of A are “packets” to be “routed”
to the “addresses” in the high half of the lane, and the
“payload” of each packet is the eventual index in C.
A butterfly network, aka binary radix sort, will do the job
in log time.
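As a scalar model, with Arrays.sort standing in for the
log-time network:

    import java.util.Arrays;

    // 4b, scalar model: pack each lane as (E[i] * VLENGTH + i),
    // sort by the "address" in the high part, and the payloads
    // then spell out the inverse permutation C. For
    // E = {0, 2, 3, 1, 4} (lane 0 first, i.e. <4, 1, 3, 2, 0>)
    // this yields C = {0, 3, 1, 2, 4}, i.e. <4, 2, 1, 3, 0>.
    static int[] invertPermutation(int[] e) {
        int n = e.length;
        long[] packets = new long[n];
        for (int i = 0; i < n; i++)
            packets[i] = (long) e[i] * n + i; // address:payload
        Arrays.sort(packets);                 // "route" packets
        int[] c = new int[n];
        for (int i = 0; i < n; i++)
            c[i] = (int) (packets[i] % n);    // strip the header
        return c;
    }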
(Hey hardware vendors, wouldn’t it be nice to supply
us users with a keyed sort operation, which takes a
vector of “addresses” and a different vector of “values”
and reorders the “values” according to the sorted
order of the “addresses”? That is the proper inverse
to shuffle/permute primitives. You’d build a keyed
routing network in silicon much like the ones that
implement shuffle/permute.)
5c. Step 5b can be merged into 4b in some cases,
if the original input vector lanes are small enough to
fit into the “packets” in the A vector being sorted in
4b. In that case, there’s no need to finish with a
shuffle/permutation, since the actual data has already
been moved where it belongs. All you need to do
is strip off the “address headers” of the “packets”,
by means of a reshaping truncation.
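A scalar model of that merged form, assuming int data lanes
that fit in the payload half of a long packet:

    // 5c, scalar model: sort (address, data) packets directly,
    // so the data lands in compressed (SAG) order with no final
    // shuffle; the cast that strips the address header plays
    // the role of the reshaping truncation.
    static int[] compressBySort(int[] src, int[] e) {
        int n = src.length;
        long[] packets = new long[n];
        for (int i = 0; i < n; i++)
            packets[i] = ((long) e[i] << 32)
                       | (src[i] & 0xFFFFFFFFL);
        java.util.Arrays.sort(packets);
        int[] out = new int[n];
        for (int i = 0; i < n; i++)
            out[i] = (int) packets[i];  // keep only the payload
        return out;
    }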
This shows that a possible primitive is the anti-shuffle.
(Anti-permutation.) Anti-shuffle is to shuffle as
store is to load. Store is harder than load because
where load has duplications, store has collisions.
Collisions probably want to be handled with
reduction operations. (This takes me back to
my Connection Machine and C* days… The
C* operation p->n += x causes each processor’s
x value to be applied as an increment to the n
value on each friend processor p, where p is
basically a pointer as in C. The operations
take place in an indeterminate order but
all x values are summed into each distinct n
variable. Also, “int y = p->n += x” captures
partial sums, so it’s a scan. It’s all “as if”
executed sequentially, but in fact executes
in log time, like vector reductions and
permutations.) An anti-shuffle is really
a 1-1 permutation (to the expanded “collision
buckets” in order), followed by a segmented
reduction, followed by a compress. So
(this is brainstorming!) the anti-shuffle
is really another way to dress up the other
primitives I’m discussing.
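For what it’s worth, a scalar model of the collision handling
in that decomposition, with ADD as the combining reduction
(just like the C* p->n += x):

    // Scalar model of an "anti-shuffle": an indexed store where
    // colliding lanes combine under a reduction (here +). Each
    // src lane i routes to addr[i]; the sums are independent of
    // arrival order, as if executed sequentially but in log time.
    static int[] antiShuffleAdd(int[] src, int[] addr, int n) {
        int[] out = new int[n];       // zero: identity for ADD
        for (int i = 0; i < src.length; i++)
            out[addr[i]] += src[i];
        return out;
    }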
Ideas of the morning; HTH… :-)