[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

Paul Sandoz paul.sandoz at oracle.com
Fri Sep 3 20:11:37 UTC 2021


Thanks John, and Sandhya for also commenting.

You both rightly pointed out the weakness of using operators and rearrange :-) it does not fit right.

John, your observation on order really stood out to me. I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

In summary, from the discussion, compress/expand are:

- important conceptually, even if the same functionality could be composed from shuffles (such as used by an implementation); and

- at the right level to reliably optimize on supporting hardware.

So specification-wise we introduce expanding/compressing cross-lane operations. API-wise I prefer two distinct methods rather than one that accepts a boolean indicating expansion or compression. We can declare one intrinsic method in VectorSupport.
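
For illustration, a minimal sketch of what the two distinct methods might look like (the names compress/expand and their placement on Vector are assumptions of this sketch, not settled API):

VectorSpecies<Integer> species = IntVector.SPECIES_256;
int[] input = new int[species.length()];
VectorMask<Integer> mask = VectorMask.fromLong(species, 0b01001);
IntVector av = IntVector.fromArray(species, input, 0);
IntVector compressed = av.compress(mask);      // selected lanes packed to the low end, rest zero
IntVector expanded = compressed.expand(mask);  // inverse: low lanes scattered back to the selected positions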

Paul.

> On Sep 2, 2021, at 11:33 AM, John Rose <john.r.rose at oracle.com> wrote:
> 
> On Sep 2, 2021, at 9:50 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>> 
>> Hi Joshua,
>> 
>> I think we still have some exploring to do on the design, and for others to comment, especially with regard to C2 capabilities.
>> 
>> 
>> Here is another design alternative on the spectrum between a partitioning/compressing shuffle created from a mask [*] and a compress method:
>> 
>> VectorMask<Integer> mask = ...;                              // lanes to select
>> IntVector bv = av.rearrange(VectorOperator.COMPRESS, mask);  // pack selected lanes to the low end
>> VectorMask<Integer> prefixMask = prefix(mask.trueCount());   // hypothetical helper: mask of the first n lanes
>> bv.intoArray(array, offset, prefixMask);                     // store only the packed lanes
>> offset += mask.trueCount();                                  // advance by the number of selected lanes
>> 
>> We introduce a new kind of operator, Rearrange, and constants, such as VectorOperator.COMPRESS, that specify the behavior of the non-mask and mask-accepting rearrange methods. COMPRESS specifies that:
>> 
>> 1) the non-mask rearrange is an identity operation; and
>> 2) the mask-accepting rearrange describes mask-based cross-lane movement:
> 
> Be careful with adding bulk operations to VectorOperators.
> Those are supposed to be only lanewise (scalar).  It’s not
> clear to me what COMPRESS could mean, as a lanewise
> operation.
> 
> One very interesting potential interaction between lanewise
> ops and masked operations is reduction of a subset of lanes.
> This feels a little like compress, although the lateral data
> motion is absent.  I’m thinking that generalized shuffle,
> lateral compression/expansion, and segmented scan/reduce
> may be the interesting primitives here.
> 
> The FIRST_NONZERO operation is the closest thing we have
> to lateral motion today:  The idea is that non-zero lane values
> propagate from higher to lower positions, skipping over zero
> lane values.  We might try to get from FIRST_NONZERO to
> compress by defining an enhanced segmented reduction
> operation which takes into account masking (to define
> segments that are to be individually reduced), and then
> take additional wins from using other VOs such as MAX,
> ADD, etc.  I’m throwing this out as brainstorming; what
> I get from this line of thought is that segmented scans
> naturally deliver results in expanded form, while
> segmented reductions have two natural forms of
> output, expanded and compressed.  So we need a
> compress primitive, and/or an expand/compress
> mode bit for segmented reduction.
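> 
> For concreteness, a small example of FIRST_NONZERO as it
> exists today (current API; the species choice is arbitrary):
> 
> int[] data = {0, 0, 7, 0, 3, 0, 0, 0};
> IntVector v = IntVector.fromArray(IntVector.SPECIES_256, data, 0);
> // non-zero values "win" from lower lane positions: result is 7
> int first = v.reduceLanes(VectorOperators.FIRST_NONZERO);
> 
> A segmented variant would restart that rule at each
> mask-defined segment boundary.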
> 
> (Matters of terminology:  scan(op) produces the
> vector of partial results that would be obtained
> if a reduction(op) operation proceeded linearly
> from lane 0 to lane VLENGTH-1, recording
> an intermediate result in each lane, often
> before the lane's value is accumulated.  The scan
> can take a carry-in and carry-out lane value
> to allow multiple vectors to compose.
> Both reductions and scans can usefully
> be segmented, again with carry-in and carry-out.
> Segmented operations are useful for fast
> group-by modes of operation, and also for
> vectorized parsing, such as var-ints or
> even text.  Compress and expand are independent
> primitives, both driven by masks; such masks
> can, as noted above, be related to segmentation
> masks.)
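> 
> A scalar sketch to pin down that terminology (nothing here
> is Vector API, and the exclusive flavor of the scan is an
> assumption of the sketch):
> 
> // Exclusive scan with ADD: out[i] = carryIn + in[0] + ... + in[i-1];
> // the return value is the carry-out, so multiple vectors compose.
> static int exclusiveScanAdd(int[] in, int[] out, int carryIn) {
>     int acc = carryIn;
>     for (int i = 0; i < in.length; i++) {
>         out[i] = acc;   // recorded before this lane's value is accumulated
>         acc += in[i];
>     }
>     return acc;         // carry-out
> }
> 
> A segmented variant would reset acc to the carry-in at each
> mask-defined segment boundary.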
> 
>> 
>> It should be possible to create a shuffle from a Rearrange operator, with and without a mask so the equivalent functionality can be applied to a shuffle accepting rearrange e.g, for COMPRESS:
>> 
>> rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), mask.prefix())
>> or
>> rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), zero())
>> // Postfix of exceptional lanes in the shuffle, representing unset lanes 
> 
> I’m not following this, because I’m having trouble imagining
> ops like ADD and MAX here, as well as imagining COMPRESS
> as a VO.  (I must be missing something crucial here.)
> 
>> 
>> For this to be of benefit we would need to come up with other realistic Rearrange operators, even if we do not add them right now, e.g. IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE.
> 
> I guess what I’m missing is this is a whole new kind of OP,
> not a lanewise one.  Probably should live on Shuffle, since
> VOs is documented to be about lanewise ops only.
> 
>> However, the design is a little awkward since the mask may or may not contribute to cross-lane movement, and so the operator needs to state the equivalence.
>> 
>> In effect the Rearrange operator is a mechanism to refer to certain kinds of shuffle as a constant.
> 
> Yes, and you are exploring the space of plausible constants.
> I think they (or at least some of them) could also be factorized
> as *families* of constants, indexed by masks.
> 
>> Ideally I would still prefer if we could implicitly identify what would otherwise be rearrange operators based on the creation of shuffles with known content e.g. can C2 somehow tag a shuffle instance with an ID of COMPRESS with a dependency on the mask used for its creation?
> 
> Siting the compress (and expand!) functionality on shuffle instead
> of vector is a nice story, but I think it’s better to site it on general
> vectors, as distinct from shuffling, because of its relation to
> segmented data-parallel operations (which are distinct from
> permutations, since they never swap the order of selected
> lanes).
> 
> So, I see three primitive notions:  Segmentation, compress/expand,
> and permutation.  You can bridge to and from permutation simply
> by working with index vectors like iota, and perhaps (as sugar) lifting
> selected vector operations to shuffles.
> 
>> 
>> FWIW another way to think about a partitioning/compression shuffle:
>> 
>> SPECIES.iota().compress(m);
> 
> Yes, that’s the bridging I just mentioned.  Some applications
> would expand as well as compress.  (I know it’s a one-way
> street for SVE and maybe others but fear not; expand is
> easier to implement than compress, using a plain
> permutation.)
> 
> For example, if parsing a file of var-ints does
> segmentation (based on (lane&0x80)=0x80 masking)
> and subsequent compression to 32-bit integer lanes, the
> reverse operation of un-parsing would use an expansion
> of 32-bit integer lanes, to be reduced to bytes and stored
> to the var-int file.
> 
>> 
>> Which is just a specific way of shuffling a shuffle. We could actually track the kinds of shuffle as final fields of the VectorShuffle implementation.
> 
> Yes.
> 
> If a machine lacks a compress and/or expand operation,
> we’d want Java paths that would compute appropriate
> shuffles and then apply them, with constant folding
> when appropriate.  (Although masks are usually not
> constant.)
> 
> Sketch of algorithm to derive a compression shuffle
> from a mask:
> 
> Take VLENGTH=5 just for example and suppose we
> want to compress <4:v, 3:w, 2:x, 1:y, 0:z> with mask
> 0b01001.  The desired output is then <0,0,0,w,z>.
> 
> 1.   For each masked lane (in a fresh index vector)
> compute the number of masked lanes at lower
> indexes.  For example, <2, 1, 1, 1, 0> from 0b01001.
> This can be done in log time, and hardware can
> sometimes help.  The operation naturally yields
> popcnt(m) (2 in the example) as a carry-out from
> a log-scan.  Note that the lowest lane always gets
> zero.  Variants of this algorithm might use a carry-in
> to put something other than zero there.  Note also
> that the resulting index vector, used to shuffle
> the input vector, performs an *expand* which is
> the inverse of the compress we are seeking.
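> 
> A scalar sketch of step 1 (illustrative only; bit i of the
> mask corresponds to lane i):
> 
> // Exclusive popcount scan over the mask bits.  For mask 0b01001
> // this yields {0, 1, 1, 1, 2} (lane 0 first), i.e. <2, 1, 1, 1, 0>
> // written high-to-low, with carry-out popcnt(m) = 2.
> static int[] exclusiveMaskCounts(long mask, int vlength) {
>     int[] idx = new int[vlength];
>     int count = 0;
>     for (int i = 0; i < vlength; i++) {
>         idx[i] = count;                   // masked lanes below lane i
>         count += (int) ((mask >>> i) & 1);
>     }
>     return idx;
> }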
> 
> 2. (Optional but useful.)  Compute a complementary
> second fresh index vector from the inverted mask,
> added to the bit-count of the first mask.  For example,
> <2, 2, 1, 0, 0> from 0b10110 (~0b01001), and then
> (adding the bitcount 2) <4, 4, 3, 2, 2>.  Note that
> this index vector would perform an expand, to
> the left-hand (high) side of the vector, of the vector
> we are trying to compress.
> 
> 3. Under the mask, blend the index vector from step 1
> with that of step 2, or with a constant vector of value
> VLENGTH-1.  For example, blending from 0b01001,
> either <4, 1, 3, 2, 0> or (lacking step 2) <4, 1, 4, 4, 0>.
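> 
> Putting steps 1-3 together in scalar form (illustrative only;
> exclusiveMaskCounts is the step-1 sketch above, and a real
> implementation would use log-time scans in the VPU):
> 
> // Masked lanes take their step-1 counts (their compressed
> // positions); unmasked lanes take their step-2 counts offset by
> // popcnt(m).  For mask 0b01001 this yields {0, 2, 3, 1, 4}
> // (lane 0 first), i.e. <4, 1, 3, 2, 0> written high-to-low.
> static int[] blendedIndex(long mask, int vlength) {
>     int[] selected = exclusiveMaskCounts(mask, vlength);    // step 1
>     int[] rest = exclusiveMaskCounts(~mask, vlength);       // step 2
>     int pop = Long.bitCount(mask & ((1L << vlength) - 1));
>     int[] e = new int[vlength];
>     for (int i = 0; i < vlength; i++) {
>         boolean set = ((mask >>> i) & 1) != 0;
>         e[i] = set ? selected[i] : rest[i] + pop;           // step 3
>     }
>     return e;
> }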
> 
> 4. Create an in-memory buffer of the same size and
> shape as the vector to be compressed.  Use the blended
> index to perform an indexed store from the input
> vector to the memory buffer.  This uses the hardware
> memory store operation, on a temp buffer, to create
> the desired compress.  (It should be done in the VPU
> not in the MU, of course, but sometimes this might
> be the best way.)
> 
> 5. Load the buffer back into a vector register, and
> set the unselected lanes to whatever value is desired,
> either zero, or (perhaps) a left-packed set of the
> non-compressed values.  (This is called the SAG
> or “sheep and goats” operation, when applied to
> a bit-vector, and it is very useful for other vectors
> as well.)
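> 
> A sketch of steps 4 and 5 with the indexed store (scatter)
> and load operations that exist today, assuming SPECIES, av
> and mask as in the snippets earlier in the thread, with a
> Java array standing in for the temp buffer:
> 
> int[] e = blendedIndex(mask.toLong(), SPECIES.length());
> int[] buffer = new int[SPECIES.length()];
> av.intoArray(buffer, 0, e, 0);               // step 4: scatter through E
> IntVector packed = IntVector.fromArray(SPECIES, buffer, 0);
> VectorMask<Integer> prefix =                 // first trueCount() lanes
>     VectorMask.fromLong(SPECIES, (1L << mask.trueCount()) - 1);
> // step 5: zero the unselected lanes (or keep them for SAG)
> IntVector compressed = IntVector.zero(SPECIES).blend(packed, prefix);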
> 
> Step 2 may be desirable even if you are not doing a SAG,
> if the indexed store hardware will go slower due
> to collisions on the last lane (index 4 in the example;
> note that only that index gets repeated).
> 
> Note that steps 1 and 2 can be done in superscalar parallel.
> If step 2 is done with a from-left scan, there is no need to
> add in the correction value of popcnt(m).
> 
> I don’t know a great way to avoid a trip to a memory
> buffer, if the hardware does not supply a compress
> operation in the VPU.  But here’s an almost-great
> way:
> 
> 4b.  Having computed the non-colliding permutation
> vector E (in steps 2 and 3, as if for SAG), use math to
> compute the inverse permutation vector C.  E is an
> expansion as noted above while its inverse C is the
> corresponding compression.  In the running example,
> E is <4, 1, 3, 2, 0> and its inverse C will be <4, 2, 1, 3, 0>.
> 
> 5b. Use the standard shuffle (permutation) operation
> to reorder the input vector.  Adjust uncompressed
> fields appropriately if SAG is not desired (zero them
> using the mask or whatever).
> 
> 4b.  Math details:  Sort the merged E so as to transform
> it into the iota vector, while performing *exactly the
> same* swaps and reorderings on an iota vector.  As
> E sorts to iota, the other copy of iota reorders to C.
> Perhaps the easiest way to do this is to create an index
> vector A with double-width lanes, and pack each lane
> A[i] with (E[i] * VLENGTH + i).  (The “+i” is “+iota[i]”.)
> The double-width lanes of A are “packets” to be “routed”
> to the “addresses” in the high half of the lane, and the
> “payload” of each packet is the eventual index in C.
> A butterfly network, aka binary radix sort, will do the job
> in log time.
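> 
> A scalar rendering of those details (illustrative only; a
> plain sort stands in for the log-time butterfly network):
> 
> // Pack each lane as (E[i] * VLENGTH + i): the high part is the
> // routing "address", the low part the payload, i.e. the eventual
> // index in C.  For E = {0, 2, 3, 1, 4} (<4, 1, 3, 2, 0> high-to-low)
> // this yields C = {0, 3, 1, 2, 4}, i.e. <4, 2, 1, 3, 0> high-to-low.
> static int[] invertPermutation(int[] e) {
>     int n = e.length;
>     long[] packets = new long[n];
>     for (int i = 0; i < n; i++) {
>         packets[i] = (long) e[i] * n + i;   // address high, payload low
>     }
>     java.util.Arrays.sort(packets);         // routes packets by address
>     int[] c = new int[n];
>     for (int i = 0; i < n; i++) {
>         c[i] = (int) (packets[i] % n);      // strip the address header
>     }
>     return c;
> }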
> 
> (Hey hardware vendors, wouldn’t it be nice to supply
> us users with a keyed sort operation, which takes a
> vector of “addresses” and a different vector of “values”
> and reorders the “values” according to the sorted
> order of the “addresses”?  That is the proper inverse
> to shuffle/permute primitives.  You’d build a keyed
> routing network in silicon much like the ones that
> implement shuffle/permute.)
> 
> 5c.  Step 5b can be merged into 4b in some cases,
> if the original input vector lanes are small enough to
> fit into the “packets” in the A vector being sorted in
> 4b.  In that case, there’s no need to finish with a
> shuffle/permutation, since the actual data has already
> been moved where it belongs.  All you need to do
> is strip off the “address headers” of the “packets”,
> by means of a reshaping truncation.
> 
> This shows that a possible primitive is the anti-shuffle.
> (Anti-permutation.)  Anti-shuffle is to shuffle as
> store is to load.  Store is harder than load because
> where load has duplications store has collisions.
> Collisions probably want to be handled with
> reduction operations.  (This takes me back to
> my Connection Machine and C* days…  The
> C* operation p->n += x causes each processor’s
> x value to be applied as an increment to the n
> value on each friend processor p, where p is
> basically a pointer as in C.  The operations
> take place in an indeterminate order but
> all x values are summed into each distinct n
> variable.  Also, “int y = p->n += x” captures
> partial sums, so it’s a scan.  It’s all “as if”
> executed sequentially, but in fact executes
> in log time, like vector reductions and
> permutations.)  An anti-shuffle is really
> a 1-1 permutation (to the expanded “collision
> buckets” in order), followed by a segmented
> reduction, followed by a compress.  So
> (this is brainstorming!) the anti-shuffle
> is really another way to dress up the other
> primitives I’m discussing.
> 
> Ideas of the morning; HTH… :-)


