[vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods
Paul Sandoz
paul.sandoz at oracle.com
Tue Oct 5 20:01:54 UTC 2021
Hi John,
I agree with your assessment on the vector accepting methods.
After I wrote my brief comment on the PR to get things rolling I was thinking about blends and mask compression along similar lines.
IMO we should add benchmarks for the important use cases you mention around the loops. That should help focus us on the API points and intrinsics. I think it’s a good way to move forward and build our understanding toward an API and implementation.
Paul.
> On Oct 4, 2021, at 6:17 PM, John Rose <john.r.rose at oracle.com> wrote:
>
> On Oct 4, 2021, at 3:59 PM, Sandhya Viswanathan <sviswanathan at openjdk.java.net<mailto:sviswanathan at openjdk.java.net>> wrote:
>
> With current definition, from the intrinsic perspective the compress(m, v) can be directly translated to the vcompressps/vcompresspd instruction.
> Also the following holds true:
> v == v.compress(m).expand(m, v)
> v == v.expand(m).compress(m, v)
>
> Those are cute identities but they are not very useful
> in practice. If you want to recover v from its compressed
> image (v.compress(m)), it’s not very helpful to say
> that you can recover it fully, but first you need a
> copy of v itself for the second argument to expand.
>
> So those identities are kind of expository, but they
> are not evidence of useful processing of the second
> argument. I think the corresponding identities,
> without the second v argument, would be more
> illuminating, less magical, more useful:
>
> v == v.blend( v.compress(m).expand(m), m )
> v == v.blend( v.expand(m).compress(m), m.compress() )
>
> I think “m.compress()” would be a worthy addition
> to the API, to support working with compression
> and expansion. It amounts to:
>
> m.compress() = m.vectorSpecies().indexInRange(0, m.trueCount())
>
> (There is no m.expand(), since m.expand() is just m.)
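[Editor's note: a scalar sketch of what such a VectorMask::compress would compute, modeling a mask as a boolean[]. The class and method names are illustrative, not part of the current Vector API.]

```java
// Scalar model of the proposed VectorMask::compress: the result has
// m.trueCount() leading true lanes, i.e. indexInRange(0, trueCount).
// Names here are illustrative; no such method exists in the API yet.
public class MaskCompressSketch {
    static boolean[] compressMask(boolean[] m) {
        int trueCount = 0;
        for (boolean b : m) if (b) trueCount++;
        boolean[] out = new boolean[m.length];
        for (int i = 0; i < trueCount; i++) out[i] = true;
        return out;
    }

    public static void main(String[] args) {
        boolean[] r = compressMask(new boolean[] { true, false, true, false });
        // the two true lanes move to the front
        if (!java.util.Arrays.equals(r, new boolean[] { true, true, false, false }))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```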
>
> Surely the JIT does not need the extra overloadings
> of compress and expand in order to select the
> extra argument for vcompress and vexpand.
> If the blend is present in the idiom, then the
> special instruction can be selected. (We’d
> want an intrinsic for VectorMask::compress
> in that case.) Otherwise, the zero-filling
> version of the instruction can be selected.
>
> All this is to say, after looking at the proposal,
> I’m afraid the extra v argument does not pull
> its weight.
>
> To be more constructive, let me suggest what
> might pull more weight in this area. I’m starting
> from the assumption that typical uses of
> expand and compress will not be isolated,
> but will help implement conditional features
> of loops. So compress is a loop over lanes
> conditionally storing into contiguous output,
> and expand is a loop over lanes conditionally
> loading from contiguous input.
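[Editor's note: a concrete scalar model of those two lane loops, as a sketch of the zero-filling semantics rather than the actual intrinsic.]

```java
// Scalar model of lanewise compress (pack selected lanes to the front,
// zero-fill the rest) and expand (scatter leading lanes back out to the
// selected positions, zero-fill the rest). A sketch, not the real API.
public class CompressExpandModel {
    static int[] compress(int[] v, boolean[] m) {
        int[] z = new int[v.length];
        int j = 0;
        for (int i = 0; i < v.length; i++)
            if (m[i]) z[j++] = v[i];      // unselected lanes stay zero
        return z;
    }

    static int[] expand(int[] z, boolean[] m) {
        int[] v = new int[z.length];
        int j = 0;
        for (int i = 0; i < z.length; i++)
            if (m[i]) v[i] = z[j++];      // unselected lanes stay zero
        return v;
    }

    public static void main(String[] args) {
        boolean[] m = { true, false, true, false };
        int[] z = compress(new int[] { 1, 2, 3, 4 }, m);
        if (!java.util.Arrays.equals(z, new int[] { 1, 3, 0, 0 }))
            throw new AssertionError();
        if (!java.util.Arrays.equals(expand(z, m), new int[] { 1, 0, 3, 0 }))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```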
>
> To me this immediately suggests that the best
> “optional argument” for these guys would be
> some hook to resume a compression or expansion
> activity started in a previous vectorized operation.
>
> Here are compressing and expanding loops, where
> the result of each iteration is stored to or loaded
> from memory without other loop-carried values:
>
> // compress from a[*] to z[*]
> int ai = 0, zi = 0;
> while (ai < a.length) {
>     vector as = fromArray(SPECIES, a, ai);
>     mask m = interestingBits(as, …);
>     vector zs = as.compress(m);
>     zs.intoArray(z, zi, m.compress());
>     ai += SPECIES.length();
>     zi += m.trueCount();
> }
>
> // expand from z[*] updating a[*]
> int ai = 0, zi = 0;
> while (ai < a.length) {
>     vector as = fromArray(SPECIES, a, ai);
>     mask m = interestingBits(as, …);
>     vector zs = fromArray(SPECIES, z, zi, m.compress());
>     as = as.blend(zs.expand(), m);
>     as.intoArray(a, ai);
>     ai += SPECIES.length();
>     zi += m.trueCount();
> }
>
> Both loops make integral use of m.trueCount()
> and m.compress().
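[Editor's note: for reference, the scalar computations those two vector loops implement. The `interesting` predicate is a hypothetical stand-in for whatever interestingBits computes.]

```java
// Scalar equivalents of the two vector loops above. The predicate
// `interesting` is an illustrative stand-in for interestingBits.
public class ScalarLoops {
    static boolean interesting(int x) { return x > 0; }  // assumption

    // compress from a[*] to z[*]; returns the number of elements written
    static int compressLoop(int[] a, int[] z) {
        int zi = 0;
        for (int ai = 0; ai < a.length; ai++)
            if (interesting(a[ai])) z[zi++] = a[ai];
        return zi;
    }

    // expand from z[*] updating a[*] at the interesting positions
    static void expandLoop(int[] a, int[] z) {
        int zi = 0;
        for (int ai = 0; ai < a.length; ai++)
            if (interesting(a[ai])) a[ai] = z[zi++];
    }

    public static void main(String[] args) {
        int[] a = { 1, -2, 3, -4 };
        int[] z = new int[4];
        int n = compressLoop(a, z);               // z = {1, 3, 0, 0}, n = 2
        if (n != 2 || z[0] != 1 || z[1] != 3) throw new AssertionError();

        int[] b = { 5, -6, 7, -8 };
        expandLoop(b, new int[] { 10, 20 });      // b = {10, -6, 20, -8}
        if (b[0] != 10 || b[2] != 20 || b[1] != -6) throw new AssertionError();
        System.out.println("ok");
    }
}
```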
>
> The above exercise suggests that, if we have enough
> “hooks” to do unaligned memory loads and stores,
> with mask-dependent strides, we can do without
> any extra overloadings of compress and expand.
>
> And those “hooks” appear to be:
>
> - masked load and store, where the masks are contiguous
> - a way to compute a contiguous mask by “compressing” an arbitrary one
> - a way to stride past the result of storing such a contiguous partial (+=trueCount)
>
> That’s enough for now to think about, but I am also thinking
> there might be something useful in either of two other
> directions: loop-carried buffers, and non-lossy compression
> and expansion operations.
>
> In brief, a loop-carried buffer is a vector register which
> holds a partial vector worth of “results so far”. Each
> iteration packs more “stuff” into the buffer, increasing
> the “results so far”. When there’s a full vector’s worth
> of results, those get pushed to memory (no masking)
> and the “results so far” are edited down. This means
> there’s a loop-carried buffer vector, plus an index
> and/or mask which shows what’s there. Also, the
> compress and expand operations have to be able
> to take an extra input *and* output to take
> proper account of the “results so far”. The Vector
> API has a pattern for multiple outputs: it’s the
> “part number” pattern.
>
> All that is a much more complicated code shape than
> the memory-based loops above, but it should unroll
> much better, I think, to operations which stay inside
> the VPU for longer.
>
> I think a good way to flesh out more details of this
> kind of loop-carried compress or expand loop would
> be to think of the loop state as consisting of a pair
> of vectors and pair of masks. When you compress
> or expand state in the current iteration, you also
> include, as a symmetrical input, the “results so
> far”. For compressing, the “results so far” would
> be a vector and mask, already compressed, and
> you would jointly compress *that*, plus the next
> vector (in the current iteration), with *its* mask,
> to yield a double-sized compressed vector pair.
> If half or more of it is filled, you store that and
> discard, adjusting masks. In any case, you have
> your new “results so far”. I think there is a
> corresponding structure for the other loop,
> obtained by judicious arrow reversal.
>
> If you think I’m on the right track there, then
> you’ll have realized that re-compressing the
> “results so far” is a waste of energy. So the
> real buffered compression operation we want
> is something like this:
>
> // compress from a[*] to z[*]
> vector buf;
> int buflen;
> int ai = 0, zi = 0;
> while (ai < a.length) {
>     vector as = fromArray(SPECIES, a, ai);
>     mask m = interestingBits(as, …);
>     vector zs;
>     (zs, buf, buflen) = as.compressOnTopOfBuf(m, buf, buflen);
>     if (buflen >= SPECIES.length()) {
>         buf.intoArray(z, zi);
>         zi += SPECIES.length();
>         buf = zs;
>         buflen -= SPECIES.length();
>     }
>     ai += SPECIES.length();
> }
>
> Here the double-barreled compression operation is:
>
> define as.compressOnTopOfBuf(m, buf, buflen) {
>     vector zs = as.compress(m);
>     (zs, buf, buflen) = (
>         zs.unslice(buflen, buf, 1),
>         zs.unslice(buflen, buf, 0),
>         buflen + m.trueCount()
>     );
>     return (zs, buf, buflen);
> }
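[Editor's note: a scalar model of that double-barreled operation, returning the result triple as arrays. The multi-result encoding and all names are illustrative assumptions.]

```java
import java.util.Arrays;

// Scalar model of compressOnTopOfBuf: append the selected lanes of `as`
// behind the buflen valid elements of `buf`; the first VLEN results become
// the new buf, the overflow becomes zs. A sketch, not a real API.
public class BufferedCompress {
    // returns { zs, newBuf, { newBuflen } }
    static int[][] compressOnTopOfBuf(int[] as, boolean[] m, int[] buf, int buflen) {
        int vlen = as.length;
        int[] concat = new int[2 * vlen];
        System.arraycopy(buf, 0, concat, 0, buflen);
        int j = buflen;
        for (int i = 0; i < vlen; i++)
            if (m[i]) concat[j++] = as[i];
        return new int[][] {
            Arrays.copyOfRange(concat, vlen, 2 * vlen),  // zs (overflow)
            Arrays.copyOfRange(concat, 0, vlen),         // new buf
            { j }                                        // new buflen
        };
    }

    public static void main(String[] args) {
        int[][] r = compressOnTopOfBuf(
            new int[] { 1, 2, 3, 4 },
            new boolean[] { true, true, false, true },
            new int[] { 9, 8, 0, 0 }, 2);
        if (!Arrays.equals(r[1], new int[] { 9, 8, 1, 2 })) throw new AssertionError();
        if (!Arrays.equals(r[0], new int[] { 4, 0, 0, 0 })) throw new AssertionError();
        if (r[2][0] != 5) throw new AssertionError();
        System.out.println("ok");
    }
}
```

In the loop above, buflen reaching 5 (>= VLEN of 4) would trigger the store of the full new buf and promote the overflow zs to be the next buf.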
>
> I think there’s a similar double-barreled expanding
> operation.
>
> Another direction of design is non-lossy compression.
> By that I mean that we re-imagine compression and
> expansion as proper inverses of each other. The
> mask does not edit out information in a vector
> being compressed or expanded, but rather
> rearranges the information, so that there never
> are zero lanes that get injected: Every bit of
> the input is present *somewhere* in the output.
>
> That, of course, is just the SAG (sheep and goats)
> operation of Hacker’s Delight, applied lanewise.
> It is also the lanewise version of the x86 instructions
> PDEP and PEXT, which operate (like HD SAG)
> on lanes of one bit in broadwords.
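[Editor's note: the broadword forms are easy to model with a bit loop; since Java 19 the JDK also exposes them directly as Long.compress (PEXT) and Long.expand (PDEP).]

```java
// Broadword PEXT (compress) and PDEP (expand) on one-bit lanes, as in
// Hacker's Delight; Java 19+ offers Long.compress/Long.expand intrinsics.
public class Broadword {
    static long pext(long x, long mask) {        // gather masked bits low
        long r = 0;
        for (int i = 0, j = 0; i < 64; i++)
            if (((mask >>> i) & 1) != 0)
                r |= ((x >>> i) & 1L) << j++;
        return r;
    }

    static long pdep(long x, long mask) {        // scatter low bits to mask
        long r = 0;
        for (int i = 0, j = 0; i < 64; i++)
            if (((mask >>> i) & 1) != 0)
                r |= ((x >>> j++) & 1L) << i;
        return r;
    }

    public static void main(String[] args) {
        if (pext(0b1011L, 0b1010L) != 0b11L) throw new AssertionError();
        if (pdep(0b11L, 0b1010L) != 0b1010L) throw new AssertionError();
        System.out.println("ok");
    }
}
```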
>
> Since vector hardware does not have lanewise SAG
> or its inverse, I think a balanced design that supports
> non-lossy compress and expand could be composed
> of these methods:
>
> z = a.compress(m); //normal right-moving compress
> y = b.compressFlip(m); //left-moving compress, on ~m
> a = z.expand(m); //normal left-moving expand
> b = y.expandFlip(m); //right-moving expand, on ~m
>
> These can be implemented efficiently with today’s
> hardware. The “Flip” versions need a cross-lane
> shift on the bitcount of the mask.
>
> compressFlip is inverse to expandFlip in exactly
> the same sense that compress is inverse to expand.
>
> The non-lossy operations built on top would look
> like this:
>
> v.compressAll(m) := v.compress(m) |^| v.compressFlip(m)
> z.expandAll(m) := z.expand(m) |^| z.expandFlip(m)
>
> It is also useful to derive a shuffle from either expandAll
> or compressAll. This can be done by forming an iota
> (index) vector and compressing or expanding the index
> values, then “cooking” the resulting permutation vector
> into a shuffle.
>
> Here are the very simple identities that these non-lossy
> operations have:
>
> z == z.expandAll(m).compressAll(m)
> a == a.compressAll(m).expandAll(m)
>
> (BTW the symbol |^| is a bitwise exclusive OR where it is
> asserted that at most one bit input is true; it is therefore
> also a bitwise inclusive OR. It’s how hardware likes to
> combine chunks of bits.)
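[Editor's note: a scalar sketch of the non-lossy pair, i.e. lanewise SAG and its inverse. The ordering of the Flip half is my assumption: the ~m lanes keep their relative order at the top.]

```java
import java.util.Arrays;

// Scalar sketch of the non-lossy pair: compressAll packs the m lanes at
// the bottom and the ~m lanes, in order, at the top (lanewise SAG);
// expandAll is its exact inverse. Flip ordering is assumed, not specified.
public class SagSketch {
    static int trueCount(boolean[] m) {
        int n = 0;
        for (boolean b : m) if (b) n++;
        return n;
    }

    static int[] compressAll(int[] v, boolean[] m) {
        int[] z = new int[v.length];
        int lo = 0, hi = trueCount(m);
        for (int i = 0; i < v.length; i++)
            z[m[i] ? lo++ : hi++] = v[i];   // no lane is lost
        return z;
    }

    static int[] expandAll(int[] z, boolean[] m) {
        int[] v = new int[z.length];
        int lo = 0, hi = trueCount(m);
        for (int i = 0; i < z.length; i++)
            v[i] = z[m[i] ? lo++ : hi++];
        return v;
    }

    public static void main(String[] args) {
        int[] v = { 1, 2, 3, 4 };
        boolean[] m = { true, false, true, false };
        int[] z = compressAll(v, m);
        if (!Arrays.equals(z, new int[] { 1, 3, 2, 4 })) throw new AssertionError();
        // the identity a == a.compressAll(m).expandAll(m)
        if (!Arrays.equals(expandAll(z, m), v)) throw new AssertionError();
        System.out.println("ok");
    }
}
```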
>
> All in all it’s a nice bit of theory. The connection
> between SAG/PDEP/PEXT masks and shuffles is
> deep. You can represent any shuffle at all, using
> lg(VLEN) masks in this way.
>