[vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods

Viswanathan, Sandhya sandhya.viswanathan at intel.com
Tue Oct 5 20:26:33 UTC 2021


Hi John and Paul,

Thanks a lot for your input; it offers a lot of insight and is very helpful.
Is it correct to summarize as below?
1) At a minimum we would like to have: v.compress(m), v.expand(m), and m.compress().
2) v.compressFlip(m) can be implemented by the user as v.compress(m.not()).unslice(m.trueCount()). Do we also want to include compressFlip as an API method?
3) v.compress(m, v1) and v.expand(m, v1) are not very useful and can be removed.

I was looking at the following example from John, with the goal of forming a loop-carried result vector if possible:
// compress from a[*] to z[*]
int ai = 0, zi = 0;
while (ai < a.length) {
  vector as = fromArray(SPECIES, a, ai);
  mask m = interestingBits(as, ?);
  vector zs = as.compress(m);
  zs.intoArray(z, zi, m.compress());
  ai += SPECIES.length();
  zi += m.trueCount();
}

The following is my attempt at it:
// compress from a[*] to z[*] using the composite vector
int length = SPECIES.length();
int zi = 0;
int elems = 0;                  // number of remaining interesting elements to store
vector zs = SPECIES.zero();     // zs is the loop-carried result vector holding the remaining compressed elements

for (int ai = 0; ai <= a.length - length; ai += length) {
  vector as = fromArray(SPECIES, a, ai);
  mask m = interestingBits(as, ?);
  int true_count = m.trueCount();
  if (true_count > 0) {
    vector rs = as.compress(m);
    zs = rs.unslice(elems, 0).or(zs);  // place the lower part of the newly compressed elements above the elements already in zs
    int nelems = elems + true_count;
    if (nelems >= length) {            // if zs is full
      zs.intoArray(z, zi);             // store zs
      zi += length;                    // update the total number of elements stored into the destination
      zs = rs.unslice(elems, 1);       // place the remaining compressed elements into zs
      nelems -= length;                // remaining number of compressed elements to store
    }
    elems = nelems;                    // set elems for the next iteration
  }
}
if (elems > 0) {                       // compressed elements remain in zs; store them
  zs.intoArray(z, zi, SPECIES.indexInRange(0, elems));
}
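As a scalar cross-check of what the loop above is meant to compute (with a hypothetical `interesting` predicate standing in for `interestingBits`), a plain-Java model would be:

```java
// Scalar model of the buffered compress loop: copy every element of a[]
// that satisfies the predicate into z[], preserving order. The vector
// loop above should produce exactly this result.
public class CompressModel {
    // Hypothetical predicate standing in for interestingBits(as, ?)
    static boolean interesting(int x) { return (x & 1) != 0; }

    // Returns the number of elements written into z.
    public static int compress(int[] a, int[] z) {
        int zi = 0;
        for (int x : a) {
            if (interesting(x)) z[zi++] = x;
        }
        return zi;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] z = new int[a.length];
        int n = compress(a, z);
        System.out.println(n); // 4 odd elements copied: 1 3 5 7
    }
}
```

Comparing a vectorized run against this model for random inputs would make a good correctness test alongside the benchmarks Paul suggests.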

Please let me know what you think about it.

Best Regards,
Sandhya

-----Original Message-----
From: panama-dev <panama-dev-retn at openjdk.java.net> On Behalf Of Paul Sandoz
Sent: Tuesday, October 05, 2021 1:02 PM
To: John Rose <john.r.rose at oracle.com>
Cc: Sandhya Viswanathan <sviswanathan at openjdk.java.net>; panama-dev <panama-dev at openjdk.java.net>
Subject: Re: [vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods

Hi John,

I agree with your assessment on the vector accepting methods. 

After I wrote my brief comment on the PR to get things rolling I was thinking about blends and mask compression along similar lines.

IMO we should add benchmarks for the important use-cases you mention around the loops. That should help focus us on the API points and intrinsics. I think it’s a good way to move forward and increase our understanding towards an API and implementation.

Paul. 
 



> On Oct 4, 2021, at 6:17 PM, John Rose <john.r.rose at oracle.com> wrote:
> 
> On Oct 4, 2021, at 3:59 PM, Sandhya Viswanathan <sviswanathan at openjdk.java.net<mailto:sviswanathan at openjdk.java.net>> wrote:
> 
> With current definition, from the intrinsic perspective the compress(m,  v) can be directly translated to the vcompressps/vcompresspd instruction.
> Also the following holds true:
> v == v.compress(m).expand(m, v)
> v == v.expand(m).compress(m, v)
> 
> Those are cute identities but they are not very useful in practice.  
> If you want to recover v from its compressed image (v.compress(m)), 
> it’s not very helpful to say that you can recover it fully, but first 
> you need a copy of v itself for the second argument to expand.
> 
> So those identities are kind of expository, but they are not evidence 
> of useful processing of the second argument.  I think the 
> corresponding identities, without the second v argument, would be more 
> illuminating, less magical, more useful:
> 
> v == v.blend( v.compress(m).expand(m), m )
> v == v.blend( v.expand(m).compress(m), m.compress() )
> 
> I think “m.compress()” would be a worthy addition to the API, to 
> support working with compression and expansion.  It amounts to:
> 
> m.compress() = m.vectorSpecies().indexInRange(0, m.trueCount())
> 
> (There is no m.expand(), since m.expand() is just m.)
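[A scalar model of the proposed m.compress(), with a boolean array standing in lane-by-lane for a mask, would be:]

```java
// Scalar model of the proposed VectorMask::compress: the result has the
// same trueCount() as the input, but with all true lanes packed at the
// low end -- i.e. indexInRange(0, m.trueCount()).
public class MaskCompress {
    public static boolean[] compress(boolean[] m) {
        int t = 0;
        for (boolean b : m) if (b) t++;          // m.trueCount()
        boolean[] r = new boolean[m.length];
        for (int i = 0; i < t; i++) r[i] = true; // contiguous prefix
        return r;
    }

    public static void main(String[] args) {
        boolean[] m = {false, true, true, false, true, false, false, false};
        // 3 true lanes become the contiguous prefix [true, true, true, ...]
        System.out.println(java.util.Arrays.toString(compress(m)));
    }
}
```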
> 
> Surely the JIT does not need the extra overloadings of compress and 
> expand in order to select the extra argument for vcompress and 
> vexpand.
> If the blend is present in the idiom, then the special instruction can 
> be selected.  (We’d want an intrinsic for VectorMask::compress in that 
> case.)  Otherwise, the zero-filling version of the instruction can be 
> selected.
> 
> All this is to say, after looking at the proposal, I’m afraid the 
> extra v argument does not pull its weight.
> 
> To be more constructive, let me suggest what might pull more weight in 
> this area.  I’m starting from the assumption that typical uses of 
> expand and compress will not be isolated, but will help implement 
> conditional features of loops.  So compress is a loop over lanes 
> conditionally storing into contiguous output, and expand is a loop 
> over lanes conditionally loading from contiguous input.
> 
> To me this immediately suggests that the best “optional argument” for 
> these guys would be some hook to resume a compression or expansion 
> activity started in a previous vectorized operation.
> 
> Here are compressing and expanding loops, where the result of each 
> iteration is stored to or loaded from memory without other 
> loop-carried values:
> 
> // compress from a[*] to z[*]
> int ai = 0, zi = 0;
> while (ai < a.length) {
>   vector as = fromArray(SPECIES, a, ai);
>   mask m = interestingBits(as, …);
>   vector zs = as.compress(m);
>   zs.intoArray(z, zi, m.compress());
>   ai += SPECIES.length();
>   zi += m.trueCount();
> }
> 
> // expand from z[*] updating a[*]
> int ai = 0, zi = 0;
> while (ai < a.length) {
>   vector as = fromArray(SPECIES, a, ai);
>   mask m = interestingBits(as, …);
>   vector zs = fromArray(SPECIES, z, zi, m.compress());
>   as = as.blend(zs.expand(), m);
>   as.intoArray(a, ai);
>   ai += SPECIES.length();
>   zi += m.trueCount();
> }
> 
> Both loops make integral use of m.trueCount() and m.compress().
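[The per-iteration effect of the expand loop can be modeled in scalar Java; the blend is folded into the lanewise choice, and the names are illustrative only:]

```java
// Scalar model of one iteration of the expand loop: lanes selected by m
// are filled, in order, from the compressed source zs; other lanes keep
// their value from as (the as.blend(zs.expand(), m) step above).
public class ExpandModel {
    public static int[] expandBlend(int[] as, boolean[] m, int[] zs) {
        int[] r = as.clone();
        int zi = 0;
        for (int i = 0; i < m.length; i++) {
            if (m[i]) r[i] = zs[zi++];   // zi advances by m.trueCount() overall
        }
        return r;
    }

    public static void main(String[] args) {
        int[] as = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        int[] zs = {7, 8};               // already-compressed input
        // Lanes 0 and 2 are replaced: [7, 20, 8, 40]
        System.out.println(java.util.Arrays.toString(expandBlend(as, m, zs)));
    }
}
```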
> 
> The above exercise suggests that, if we have enough “hooks” to do 
> unaligned memory loads and stores, with mask-dependent strides, we can 
> do without any extra overloadings of compress and expand.
> 
> And those “hooks” appear to be:
> 
> - masked load and store, where the masks are contiguous
> - a way to compute a contiguous mask by “compressing” an arbitrary one
> - a way to stride past the result of such a contiguous partial store 
> (+=trueCount)
> 
> That’s enough for now to think about, but I am also thinking there 
> might be something useful in either of two other
> directions: loop-carried buffers, and non-lossy compression and 
> expansion operations.
> 
> In brief, a loop-carried buffer is a vector register which holds a 
> partial vector worth of “results so far”.  Each iteration packs more 
> “stuff” into the buffer, increasing the “results so far”.  When 
> there’s a full vector’s worth of results, those get pushed to memory 
> (no masking) and the “results so far” are edited down.  This means 
> there’s a loop-carried buffer vector, plus an index and/or mask which 
> shows what’s there.  Also, the compress and expand operations have to 
> be able to take an extra input *and* output to take proper account of 
> the “results so far”.  The Vector API has a pattern for multiple 
> outputs:  It’s the “part number” pattern.
> 
> All that is a much more complicated code shape than the memory-based 
> loops above, but it should unroll much better, I think, to operations 
> which stay inside the VPU for longer.
> 
> I think a good way to flesh out more details of this kind of 
> loop-carried compress or expand loop would be to think of the loop 
> state as consisting of a pair of vectors and pair of masks.  When you 
> compress or expand state in the current iteration, you also include, 
> as a symmetrical input, the “results so far”.  For compressing, the 
> “results so far” would be a vector and mask, already compressed, and 
> you would jointly compress *that*, plus the next vector (in the 
> current iteration), with *its* mask, to yield a double-sized 
> compressed vector pair.
> If half or more of it is filled, you store that and discard, adjusting 
> masks.  In any case, you have your new “results so far”.  I think 
> there is a corresponding structure for the other loop, obtained by 
> judicious arrow reversal.
> 
> If you think I’m on the right track there, then you’ll have realized 
> that re-compressing the “results so far” is a waste of energy.  So the 
> real buffered compression operation we want is something like this:
> 
> // compress from a[*] to z[*]
> vector buf;
> int buflen;
> int ai = 0, zi = 0;
> while (ai < a.length) {
>   vector as = fromArray(SPECIES, a, ai);
>   mask m = interestingBits(as, …);
>   vector zs;
>   (zs, buf, buflen) = as.compressOnTopOfBuf(m, buf, buflen);
>   if (buflen >= SPECIES.length()) {
>      buf.intoArray(z, zi);
>      zi += SPECIES.length();
>      buf = zs;
>      buflen -= SPECIES.length();
>   }
>   ai += SPECIES.length();
> }
> 
> Here the double-barreled compression operation is:
> 
> define as.compressOnTopOfBuf(m, buf, buflen) {
>   vector zs = as.compress(m);
>   (zs, buf, buflen) = (
>      zs.unslice(buflen, buf, 1),
>      zs.unslice(buflen, buf, 0),
>      buflen + m.trueCount()
>   );
>   return (zs, buf, buflen);
> }
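[The bookkeeping in compressOnTopOfBuf can be modeled with plain arrays; the triple result is modeled here by filling two caller-supplied arrays and returning the new buflen. This is only a sketch of the intended semantics:]

```java
// Scalar model of as.compressOnTopOfBuf(m, buf, buflen): the lanes of as
// selected by m are appended on top of the buflen elements already in buf;
// elements that do not fit spill into zs starting at lane 0. Returns the
// new buflen, which may exceed buf.length -- the caller then stores buf,
// swaps zs in as the new buffer, and subtracts a vector length.
public class BufCompress {
    public static int compressOnTop(int[] as, boolean[] m,
                                    int[] buf, int buflen, int[] zs) {
        for (int i = 0; i < as.length; i++) {
            if (m[i]) {
                if (buflen < buf.length) buf[buflen] = as[i];
                else zs[buflen - buf.length] = as[i];  // spill lane
                buflen++;
            }
        }
        return buflen;
    }

    public static void main(String[] args) {
        int[] buf = {1, 2, 0, 0};        // two "results so far"
        int[] zs = new int[4];
        int[] as = {3, 9, 4, 5};
        boolean[] m = {true, false, true, true};
        int buflen = compressOnTop(as, m, buf, 2, zs);
        // buf becomes {1, 2, 3, 4}, zs[0] == 5, buflen == 5 (overfull)
        System.out.println(buflen);
    }
}
```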
> 
> I think there’s a similar double-barreled expanding operation.
> 
> Another direction of design is non-lossy compression.
> By that I mean that we re-imagine compression and expansion as proper 
> inverses of each other.  The mask does not edit out information in a 
> vector being compressed or expanded, but rather rearranges the 
> information, so that there never are zero lanes that get injected:  
> Every bit of the input is present *somewhere* in the output.
> 
> That, of course, is just the SAG (sheep and goats) operation of 
> Hacker’s Delight, applied lanewise.
> It is also the lanewise version of the x86 instructions PDEP and PEXT, 
> which operate (like HD SAG) on lanes of one bit in broadwords.
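[On one-bit lanes these operations can be modeled with loop-based PEXT/PDEP in plain Java. The sag method below uses one possible convention, with the mask-selected bits at the low end and the remaining bits packed above them:]

```java
// Loop-based models of x86 PEXT/PDEP and the sheep-and-goats (SAG)
// operation from Hacker's Delight, on one-bit lanes of a long.
public class BitSag {
    // PEXT: gather the bits of x selected by mask into the low end.
    public static long pext(long x, long mask) {
        long r = 0;
        int j = 0;
        for (int i = 0; i < 64; i++) {
            if ((mask >>> i & 1) != 0) {
                r |= (x >>> i & 1) << j;
                j++;
            }
        }
        return r;
    }

    // PDEP: scatter the low bits of x out to the positions selected by mask.
    public static long pdep(long x, long mask) {
        long r = 0;
        int j = 0;
        for (int i = 0; i < 64; i++) {
            if ((mask >>> i & 1) != 0) {
                r |= (x >>> j & 1) << i;
                j++;
            }
        }
        return r;
    }

    // SAG: mask-selected bits at the low end, the rest packed above them.
    // Nothing is discarded, so the operation is invertible.
    public static long sag(long x, long mask) {
        return pext(x, mask) | (pext(x, ~mask) << Long.bitCount(mask));
    }

    public static void main(String[] args) {
        System.out.println(Long.toBinaryString(pext(0b10110100L, 0b11110000L))); // 1011
        System.out.println(Long.toBinaryString(sag(0b10110100L, 0b11110000L)));  // 1001011
    }
}
```

Note that pdep(pext(x, m), m) == (x & m), the bit-level analogue of the blend-based identities above.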
> 
> Since vector hardware does not have lanewise SAG or its inverse, I 
> think a balanced design that supports non-lossy compress and expand 
> could be composed of these methods:
> 
> z = a.compress(m);      // normal right-moving compress
> y = b.compressFlip(m);  // left-moving compress, on ~m
> a = z.expand(m);        // normal left-moving expand
> b = y.expandFlip(m);    // right-moving expand, on ~m
> 
> These can be implemented efficiently with today’s hardware.  The 
> “Flip” versions need a cross-lane shift on the bitcount of the mask.
> 
> compressFlip is inverse to expandFlip in exactly the same sense that 
> compress is inverse to expand.
> 
> The non-lossy operations built on top would look like this:
> 
> v.compressAll(m) := v.compress(m) |^| v.compressFlip(m)
> z.expandAll(m) := z.expand(m) |^| z.expandFlip(m)
> 
> It is also useful to derive a shuffle from either expandAll or 
> compressAll.  This can be done by forming an iota
> (index) vector and compressing or expanding the index values, then 
> “cooking” the resulting permutation vector into a shuffle.
> 
> Here are the very simple identities that these non-lossy operations 
> have:
> 
> z == z.expandAll(m).compressAll(m)
> a == a.compressAll(m).expandAll(m)
> 
> (BTW the symbol |^| is a bitwise exclusive OR where it is asserted 
> that at most one bit input is true; it is therefore also a bitwise 
> inclusive OR.  It’s how hardware likes to combine chunks of bits.)
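[These identities can be checked with a scalar lanewise model, choosing one ordering convention for the flipped half; the method names here are illustrative, not proposed API:]

```java
// Scalar lanewise model of the non-lossy pair: compressAll keeps every
// input lane (selected lanes packed low, the rest packed above them),
// and expandAll is its exact inverse.
public class NonLossy {
    public static int[] compressAll(int[] v, boolean[] m) {
        int n = v.length;
        int[] r = new int[n];
        int j = 0;
        for (int i = 0; i < n; i++) if (m[i])  r[j++] = v[i];  // compress
        for (int i = 0; i < n; i++) if (!m[i]) r[j++] = v[i];  // compressFlip half
        return r;
    }

    public static int[] expandAll(int[] z, boolean[] m) {
        int n = z.length;
        int t = 0;
        for (boolean b : m) if (b) t++;          // m.trueCount()
        int[] r = new int[n];
        int lo = 0, hi = t;
        // Low lanes scatter to m positions, the rest to ~m positions.
        for (int i = 0; i < n; i++) r[i] = m[i] ? z[lo++] : z[hi++];
        return r;
    }

    public static void main(String[] args) {
        int[] v = {10, 11, 12, 13, 14, 15};
        boolean[] m = {false, true, true, false, true, false};
        int[] z = compressAll(v, m);             // {11, 12, 14, 10, 13, 15}
        // expandAll(z, m) restores v exactly: no lane was lost
        System.out.println(java.util.Arrays.toString(expandAll(z, m)));
    }
}
```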
> 
> All in all it’s a nice bit of theory.  The connection between 
> SAG/PDEP/PEXT masks and shuffles is deep.  You can represent any 
> shuffle at all, using
> lg(VLEN) masks in this way.
> 


