[vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods
John Rose
john.r.rose at oracle.com
Tue Oct 5 01:17:23 UTC 2021
On Oct 4, 2021, at 3:59 PM, Sandhya Viswanathan <sviswanathan at openjdk.java.net> wrote:
> With current definition, from the intrinsic perspective the compress(m, v) can be directly translated to the vcompressps/vcompresspd instruction.
> Also the following holds true:
> v == v.compress(m).expand(m, v)
> v == v.expand(m).compress(m, v)
Those are cute identities but they are not very useful
in practice. If you want to recover v from its compressed
image (v.compress(m)), it’s not very helpful to say
that you can recover it fully, but first you need a
copy of v itself for the second argument to expand.
So those identities are kind of expository, but they
are not evidence of useful processing of the second
argument. I think the corresponding identities,
without the second v argument, would be more
illuminating, less magical, more useful:
v == v.blend( v.compress(m).expand(m), m )
v == v.blend( v.expand(m).compress(m), m.compress() )
I think “m.compress()” would be a worthy addition
to the API, to support working with compression
and expansion. It amounts to:
m.compress() = m.vectorSpecies().indexInRange(0, m.trueCount())
(There is no m.expand(), since m.expand() is just m.)
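For concreteness, here is a minimal sketch of that mask operation against the existing API, plus a check of the first blend identity; it assumes the single-argument Vector::compress/expand under discussion and nothing else that is not already in the API:

    import jdk.incubator.vector.*;

    // Sketch only: the proposed VectorMask::compress, built on indexInRange.
    static <E> VectorMask<E> maskCompress(VectorMask<E> m) {
        return m.vectorSpecies().indexInRange(0, m.trueCount());
    }

    // Check of the first blend identity, using the proposed single-argument forms.
    static boolean blendIdentityHolds(IntVector v, VectorMask<Integer> m) {
        IntVector roundTrip = v.compress(m).expand(m);   // zero-filling compress, then expand
        return v.eq(v.blend(roundTrip, m)).allTrue();    // recovers v exactly
    }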
Surely the JIT does not need the extra overloadings
of compress and expand in order to select the
extra argument for vcompress and vexpand.
If the blend is present in the idiom, then the
special instruction can be selected. (We’d
want an intrinsic for VectorMask::compress
in that case.) Otherwise, the zero-filling
version of the instruction can be selected.
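Concretely, the two code shapes the JIT would have to tell apart might look like this (a sketch; src and background are illustrative locals, compress is the proposed single-argument form, and maskCompress is the sketch just above):

    IntVector zeroFilled = src.compress(m);                  // could map to a zero-filling vcompressps
    IntVector merged = background.blend(src.compress(m),
                                        maskCompress(m));    // could map to a merge-masked vcompressps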
All this is to say, after looking at the proposal,
I’m afraid the extra v argument does not pull
its weight.
To be more constructive, let me suggest what
might pull more weight in this area. I’m starting
from the assumption that typical uses of
expand and compress will not be isolated,
but will help implement conditional features
of loops. So compress is a loop over lanes
conditionally storing into contiguous output,
and expand is a loop over lanes conditionally
loading from contiguous input.
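In scalar terms the compressing case is just the familiar filtering loop (a sketch; "interesting" stands in for whatever per-element condition the vector code evaluates):

    // Scalar shape of a compressing loop: conditionally store into contiguous output.
    int zi = 0;
    for (int ai = 0; ai < a.length; ai++) {
        if (interesting(a[ai])) {     // stand-in predicate
            z[zi++] = a[ai];
        }
    }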
To me this immediately suggests that the best
“optional argument” for these guys would be
some hook to resume a compression or expansion
activity started in a previous vectorized operation.
Here are compressing and expanding loops, where
the result of each iteration is stored to or loaded
from memory without other loop-carried values:
// compress from a[*] to z[*]
int ai = 0, zi = 0;
while (ai < a.length) {
    vector as = fromArray(SPECIES, a, ai);
    mask m = interestingBits(as, …);
    vector zs = as.compress(m);
    zs.intoArray(z, zi, m.compress());
    ai += SPECIES.length();
    zi += m.trueCount();
}
// expand from z[*] updating a[*]
int ai = 0, zi = 0;
while (ai < a.length) {
    vector as = fromArray(SPECIES, a, ai);
    mask m = interestingBits(as, …);
    vector zs = fromArray(SPECIES, z, zi, m.compress());
    as = as.blend(zs.expand(m), m);
    as.intoArray(a, ai);
    ai += SPECIES.length();
    zi += m.trueCount();
}
Both loops make integral use of m.trueCount()
and m.compress().
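Rendered against the concrete API, the compressing loop might look like this (a sketch with the same imports as above: it assumes the single-argument compress under discussion, expresses m.compress() via indexInRange, treats interestingBits as an illustrative predicate, and ignores the loop tail by assuming a.length is a multiple of the species length):

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static void compressInteresting(int[] a, int[] z) {
        int ai = 0, zi = 0;
        while (ai < a.length) {
            IntVector as = IntVector.fromArray(SPECIES, a, ai);
            VectorMask<Integer> m = interestingBits(as);                  // illustrative predicate
            IntVector zs = as.compress(m);                                // proposed single-argument form
            zs.intoArray(z, zi, SPECIES.indexInRange(0, m.trueCount()));  // i.e. m.compress()
            ai += SPECIES.length();
            zi += m.trueCount();                                          // stride past the stored lanes
        }
    }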
The above exercise suggests that, if we have enough
“hooks” to do unaligned memory loads and stores,
with mask-dependent strides, we can do without
any extra overloadings of compress and expand.
And those “hooks” appear to be:
- masked load and store, where the masks are contiguous
- a way to compute a contiguous mask by “compressing” an arbitrary one
- a way to stride past the result of storing such a contiguous partial (+=trueCount)
That’s enough for now to think about, but I am also thinking
there might be something useful in either of two other
directions: loop-carried buffers, and non-lossy compression
and expansion operations.
In brief, a loop-carried buffer is a vector register which
holds a partial vector’s worth of “results so far”. Each
iteration packs more “stuff” into the buffer, increasing
the “results so far”. When there’s a full vector’s worth
of results, those get pushed to memory (no masking)
and the “results so far” are edited down. This means
there’s a loop-carried buffer vector, plus an index
and/or mask which shows what’s there. Also, the
compress and expand operations have to be able
to take an extra input *and* output, to take
proper account of the “results so far”. The Vector
API has a pattern for multiple outputs: it’s the
“part number” pattern.
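For reference, that pattern in today's API looks like this for unslice, whose logical result is double-width and whose part argument (0 or 1) selects which half you get back:

    // Both halves of an unslice; v, w, and origin are illustrative.
    static IntVector[] unsliceBothParts(IntVector v, IntVector w, int origin) {
        IntVector part0 = v.unslice(origin, w, 0);  // w with v's lower lanes placed at origin and up
        IntVector part1 = v.unslice(origin, w, 1);  // w with v's spilled upper lanes at the bottom
        return new IntVector[] { part0, part1 };
    }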
All that buffering is a much more complicated code shape
than the memory-based loops above, but it should unroll
much better, I think, into operations which stay inside
the VPU for longer.
I think a good way to flesh out more details of this
kind of loop-carried compress or expand loop would
be to think of the loop state as consisting of a pair
of vectors and pair of masks. When you compress
or expand state in the current iteration, you also
include, as a symmetrical input, the “results so
far”. For compressing, the “results so far” would
be a vector and mask, already compressed, and
you would jointly compress *that*, plus the next
vector (in the current iteration), with *its* mask,
to yield a double-sized compressed vector pair.
If half or more of it is filled, you store that and
discard, adjusting masks. In any case, you have
your new “results so far”. I think there is a
corresponding structure for the other loop,
obtained by judicious arrow reversal.
If you think I’m on the right track there, then
you’ll have realized that re-compressing the
“results so far” is a waste of energy. So the
real buffered compression operation we want
is something like this:
// compress from a[*] to z[*]
vector buf = SPECIES.zero();   // loop-carried “results so far”, initially empty
int buflen = 0;
int ai = 0, zi = 0;
while (ai < a.length) {
    vector as = fromArray(SPECIES, a, ai);
    mask m = interestingBits(as, …);
    vector zs;
    (zs, buf, buflen) = as.compressOnTopOfBuf(m, buf, buflen);
    if (buflen >= SPECIES.length()) {
        buf.intoArray(z, zi);
        zi += SPECIES.length();
        buf = zs;
        buflen -= SPECIES.length();
    }
    ai += SPECIES.length();
}
Here the double-barreled compression operation is:
define as.compressOnTopOfBuf(m, buf, buflen) {
    vector zs = as.compress(m);
    (zs, buf, buflen) = (
        zs.unslice(buflen, buf, 1),
        zs.unslice(buflen, buf, 0),
        buflen + m.trueCount()
    );
    return (zs, buf, buflen);
}
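In today's types that step might be rendered roughly like this (a sketch; the record exists only to carry the three results, and compress is the proposed single-argument form):

    record CompressStep(IntVector overflow, IntVector buf, int buflen) {}

    static CompressStep compressOnTopOfBuf(IntVector as, VectorMask<Integer> m,
                                           IntVector buf, int buflen) {
        IntVector zs = as.compress(m);         // pack the selected lanes to the bottom
        return new CompressStep(
            zs.unslice(buflen, buf, 1),        // part 1: lanes that spill past the buffer
            zs.unslice(buflen, buf, 0),        // part 0: the buffer with new lanes stacked on top
            buflen + m.trueCount());
    }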
I think there’s a similar double-barreled expanding
operation.
Another direction of design is non-lossy compression.
By that I mean that we re-imagine compression and
expansion as proper inverses of each other. The
mask does not edit out information in a vector
being compressed or expanded, but rather
rearranges the information, so that zero lanes are
never injected: every bit of
the input is present *somewhere* in the output.
That, of course, is just the SAG (sheep and goats)
operation of Hacker’s Delight, applied lanewise.
It is also the lanewise version of the x86 instructions
PDEP and PEXT, which operate (like HD SAG)
on lanes of one bit in broadwords.
Since vector hardware does not have lanewise SAG
or its inverse, I think a balanced design that supports
non-lossy compress and expand could be composed
of these methods:
z = a.compress(m);      // normal right-moving compress
y = b.compressFlip(m);  // left-moving compress, on ~m
a = z.expand(m);        // normal left-moving expand
b = y.expandFlip(m);    // right-moving expand, on ~m
These can be implemented efficiently with today’s
hardware. The “Flip” versions need a cross-lane
shift on the bitcount of the mask.
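For example, the left-moving compress could be composed from the right-moving one plus that shift (a sketch; compressFlip is the hypothetical method named above, and compress is the proposed single-argument form):

    static IntVector compressFlip(IntVector b, VectorMask<Integer> m) {
        return b.compress(m.not())         // pack the ~m lanes down to the bottom
                .unslice(m.trueCount());   // cross-lane shift up by trueCount(m), zero fill below
    }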
compressFlip is inverse to expandFlip in exactly
the same sense that compress is inverse to expand.
The non-lossy operations built on top would look
like this:
v.compressAll(m) := v.compress(m) |^| v.compressFlip(m)
z.expandAll(m) := z.expand(m) |^| z.expandFlip(m)
It is also useful to derive a shuffle from either expandAll
or compressAll. This can be done by forming an iota
(index) vector and compressing or expanding the index
values, then “cooking” the resulting permutation vector
into a shuffle.
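A sketch of that cooking step, with the compressFlip composition above inlined (compressAll itself is hypothetical; the resulting shuffle computes it for any vector of the species):

    static VectorShuffle<Integer> compressAllShuffle(VectorSpecies<Integer> species,
                                                     VectorMask<Integer> m) {
        IntVector iota = (IntVector) species.iota();        // [0, 1, 2, ...]
        IntVector perm = iota.compress(m)                   // indexes of the m lanes, at the bottom
            .lanewise(VectorOperators.OR,                   // |^|: the two halves never overlap
                      iota.compress(m.not()).unslice(m.trueCount()));  // indexes of the ~m lanes, on top
        return perm.toShuffle();    // v.rearrange(result) == v.compressAll(m)
    }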
Here are the very simple identities that these non-lossy
operations have:
z == z.expandAll(m).compressAll(m)
a == a.compressAll(m).expandAll(m)
(BTW the symbol |^| is a bitwise exclusive OR where it is
asserted that, in each bit position, at most one of the
inputs is true; it is therefore also a bitwise inclusive OR.
It’s how hardware likes to combine chunks of bits.)
All in all it’s a nice bit of theory. The connection
between SAG/PDEP/PEXT masks and shuffles is
deep. You can represent any shuffle at all, using
lg(VLEN) masks in this way.