[vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods
Paul Sandoz
paul.sandoz at oracle.com
Tue Oct 5 20:01:54 UTC 2021
Hi John,
I agree with your assessment on the vector accepting methods.
After I wrote my brief comment on the PR to get things rolling I was thinking about blends and mask compression along similar lines.
IMO we should add benchmarks for the important use cases you mention around the loops. That should help focus us on the API points and intrinsics. I think it’s a good way to move forward and build our understanding toward an API and implementation.
Paul.
> On Oct 4, 2021, at 6:17 PM, John Rose <john.r.rose at oracle.com> wrote:
>
> On Oct 4, 2021, at 3:59 PM, Sandhya Viswanathan <sviswanathan at openjdk.java.net<mailto:sviswanathan at openjdk.java.net>> wrote:
>
> With current definition, from the intrinsic perspective the compress(m, v) can be directly translated to the vcompressps/vcompresspd instruction.
> Also the following holds true:
> v == v.compress(m).expand(m, v)
> v == v.expand(m).compress(m, v)
>
> Those are cute identities but they are not very useful
> in practice. If you want to recover v from its compressed
> image (v.compress(m)), it’s not very helpful to say
> that you can recover it fully, but first you need a
> copy of v itself for the second argument to expand.
>
> So those identities are kind of expository, but they
> are not evidence of useful processing of the second
> argument. I think the corresponding identities,
> without the second v argument, would be more
> illuminating, less magical, more useful:
>
> v == v.blend( v.compress(m).expand(m), m )
> v == v.blend( v.expand(m).compress(m), m.compress() )
>
> I think “m.compress()” would be a worthy addition
> to the API, to support working with compression
> and expansion. It amounts to:
>
> m.compress() = m.vectorSpecies().indexInRange(0, m.trueCount())
>
> (There is no m.expand(), since m.expand() is just m.)
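[Editor's note: a scalar sketch of what such a VectorMask::compress would compute, modeling a mask as a boolean[]. The class and method names are illustrative, not part of the current Vector API.]

```java
// Scalar model of the proposed VectorMask::compress: the result has
// m.trueCount() leading true lanes, i.e. indexInRange(0, trueCount).
// Names here are illustrative; no such method exists in the API yet.
public class MaskCompressSketch {
    static boolean[] compressMask(boolean[] m) {
        int trueCount = 0;
        for (boolean b : m) if (b) trueCount++;
        boolean[] out = new boolean[m.length];
        for (int i = 0; i < trueCount; i++) out[i] = true;
        return out;
    }

    public static void main(String[] args) {
        boolean[] r = compressMask(new boolean[] { true, false, true, false });
        // the two true lanes move to the front
        if (!java.util.Arrays.equals(r, new boolean[] { true, true, false, false }))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```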
>
> Surely the JIT does not need the extra overloadings
> of compress and expand in order to select the
> extra argument for vcompress and vexpand.
> If the blend is present in the idiom, then the
> special instruction can be selected. (We’d
> want an intrinsic for VectorMask::compress
> in that case.) Otherwise, the zero-filling
> version of the instruction can be selected.
>
> All this is to say, after looking at the proposal,
> I’m afraid the extra v argument does not pull
> its weight.
>
> To be more constructive, let me suggest what
> might pull more weight in this area. I’m starting
> from the assumption that typical uses of
> expand and compress will not be isolated,
> but will help implement conditional features
> of loops. So compress is a loop over lanes
> conditionally storing into contiguous output,
> and expand is a loop over lanes conditionally
> loading from contiguous input.
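[Editor's note: a concrete scalar model of those two lane loops, as a sketch of the zero-filling semantics rather than the actual intrinsic.]

```java
// Scalar model of lanewise compress (pack selected lanes to the front,
// zero-fill the rest) and expand (scatter leading lanes back out to the
// selected positions, zero-fill the rest). A sketch, not the real API.
public class CompressExpandModel {
    static int[] compress(int[] v, boolean[] m) {
        int[] z = new int[v.length];
        int j = 0;
        for (int i = 0; i < v.length; i++)
            if (m[i]) z[j++] = v[i];      // unselected lanes stay zero
        return z;
    }

    static int[] expand(int[] z, boolean[] m) {
        int[] v = new int[z.length];
        int j = 0;
        for (int i = 0; i < z.length; i++)
            if (m[i]) v[i] = z[j++];      // unselected lanes stay zero
        return v;
    }

    public static void main(String[] args) {
        boolean[] m = { true, false, true, false };
        int[] z = compress(new int[] { 1, 2, 3, 4 }, m);
        if (!java.util.Arrays.equals(z, new int[] { 1, 3, 0, 0 }))
            throw new AssertionError();
        if (!java.util.Arrays.equals(expand(z, m), new int[] { 1, 0, 3, 0 }))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```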
>
> To me this immediately suggests that the best
> “optional argument” for these guys would be
> some hook to resume a compression or expansion
> activity started in a previous vectorized operation.
>
> Here are compressing and expanding loops, where
> the result of each iteration is stored to or loaded
> from memory without other loop-carried values:
>
> // compress from a[*] to z[*]
> int ai = 0, zi = 0;
> while (ai < a.length) {
>     vector as = fromArray(SPECIES, a, ai);
>     mask m = interestingBits(as, …);
>     vector zs = as.compress(m);
>     zs.intoArray(z, zi, m.compress());
>     ai += SPECIES.length();
>     zi += m.trueCount();
> }
>
> // expand from z[*] updating a[*]
> int ai = 0, zi = 0;
> while (ai < a.length) {
>     vector as = fromArray(SPECIES, a, ai);
>     mask m = interestingBits(as, …);
>     vector zs = fromArray(SPECIES, z, zi, m.compress());
>     as = as.blend(zs.expand(), m);
>     as.intoArray(a, ai);
>     ai += SPECIES.length();
>     zi += m.trueCount();
> }
>
> Both loops make integral use of m.trueCount()
> and m.compress().
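[Editor's note: for reference, the scalar computations those two vector loops implement. The `interesting` predicate is a hypothetical stand-in for whatever interestingBits computes.]

```java
// Scalar equivalents of the two vector loops above. The predicate
// `interesting` is an illustrative stand-in for interestingBits.
public class ScalarLoops {
    static boolean interesting(int x) { return x > 0; }  // assumption

    // compress from a[*] to z[*]; returns the number of elements written
    static int compressLoop(int[] a, int[] z) {
        int zi = 0;
        for (int ai = 0; ai < a.length; ai++)
            if (interesting(a[ai])) z[zi++] = a[ai];
        return zi;
    }

    // expand from z[*] updating a[*] at the interesting positions
    static void expandLoop(int[] a, int[] z) {
        int zi = 0;
        for (int ai = 0; ai < a.length; ai++)
            if (interesting(a[ai])) a[ai] = z[zi++];
    }

    public static void main(String[] args) {
        int[] a = { 1, -2, 3, -4 };
        int[] z = new int[4];
        int n = compressLoop(a, z);               // z = {1, 3, 0, 0}, n = 2
        if (n != 2 || z[0] != 1 || z[1] != 3) throw new AssertionError();

        int[] b = { 5, -6, 7, -8 };
        expandLoop(b, new int[] { 10, 20 });      // b = {10, -6, 20, -8}
        if (b[0] != 10 || b[2] != 20 || b[1] != -6) throw new AssertionError();
        System.out.println("ok");
    }
}
```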
>
> The above exercise suggests that, if we have enough
> “hooks” to do unaligned memory loads and stores,
> with mask-dependent strides, we can do without
> any extra overloadings of compress and expand.
>
> And those “hooks” appear to be:
>
> - masked load and store, where the masks are contiguous
> - a way to compute a contiguous mask by “compressing” an arbitrary one
> - a way to stride past the result of storing such a contiguous partial (+=trueCount)
>
> That’s enough for now to think about, but I am also thinking
> there might be something useful in either of two other
> directions: loop-carried buffers, and non-lossy compression
> and expansion operations.
>
> In brief, a loop-carried buffer is a vector register which
> holds a partial vector worth of “results so far”. Each
> iteration packs more “stuff” into the buffer, increasing
> the “results so far”. When there’s a full vector’s worth
> of results, those get pushed to memory (no masking)
> and the “results so far” are edited down. This means
> there’s a loop-carried buffer vector, plus an index
> and/or mask which shows what’s there. Also, the
> compress and expand operations have to be able
> to take an extra input *and* output to take
> proper account of the “results so far”. The Vector
> API has a pattern for multiple outputs: it’s the
> “part number” pattern.
>
> All that is a much more complicated code shape than
> the memory-based loops above, but it should unroll
> much better, I think, to operations which stay inside
> the VPU for longer.
>
> I think a good way to flesh out more details of this
> kind of loop-carried compress or expand loop would
> be to think of the loop state as consisting of a pair
> of vectors and pair of masks. When you compress
> or expand state in the current iteration, you also
> include, as a symmetrical input, the “results so
> far”. For compressing, the “results so far” would
> be a vector and mask, already compressed, and
> you would jointly compress *that*, plus the next
> vector (in the current iteration), with *its* mask,
> to yield a double-sized compressed vector pair.
> If half or more of it is filled, you store that and
> discard, adjusting masks. In any case, you have
> your new “results so far”. I think there is a
> corresponding structure for the other loop,
> obtained by judicious arrow reversal.
>
> If you think I’m on the right track there, then
> you’ll have realized that re-compressing the
> “results so far” is a waste of energy. So the
> real buffered compression operation we want
> is something like this:
>
> // compress from a[*] to z[*]
> vector buf;
> int buflen;
> int ai = 0, zi = 0;
> while (ai < a.length) {
>     vector as = fromArray(SPECIES, a, ai);
>     mask m = interestingBits(as, …);
>     vector zs;
>     (zs, buf, buflen) = as.compressOnTopOfBuf(m, buf, buflen);
>     if (buflen >= SPECIES.length()) {
>         buf.intoArray(z, zi);
>         zi += SPECIES.length();
>         buf = zs;
>         buflen -= SPECIES.length();
>     }
>     ai += SPECIES.length();
> }
>
> Here the double-barreled compression operation is:
>
> define as.compressOnTopOfBuf(m, buf, buflen) {
>     vector zs = as.compress(m);
>     (zs, buf, buflen) = (
>         zs.unslice(buflen, buf, 1),
>         zs.unslice(buflen, buf, 0),
>         buflen + m.trueCount()
>     );
>     return (zs, buf, buflen);
> }
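[Editor's note: a scalar model of that double-barreled operation, returning the result triple as arrays. The multi-result encoding and all names are illustrative assumptions.]

```java
import java.util.Arrays;

// Scalar model of compressOnTopOfBuf: append the selected lanes of `as`
// behind the buflen valid elements of `buf`; the first VLEN results become
// the new buf, the overflow becomes zs. A sketch, not a real API.
public class BufferedCompress {
    // returns { zs, newBuf, { newBuflen } }
    static int[][] compressOnTopOfBuf(int[] as, boolean[] m, int[] buf, int buflen) {
        int vlen = as.length;
        int[] concat = new int[2 * vlen];
        System.arraycopy(buf, 0, concat, 0, buflen);
        int j = buflen;
        for (int i = 0; i < vlen; i++)
            if (m[i]) concat[j++] = as[i];
        return new int[][] {
            Arrays.copyOfRange(concat, vlen, 2 * vlen),  // zs (overflow)
            Arrays.copyOfRange(concat, 0, vlen),         // new buf
            { j }                                        // new buflen
        };
    }

    public static void main(String[] args) {
        int[][] r = compressOnTopOfBuf(
            new int[] { 1, 2, 3, 4 },
            new boolean[] { true, true, false, true },
            new int[] { 9, 8, 0, 0 }, 2);
        if (!Arrays.equals(r[1], new int[] { 9, 8, 1, 2 })) throw new AssertionError();
        if (!Arrays.equals(r[0], new int[] { 4, 0, 0, 0 })) throw new AssertionError();
        if (r[2][0] != 5) throw new AssertionError();
        System.out.println("ok");
    }
}
```

In the loop above, buflen reaching 5 (>= VLEN of 4) would trigger the store of the full new buf and promote the overflow zs to be the next buf.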
>
> I think there’s a similar double-barreled expanding
> operation.
>
> Another direction of design is non-lossy compression.
> By that I mean that we re-imagine compression and
> expansion as proper inverses of each other. The
> mask does not edit out information in a vector
> being compressed or expanded, but rather
> rearranges the information, so that there never
> are zero lanes that get injected: Every bit of
> the input is present *somewhere* in the output.
>
> That, of course, is just the SAG (sheep and goats)
> operation of Hacker’s Delight, applied lanewise.
> It is also the lanewise version of the x86 instructions
> PDEP and PEXT, which operate (like HD SAG)
> on lanes of one bit in broadwords.
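[Editor's note: the broadword forms are easy to model with a bit loop; since Java 19 the JDK also exposes them directly as Long.compress (PEXT) and Long.expand (PDEP).]

```java
// Broadword PEXT (compress) and PDEP (expand) on one-bit lanes, as in
// Hacker's Delight; Java 19+ offers Long.compress/Long.expand intrinsics.
public class Broadword {
    static long pext(long x, long mask) {        // gather masked bits low
        long r = 0;
        for (int i = 0, j = 0; i < 64; i++)
            if (((mask >>> i) & 1) != 0)
                r |= ((x >>> i) & 1L) << j++;
        return r;
    }

    static long pdep(long x, long mask) {        // scatter low bits to mask
        long r = 0;
        for (int i = 0, j = 0; i < 64; i++)
            if (((mask >>> i) & 1) != 0)
                r |= ((x >>> j++) & 1L) << i;
        return r;
    }

    public static void main(String[] args) {
        if (pext(0b1011L, 0b1010L) != 0b11L) throw new AssertionError();
        if (pdep(0b11L, 0b1010L) != 0b1010L) throw new AssertionError();
        System.out.println("ok");
    }
}
```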
>
> Since vector hardware does not have lanewise SAG
> or its inverse, I think a balanced design that supports
> non-lossy compress and expand could be composed
> of these methods:
>
> z = a.compress(m); //normal right-moving compress
> y = b.compressFlip(m); //left-moving compress, on ~m
> a = z.expand(m); //normal left-moving expand
> b = y.expandFlip(m); //right-moving expand, on ~m
>
> These can be implemented efficiently with today’s
> hardware. The “Flip” versions need a cross-lane
> shift on the bitcount of the mask.
>
> compressFlip is inverse to expandFlip in exactly
> the same sense that compress is inverse to expand.
>
> The non-lossy operations built on top would look
> like this:
>
> v.compressAll(m) := v.compress(m) |^| v.compressFlip(m)
> z.expandAll(m) := z.expand(m) |^| z.expandFlip(m)
>
> It is also useful to derive a shuffle from either expandAll
> or compressAll. This can be done by forming an iota
> (index) vector and compressing or expanding the index
> values, then “cooking” the resulting permutation vector
> into a shuffle.
>
> Here are the very simple identities that these non-lossy
> operations have:
>
> z == z.expandAll(m).compressAll(m)
> a == a.compressAll(m).expandAll(m)
>
> (BTW the symbol |^| is a bitwise exclusive OR where it is
> asserted that at most one bit input is true; it is therefore
> also a bitwise inclusive OR. It’s how hardware likes to
> combine chunks of bits.)
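[Editor's note: a scalar sketch of the non-lossy pair, i.e. lanewise SAG and its inverse. The ordering of the Flip half is my assumption: the ~m lanes keep their relative order at the top.]

```java
import java.util.Arrays;

// Scalar sketch of the non-lossy pair: compressAll packs the m lanes at
// the bottom and the ~m lanes, in order, at the top (lanewise SAG);
// expandAll is its exact inverse. Flip ordering is assumed, not specified.
public class SagSketch {
    static int trueCount(boolean[] m) {
        int n = 0;
        for (boolean b : m) if (b) n++;
        return n;
    }

    static int[] compressAll(int[] v, boolean[] m) {
        int[] z = new int[v.length];
        int lo = 0, hi = trueCount(m);
        for (int i = 0; i < v.length; i++)
            z[m[i] ? lo++ : hi++] = v[i];   // no lane is lost
        return z;
    }

    static int[] expandAll(int[] z, boolean[] m) {
        int[] v = new int[z.length];
        int lo = 0, hi = trueCount(m);
        for (int i = 0; i < z.length; i++)
            v[i] = z[m[i] ? lo++ : hi++];
        return v;
    }

    public static void main(String[] args) {
        int[] v = { 1, 2, 3, 4 };
        boolean[] m = { true, false, true, false };
        int[] z = compressAll(v, m);
        if (!Arrays.equals(z, new int[] { 1, 3, 2, 4 })) throw new AssertionError();
        // the identity a == a.compressAll(m).expandAll(m)
        if (!Arrays.equals(expandAll(z, m), v)) throw new AssertionError();
        System.out.println("ok");
    }
}
```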
>
> All in all it’s a nice bit of theory. The connection
> between SAG/PDEP/PEXT masks and shuffles is
> deep. You can represent any shuffle at all, using
> lg(VLEN) masks in this way.
>