[External] : Re: UTF-8 validation using Vector API

Wed Feb 23 17:07:11 UTC 2022

It’s easy to confuse compress-like operations with expand-like operations,
or (as I said in the doc) “pushing” vs. “pulling” reordering.  I think both hardware designers and programmers assume that, given a fully general “pulling” primitive (shuffle/permute using index vector) it must be just a quick step to the other kind of operation.  But in fact, although the hardware design is doable either way, the software design is hard in one direction, if the primitives are missing.  Which they are.

Specifically, I like to characterize the “missing primitive” (the one that “pushes” values based on associated indexes or keys or mask bits) as a kind  “sort”, and note that the standard shuffle/permute operation is just the inverse, which you could call “unsort” to make the difference clear.  On hardware which supports a native “compress” operation driven by an associated mask, you basically get a one-bit sort key (and have to do it twice); from there you can build up more complex sort algorithms by picking masks based on comparisons to pivots.  But I hope some day hardware designers will take a second look at “compress” and consider steering it using a parallel vector/array of key values, rather than a series of mask bits.

Your suggestion of treating the mask as an index into a table of permutation vectors is a good one, though it doesn’t scale well.  There are ways to partition the mask and do block-wise compresses using table lookup, and then fix the problem by merging the separately sorted blocks.  This is the hallmark of (all?) sort algorithms:  You can subdivide the problem, but then there’s always a residual merge to recombine the subdivided parts.

On 22 Feb 2022, at 8:44, Quân Anh Mai wrote:

>> Your wonderful document has a lot of ideas, I just want to add that you
> don't need to involve memory to replicate a compress operation as you can
> compute the cumulative sum of the not of the mask then subtract that from a
> iota index vector to achieve a shuffle index vector.
>
> Silly me messed up this part, please ignore it. Another way (should be
> correct this time) would be to do a lookup on the value of the mask to find
> the correct shuffle
>