[vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods

Tue Oct 5 01:41:31 UTC 2021

On Oct 4, 2021, at 6:17 PM, John Rose <john.r.rose at oracle.com<mailto:john.r.rose at oracle.com>> wrote:

All in all it’s a nice bit of theory.  The connection
between SAG/PDEP/PEXT masks and shuffles is
deep.  You can represent any shuffle at all, using
lg(VLEN) masks in this way.

P.S. I think the compressAll operation makes an
intriguing primitive for vectorized radix sort,
on a small scale.  And the expandAll operation
might play a role in a merge phase of a sort algorithm.

But I think it would be better for hardware designers
to take yet another look at their on-chip cross-lane
routing networks, and consider whether it would be
useful to steer those networks using data in the lanes
themselves.  This operation (and its inverse) is a
generalization of compress (and expand) which is
probably not far off from what silicon can do, and
is very general and powerful:

w = k.sort(v)

where the lanes of k are rearranged stably in ascending
order (interpreted, say, as unsigned ints) and the lanes
of v are rearranged in unison with k.  k is discarded and
the rearranged lanes of v are stored to w.

The inverse is:

x = k.unsort(w)

Here, w is inversely permuted so that given the resulting x,
it would be the case that k.sort(x) is just w again.

If you just want to sort one vector, and it already has the
“native” unsigned key format:

w = v.sort(v)

If you want to sort signed or floating point numbers v,
generate an unsigned key vector k appropriately by
manipulating the bit representations of v’s lanes:

w = deriveUnsignedSortKeys(v).sort(v)

If you want to record a permutation vector for later use
as a shuffle:

j = k.sort(k.iota())

The inverse of that permutation vector is:

j = k.unsort(k.iota())

In general, if j is a permutation vector, then we have:

j == j.unsort(j.iota())
j.sort(j) == j.iota()

Also, normal shuffle/permute operations are revealed
as a special case of sorting:

v.rearrange(j.toShuffle()) == j.sort(v)

And that is why I think the hardware is not far from the
more general sort operation; the hardware needs more
wires but essentially the same gates.

The compress and expand operations are revealed as
sort and unsort operations on one-bit keys:

v.compress(m) == m.toVector().sort(v)
v.expand(m) == m.toVector().unsort(v)