[vectorIntrinsics+compress] RFR: 8274664: Add support for compress/expand api methods
John Rose
john.r.rose at oracle.com
Tue Oct 5 01:41:31 UTC 2021
On Oct 4, 2021, at 6:17 PM, John Rose <john.r.rose at oracle.com<mailto:john.r.rose at oracle.com>> wrote:
All in all it’s a nice bit of theory. The connection
between SAG/PDEP/PEXT masks and shuffles is
deep. You can represent any shuffle at all, using
lg(VLEN) masks in this way.
P.S. I think the compressAll operation makes an
intriguing primitive for vectorized radix sort,
on a small scale. And the expandAll operation
might play a role in a merge phase of a sort algorithm.
But I think it would be better for hardware designers
to take yet another look at their on-chip cross-lane
routing networks, and consider whether it would be
useful to steer those networks using data in the lanes
themselves. This operation (and its inverse) is a
generalization of compress (and expand) which is
probably not far off from what silicon can do, and
is very general and powerful:
w = k.sort(v)
where the lanes of k are rearranged stably in ascending
order (interpreted, say, as unsigned ints) and the lanes
of v are rearranged in unison with k. k is discarded and
the rearranged lanes of v are stored to w.
The inverse is:
x = k.unsort(w)
Here, w is inversely permuted so that given the resulting x,
it would be the case that k.sort(x) is just w again.
If you just want to sort one vector, and it already has the
“native” unsigned key format:
w = v.sort(v)
If you want to sort signed or floating point numbers v,
generate an unsigned key vector k appropriately by
manipulating the bit representations of v’s lanes:
w = deriveUnsignedSortKeys(v).sort(v)
If you want to record a permutation vector for later use
as a shuffle:
j = k.sort(k.iota())
The inverse of that permutation vector is:
j = k.unsort(k.iota())
In general, if j is a permutation vector, then we have:
j == j.unsort(j.iota())
j.sort(j) == j.iota()
Also, normal shuffle/permute operations are revealed
as a special case of sorting:
v.rearrange(j.toShuffle()) == j.sort(v)
And that is why I think the hardware is not far from the
more general sort operation; the hardware needs more
wires but essentially the same gates.
The compress and expand operations are revealed as
sort and unsort operations on one-bit keys:
v.compress(m) == m.toVector().sort(v)
v.expand(m) == m.toVector().unsort(v)
More information about the panama-dev
mailing list