bitwise vs. bytewise permutations
John Rose
john.r.rose at oracle.com
Mon Jun 4 00:08:22 UTC 2018
One reason I want to get the Shuffle API right is that it interacts even with
normal SIMD lanewise operations. Rearranging the bytes within each lane
is a sometimes best implemented as a whole-vector permutation.
Per-lane rotate is an example I just stumbled across here:
https://gist.github.com/orlp/32f5d1b631ab092608b1#file-chacha-h-L128
Even if rotate is a built-in Vector API method, there are other intra-lane
byte and bit permutations which won't be, and it would be best if there
were a high-level way to specify a Shuffle that performs any specifiable
permutation, using a Shuffle factory method or the like.
Byte-swapping words being transferred through memory is another example
of using shuffles without doing cross-lane transfers.
Cross lane transfers are important, of course, and sometimes they preserve
most or all of lane structure, so they make sense on strongly typed vectors,
not just "bag of bytes" register contents.
So, permuting lanes (not just bytes) is a good example. There are some
special purpose instructions for this in AVX, but I don't want to enshrine
them as ad hoc API points in the Vector API, but rather easy to use
Shuffle operations on typed vectors. They should code-generate down
to general permutation instructions if necessary or else more compact
or efficient ones if possible (on AVX or any other platform).
A particular kind of cross-lane transfer that's important is converting between
vectors of different sizes (expand/compress), either by a constant stride or
using a steering mask (akin to the "Sheep and Goats" algorithm from HD7).
I'd especially like to be able to create factories of "canned" Shuffle objects
which do particular tasks, like swap or reverse or reverse bytes in a lane,
or expand or contract between vector sizes.
So I'd like to be able, eventually, to code data movements habitually using
shuffles, mixed as needed with resizing operations, and have the code generator
pick good instruction sequences, in cases where constant shuffles correspond
to more efficient special-purpose instructions such as the unpack family.
This is especially true for constant Shuffle values, but also (as a long term
stretch goal) for data-dependent Shuffles.
And some of what we learn about the design of byte and lane permutations
will also be applicable to the more exotic case of bit permutations. Eventually
I'd like to be able to extend our framework to encompass bit-vectors and
bit-shuffles, be able to express "reverse lanes" as a common Shuffle constant,
and apply that Shuffle to a bit vector to get true bit reversal. (After all, bit
reversal is available in today's x86 ISAs, and is a method on java.lang.Integer.)
More than that, I'd like to be able to say to a vector of 32-bit values, "treat each
lane as a 32-bit vector and shuffle each lane using a reversal Shuffle for that
lane-vector".
Finally, another class of byte permutations which is sometimes demanded
by algorithms is vector transpose, where a vector of lanes is viewed as a
matrix of lane-row-vectors, and transposed. That of course turns the lane
structure on its head, so it is a more special-purpose operation.
HD chapter 7 gives a fairly comprehensive overview of bit movement
patterns which show up in practice and have reasonable hardware
implementations. Many of them are expressed as bit-level functions
operating on the *indexes* of the lanes or bytes or bits of a composite
vector-like value (say, a word holding a series of bits). If the index is
negated (as i^-1) you reverse the sequence. If the index is incremented
you rotate it. If you *rotate* the *index*, voila you have a matrix transpose.
(Didn't see that coming!) If you work on a subset of the index bits, you
have something like a per-lane operation or a cross-lane operation which
doesn't disturb lane values. Someday I'd like to see bit-and-byte shuffle
factories which can pull any or all of these tricks.
— John
More information about the panama-dev
mailing list