[vectorIntrinsics] Issue to VectorAPI "selectFrom" for byte type with Arm SVE 2048-bits

Wed Nov 4 00:11:13 UTC 2020

On Oct 26, 2020, at 4:47 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Given that this.selectFrom(Vector that) is equivalent to that.rearrange(this.toShuffle()) we might be able to remove selectFrom(Vector ), but we likely need to teach C2 to recognize the pattern of the latter and elide the shuffle conversion if there is a more direct instruction.

There are two forms of shuffle selection, for the sake of expressiveness.
In some algorithms, the shuffle is fixed, and in others the data
being shuffled is fixed; in any case the code is often easier to read
when the data-driven part is written in fluent notation, on the left.
That’s why there are two ways to say the same thing, because
the programmer might prefer to emphasize either part of
the operation (shuffle or data) as the principal part.
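
As a rough illustration of the two spellings, here is a scalar model
using plain arrays rather than jdk.incubator.vector.  The class and
method names (ShuffleForms, rearrange, selectFrom) are made up for
this sketch and only mirror the shape of the API calls; they are not
the real implementation.

```java
import java.util.Arrays;

final class ShuffleForms {
    // Scalar model of data.rearrange(shuffle): out[i] = data[shuffle[i]].
    static byte[] rearrange(byte[] data, int[] shuffle) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = data[shuffle[i]];
        }
        return out;
    }

    // Scalar model of indexes.selectFrom(data): the index lanes are the
    // receiver, and they are converted to a shuffle before selecting.
    static byte[] selectFrom(byte[] indexes, byte[] data) {
        int[] shuffle = new int[indexes.length];
        for (int i = 0; i < shuffle.length; i++) {
            shuffle[i] = indexes[i];
        }
        return rearrange(data, shuffle);
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40};
        byte[] indexes = {3, 2, 1, 0};
        // Both lines print the same reversed data.
        System.out.println(Arrays.toString(selectFrom(indexes, data)));
        System.out.println(Arrays.toString(rearrange(data, new int[] {3, 2, 1, 0})));
    }
}
```

Either call yields the same lanes; the only difference is which
operand sits on the left of the dot.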

That said, this problem of narrow lanes not being able to index
(exponentially) long vectors is independent of which side of
the dot the shuffle vector shows up on.  This more fundamental
problem cannot be overcome by a syntax shift.  You sometimes
have to expand the lane width of the shuffle (permutation vector)
beyond the data lane width.  For now, the problem shows up
with byte data lanes indexed by byte shuffle lanes; making the
shuffle lanes unsigned delays the inevitable by a factor of 2.
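
The arithmetic behind that can be sketched as follows.  This is plain
Java with hypothetical helper names (IndexRange, laneCount, fits); the
4096-bit row is included only to show where unsigned bytes would run
out, not because such hardware is assumed to exist.

```java
final class IndexRange {
    // Number of lanes in a vector of the given bit size.
    static int laneCount(int vectorBits, int laneBits) {
        return vectorBits / laneBits;
    }

    // Largest index representable in a signed lane of indexBits bits.
    static int maxSigned(int indexBits) {
        return (1 << (indexBits - 1)) - 1;
    }

    // Largest index representable in an unsigned lane of indexBits bits.
    static int maxUnsigned(int indexBits) {
        return (1 << indexBits) - 1;
    }

    // True if every lane index 0..laneCount-1 is at most maxIndex.
    static boolean fits(int vectorBits, int laneBits, int maxIndex) {
        return laneCount(vectorBits, laneBits) - 1 <= maxIndex;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {1024, 2048, 4096}) {
            System.out.printf(
                "%d-bit, byte lanes: %d lanes, signed byte ok=%b, "
                    + "unsigned byte ok=%b, signed short ok=%b%n",
                bits, laneCount(bits, 8),
                fits(bits, 8, maxSigned(8)),
                fits(bits, 8, maxUnsigned(8)),
                fits(bits, 8, maxSigned(16)));
        }
    }
}
```

With byte data lanes, a 2048-bit vector has 256 lanes, so a signed
byte index (at most 127) falls short, an unsigned byte index (at most
255) just fits, and one more doubling of the vector breaks that too.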

The more permanent fix is to widen the shuffle lanes to the next
size up that is big enough to index the data lanes, which for the
foreseeable future is 16 bits (from 8), and that can be signed until
we get vectors of size 2^16.  So, signed indexes are just fine, and
they are sometimes useful (as explained in the docs) to express
exceptional conditions and/or to select from a second data input.
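
A scalar sketch of what widened, signed shuffle lanes buy us, again
with plain arrays and a made-up class name (WideShuffle).  The
convention used here for negative indexes (index + lane count picks a
lane of the second input) is an assumption chosen for the example, not
a statement of what the API specifies.

```java
final class WideShuffle {
    // Byte data lanes indexed by 16-bit signed shuffle lanes.
    // Assumed convention for this sketch: a non-negative index s selects
    // a[s]; a negative index s selects b[s + n], where n is the lane count.
    static byte[] rearrange(byte[] a, byte[] b, short[] shuffle) {
        int n = a.length;
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int s = shuffle[i];
            out[i] = (s >= 0) ? a[s] : b[s + n];
        }
        return out;
    }

    public static void main(String[] args) {
        int n = 256;                      // byte lanes in a 2048-bit vector
        byte[] a = new byte[n];
        byte[] b = new byte[n];
        short[] shuffle = new short[n];
        for (int i = 0; i < n; i++) {
            a[i] = (byte) i;
            b[i] = 7;
            shuffle[i] = (short) (n - 1 - i);   // reverse; needs indexes up to 255
        }
        shuffle[1] = (short) -n;                // selects b[0] under the assumed rule
        byte[] r = rearrange(a, b, shuffle);
        System.out.println(r[0] + " " + r[1] + " " + r[n - 1]);
    }
}
```

The point is only that a short index covers all 256 lanes with room
to spare, while the sign bit stays free for other uses.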

The most robust way, IMO, to widen shuffle lanes to double size
when necessary is to allow *synthetic* double-size vector
shapes.  For example, allow a 1024-bit shape on AVX-512,
or a 512-bit shape on AVX-2, either one consisting of a pair
of native registers of the largest size.  Even in AVX-512, the
512-bit shape which consists of two 256-bit vectors should
*not* be conflated with the native 512-bit shape, but rather
be its own option.

These multi-vector synthetic shapes amount to user selectable
unrolling of loops, in addition to helping us out of tight spots
when a lane width needs to be doubled.
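
To make the "user selectable unrolling" point concrete, here is a
scalar model in which a synthetic vector is just an array of
native-sized parts.  Everything here (SyntheticShape, addNative, add,
lane) is hypothetical naming for the sketch; no such types exist in
the API today.

```java
import java.util.Arrays;

final class SyntheticShape {
    // Lanewise add on one "native" register, modeled as an int array.
    static int[] addNative(int[] x, int[] y) {
        int[] out = new int[x.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = x[i] + y[i];
        }
        return out;
    }

    // Lanewise add on a synthetic vector made of M native parts.
    // This is the same operation unrolled M times, once per part.
    static int[][] add(int[][] x, int[][] y) {
        int[][] out = new int[x.length][];
        for (int part = 0; part < x.length; part++) {
            out[part] = addNative(x[part], y[part]);
        }
        return out;
    }

    // Read a logical lane: index / n picks the part, index % n the lane.
    static int lane(int[][] v, int index) {
        int n = v[0].length;
        return v[index / n][index % n];
    }

    public static void main(String[] args) {
        int[][] x = {{1, 2}, {3, 4}};       // two parts of two lanes each
        int[][] y = {{10, 20}, {30, 40}};
        int[][] sum = add(x, y);
        System.out.println(Arrays.deepToString(sum));
        System.out.println(lane(sum, 3));
    }
}
```

Lanewise operations decompose trivially across the parts; it is the
cross-lane ones (like shuffles) that need an index wide enough to
name both a part and a lane within it.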

The C bindings for SVE have synthetic vector types for 2x, 3x,
and 4x (IIRC).  I think this would be a fine thing to do in our
Vector API as well.  I suggest names of the form S_{N}_BIT_X{M},
where {M} ranges in 2..4 or 2..5.  (Yes, odd sizes are probably
helpful here.  It’s another use case for synthetics, when you
have 3-tuples interleaved in memory but want to vectorize.)

— John