RFR: 8338023: Support two vector selectFrom API

Fri Aug 16 23:38:45 UTC 2024

(Better late than never, although I wish I’d been more explicit
about this on panama-dev.)

I think we should be moving away from throwing exceptions on all
reorder/shuffle/permute vector ops, and moving toward wrapping.
These ops all operate on vectors (small arrays) of vector lane
indexes (small array indexes in a fixed domain, always a power
of two).  The throwing behavior checks an input for bad indexes
and throws a (scalar) exception if there are any at all.  The
wrapping behavior reduces bad indexes to good ones by an unsigned
modulo operation (which is at worst a mask for powers of two).

If I’m right, then new API points should start out with wrap
semantics, not throw semantics.  And old API points should be
migrated ASAP.

There’s no loss of functionality in such a move.  Instead the
defaults are moved around.  Before, throwing was the default
and wrapping was an explicit operation.  After, wrapping would
be the default and throwing would be explicit.  Both wrapping
and throwing checks are available through explicit calls to
VectorShuffle methods checkIndexes and wrapIndexes.

OK, so why is wrapping better than throwing?  And first, why
did we start with throwing as the default?  Well, we chose
throwing as the default to make the vector operations
more Java-like.  Java scalar operations don’t try to reduce
bad array indexes into the array domain; they throw.  Since
a shuffle op is like an array reference, it makes sense to
emulate the checks built into Java array references.

Or it did make sense.  I think there is a technical debt
here which is turning out to be hard to pay off.  The tech
debt is to suppress or hoist or strength-reduce the vector
instructions that perform the check for invalid indexes
(in parallel), then ask “did any of those checks fail?”
(a mask reduction), then do a conditional branch to
failure code.  I think I was over-confident that our
scalar tactics for reducing array range checks would
apply to vectors as well.  On second thought, vectorizing
our key optimization, of loop range splitting (pre/main/post
loops) is kind of a nightmare.

Instead, consider the alternative of wrapping.  First,
you use vpand or the like to mask the indexes down to
the valid range.  Then you run the shuffle/permute
instruction.  That’s it.  There is no scalar query
or branch.  And, there are probably some circumstances
where you can omit the vpand operation:  Perhaps the
hardware already masks the inputs (as with shift
instructions).  Or, perhaps C2 can do bitwise inference
of the vectors and figure out that the vpand is a nop.
(I am agitating for bitwise types in C2; this is a use
case for them.)  In the worst case, the vpand op is
fast and pipelines well.

This is why I think we should switch, ASAP, to masking
instead of throwing, on bad indexes.

I think some of our reports from customers have shown
that the extra checks necessary for throwing on bad
indexes are giving their code surprising slowdowns,
relative to C-based vector code.

Did I miss a point?

— John

On 14 Aug 2024, at 3:43, Jatin Bhateja wrote:

> On Mon, 12 Aug 2024 22:03:44 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:
>
>> The results look promising. I can provide guidance on the specification e.g., we can specify the behavior in terms of rearrange, with the addition of throwing on out of bounds indexes.
>>
>> Regarding the throwing of exceptions, some wider context will help to know where we are heading before we finalize the specification. I believe we are considering changing the default throwing behavior for index out of bounds to wrapping, thereby we can avoid bounds checks. If that is the case we should wait until that is done then update rather than submitting a CSR just yet?
>>
>> I see you created a specific intrinsic, which will avoid the cost of shuffle creation. Should we apply the same approach (in a subsequent PR) to the single argument shuffle? Or perhaps if we manage to optimize shuffles and change the default wrapping we don't require a specific intrinsic and can just use defer to rearrange?
>
> Hi @PaulSandoz ,
> Thanks for your comments. With this new API we intend to enforce stricter specification w.r.t to index values to emit a lean instruction sequence preventing any cycles spent on massaging inputs to a consumable form, thus preventing redundant wrapping and unwrapping operations.
>
> Existing [two vector rearrange API](https://docs.oracle.com/en/java/javase/22/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vector.html#rearrange(jdk.incubator.vector.VectorShuffle,jdk.incubator.vector.Vector)) has a flexible specification which allows wrapping out of bounds shuffle indexes into exceptional index with a -ve value.
>
> Even if we optimize existing two vector rearrange implementation we will still need to emit additional instructions to generate an indexes which lie within two vector range [0, 2*VLEN). I see this as a specialized API like vector compress/expand which cater to targets like x86-AVX512+ and aarch64-SVE which offers direct instruction for two vector lookups.
>
> May be the API nomenclature can be refined to better reflect its semantics i.e. from selectFrom to twoVectorLookup ?
>
> -------------
>
> PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2288062038