RFR: 8338023: Support two vector selectFrom API [v3]

Wed Aug 21 18:51:25 UTC 2024

On 21 Aug 2024, at 11:30, Paul Sandoz wrote:

> Is it possible for the intrinsic to be responsible for wrapping, if needed? I was looking at [`vpermi2b`](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpermi2b&ig_expand=4917,4982,5004,5010,5014&techs=AVX_512) and AFAICT it implicitly wraps, operating on the lower N bits. Is that correct?

That’s not a bad idea.  But it is also possible (and routine) for the
JIT to take an expression like (i >> (j&31)) down to (i >> j) if the
hardware takes care of the (j&31) inside its >> operation.  I think
that some hardware permutation operations do something similar to >>
in that they simply ignore irrelevant bits in the steering indexes.
(Other operations do exotic things with irrelevant bits, such as
interpreting the sign bit as a command to “force this one to zero”.)
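
To make the shift case concrete, here is a minimal Java sketch.  It
relies only on the language rule that an int shift uses the low five
bits of its count; the class and method names are made up for
illustration.

```java
class ShiftMaskDemo {
    // Explicitly masked shift, as a user or a framework might write it.
    static int maskedShift(int i, int j) {
        return i >> (j & 31);
    }

    // Unmasked shift: Java's >> on an int already uses only the low
    // five bits of the count, so the (j & 31) above is redundant.
    static int plainShift(int i, int j) {
        return i >> j;
    }

    public static void main(String[] args) {
        int[] values = {1, -1, 0x12345678, Integer.MIN_VALUE};
        int[] counts = {0, 1, 31, 32, 37, -1, 1000};
        for (int i : values) {
            for (int j : counts) {
                if (maskedShift(i, j) != plainShift(i, j)) {
                    throw new AssertionError("mismatch for i=" + i + ", j=" + j);
                }
            }
        }
        System.out.println("masked and plain shifts agree");
    }
}
```

Because the two methods agree for every input, a JIT is free to emit
the same machine shift for both, provided the hardware shift masks
its count in the same way.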

If the wrapping operation for steering indexes is just a vpand against
a simple constant, then maybe (maybe!) the JIT can easily drop that
vpand, when the input is passed to a friendly auto-masking instruction,
just like with (i >> (j&31)).
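
The same reasoning can be sketched for steering indexes with a scalar
model.  Everything here is an assumption made for illustration: a
hypothetical 16-lane byte vector, a two-table select that wraps
indexes into [0, 32), and two invented hardware behaviors (one that
ignores the irrelevant bits, one that reads the sign bit as "force
this one to zero").  It is not a description of any real instruction.

```java
import java.util.Arrays;

class SteeringMaskDemo {
    static final int N = 16;                // lanes per vector (assumed)
    static final int WRAP_MASK = 2 * N - 1; // wraps an index into [0, 2N)

    // Hypothetical auto-masking select: only the low bits of each
    // steering index are used; all other bits are ignored.
    static byte[] autoMaskingSelect(byte[] a, byte[] b, byte[] idx) {
        byte[] r = new byte[N];
        for (int k = 0; k < N; k++) {
            int sel = idx[k] & WRAP_MASK;
            r[k] = sel < N ? a[sel] : b[sel - N];
        }
        return r;
    }

    // Hypothetical "exotic" select: a set sign bit zeroes the lane.
    static byte[] zeroingSelect(byte[] a, byte[] b, byte[] idx) {
        byte[] r = new byte[N];
        for (int k = 0; k < N; k++) {
            if (idx[k] < 0) {
                r[k] = 0;
                continue;
            }
            int sel = idx[k] & WRAP_MASK;
            r[k] = sel < N ? a[sel] : b[sel - N];
        }
        return r;
    }

    // Scalar stand-in for a vpand of the steering vector with a constant.
    static byte[] vpand(byte[] idx, int mask) {
        byte[] r = new byte[idx.length];
        for (int k = 0; k < idx.length; k++) {
            r[k] = (byte) (idx[k] & mask);
        }
        return r;
    }

    // Table with nonzero entries base, base+1, ..., base+N-1.
    static byte[] table(int base) {
        byte[] t = new byte[N];
        for (int k = 0; k < N; k++) {
            t[k] = (byte) (base + k);
        }
        return t;
    }

    public static void main(String[] args) {
        byte[] a = table(1);
        byte[] b = table(101);
        byte[] idx = {0, 1, 15, 16, 31, 32, 33, 63, 64, 100, 127, -1, -2, -128, -100, 5};
        byte[] wrapped = vpand(idx, WRAP_MASK);

        boolean autoSame = Arrays.equals(
            autoMaskingSelect(a, b, wrapped), autoMaskingSelect(a, b, idx));
        boolean zeroSame = Arrays.equals(
            zeroingSelect(a, b, wrapped), zeroingSelect(a, b, idx));

        System.out.println("auto-masking: vpand redundant = " + autoSame);
        System.out.println("zeroing:      vpand redundant = " + zeroSame);
    }
}
```

In this model the vpand is redundant for the auto-masking operation
and not for the zeroing one, since a negative index is wrapped to a
live element in one case and forced to zero in the other.  That is
the distinction the JIT has to check before dropping the mask.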

On the other hand, Paul’s idea might be more robust.  It would require
that the permutation intrinsics would apply vpand at the right places,
and omit vpand when possible.

On the other other hand (the first hand) the classic way of doing it
doesn’t introduce vpand inside of intrinsics, which has a routine
advantage:  The vpands introduced outside of the intrinsic can be
user-introduced or framework-introduced or both.  In all cases, the
JIT treats them uniformly and can collapse them together.  Putting
magic fixup instructions inside of intrinsic expansion risks making
them invisible to the routine optimizations of the JIT.  So,
assuming the vpand gets good optimization, putting it outside of
the intrinsic is the most robust option, as long as “good optimization”
includes the >>(j&31) trick for auto-masking instructions.  So the
intrinsic should look for a vpand in its steering input, and pop off
the IR node if the hardware masking is found to produce the same result.
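
As a rough sketch of that last step, here is a toy version of the
"pop off the mask" check.  The node types and the method are
hypothetical and much simpler than real JIT IR; the only rule encoded
is that an AND with a constant is redundant when the constant keeps
every bit the hardware would look at anyway.

```java
class PopMaskSketch {
    // Toy IR, invented for this sketch.
    interface Node {}
    record Input(String name) implements Node {}
    record Const(int value) implements Node {}
    record And(Node in, Node mask) implements Node {}

    // Strip AND-with-constant nodes from a steering input when the
    // hardware's implicit mask makes them redundant:
    //   (x & m) & hw == x & hw   whenever   (m & hw) == hw.
    static Node stripRedundantMask(Node steering, int hardwareMask) {
        Node n = steering;
        while (n instanceof And and
                && and.mask() instanceof Const c
                && (c.value() & hardwareMask) == hardwareMask) {
            n = and.in();
        }
        return n;
    }

    public static void main(String[] args) {
        Node idx = new Input("idx");
        Node wrapped = new And(idx, new Const(31));
        Node narrowed = new And(idx, new Const(15));

        // The mask 31 matches the assumed hardware mask, so it is popped.
        System.out.println(stripRedundantMask(wrapped, 31));
        // The mask 15 clears a bit the hardware would use, so it stays.
        System.out.println(stripRedundantMask(narrowed, 31));
    }
}
```

Since the masks stay outside the intrinsic as ordinary IR nodes,
whatever the user and the framework each added can be folded by the
usual optimizations before a check like this one runs.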