RFR: 8338023: Support two vector selectFrom API

Jatin Bhateja jbhateja at openjdk.org
Mon Aug 19 06:47:50 UTC 2024


On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> Hi All,
> 
> As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs.
> 
> 
> Declaration:-
>     Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
> 
> 
> Semantics:-
>     Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types.  The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundException is thrown.  
> 
> Summary of changes:
> -  Java side implementation of new selectFrom API.
> -  C2 compiler IR and inline expander changes.
> -  In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms. 
> -  Optimized x86 backend implementation for AVX512 and legacy target.
> -  Function tests covering new API.
> 
> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :-
> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
> 
> 
>   Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762          ops/ms
> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550          ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605          ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004          ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758          ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192          ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459          ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556          ops/ms
> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830          ops/ms
> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174          ops/ms
> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804          ops/ms
> SelectFromBenchmark.selectFromIntVector         2048  thrpt    2   5398.2...

> _Mailing list message from [John Rose](mailto:john.r.rose at oracle.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at mail.openjdk.org):_
> 
> (Better late than never, although I wish I?d been more explicit about this on panama-dev.)
> 
> I think we should be moving away from throwing exceptions on all reorder/shuffle/permute vector ops, and moving toward wrapping. These ops all operate on vectors (small arrays) of vector lane indexes (small array indexes in a fixed domain, always a power of two). The throwing behavior checks an input for bad indexes and throws a (scalar) exception if there are any at all. The wrapping behavior reduces bad indexes to good ones by an unsigned modulo operation (which is at worst a mask for powers of two).
> 
> If I?m right, then new API points should start out with wrap semantics, not throw semantics. And old API points should be migrated ASAP.
> 
> There?s no loss of functionality in such a move. Instead the defaults are moved around. Before, throwing was the default and wrapping was an explicit operation. After, wrapping would be the default and throwing would be explicit. Both wrapping and throwing checks are available through explicit calls to VectorShuffle methods checkIndexes and wrapIndexes.
> 
> OK, so why is wrapping better than throwing? And first, why did we start with throwing as the default? Well, we chose throwing as the default to make the vector operations more Java-like. Java scalar operations don?t try to reduce bad array indexes into the array domain; they throw. Since a shuffle op is like an array reference, it makes sense to emulate the checks built into Java array references.
> 
> Or it did make sense. I think there is a technical debt here which is turning out to be hard to pay off. The tech debt is to suppress or hoist or strength-reduce the vector instructions that perform the check for invalid indexes (in parallel), then ask ?did any of those checks fail?? (a mask reduction), then do a conditional branch to failure code. I think I was over-confident that our scalar tactics for reducing array range checks would apply to vectors as well. On second thought, vectorizing our key optimization, of loop range splitting (pre/main/post loops) is kind of a nightmare.
> 
> Instead, consider the alternative of wrapping. First, you use vpand or the like to mask the indexes down to the valid range. Then you run the shuffle/permute instruction. That?s it. There is no scalar query or branch. And, there are probably some circumstances where you can omit the vpand operation: Perhaps the hardware already masks the inputs (as with shift instructions). Or, perhaps C2 can do bitwise inference of the vectors and figure out that the vpand is a nop. (I am agitating for bitwise types in C2; this is a use case for them.) In the worst case, the vpand op is fast and pipelines well.
> 
> This is why I think we should switch, ASAP, to masking instead of throwing, on bad indexes.
> 
> I think some of our reports from customers have shown that the extra checks necessary for throwing on bad indexes are giving their code surprising slowdowns, relative to C-based vector code.
> 
> Did I miss a point?
> 
> ? John
> 
> On 14 Aug 2024, at 3:43, Jatin Bhateja wrote:


Hi @rose00,

I agree that wrapping should be the default behaviour if indices are passed through shuffles, idea was to pick exception throwing semantics for out of bounds indexes *only* for selectFrom flavour of APIs which accept indexes through vector interface, this will save redundant partial wrapping and un-wrapping for cross vector permutation API which has a direct mappings in x86 and AARCH64 ISA.

As @PaulSandoz [suggested](https://github.com/openjdk/jdk/pull/20508#pullrequestreview-2234095541) we can also tune existing single 'selectFrom' API to adopt default exception throwing semantics if any of the indices lies beyond valid index range.

While we will continue keeping default wrapping semantics for APIs accepting shuffles, this little deviation of semantics for selectFrom family of APIs will enable generating efficient code and will enable users to chooses between the rearrange and selectFrom APIs based on convenience vs efficient code trade-off.

Since, API interfaces were crafted keeping in view long term flexibility, having multiple permutation interfaces (selectFrom / rearrange) accepting indexes though vector or shuffle enables compiler to emit efficient code.

Best Regards,
Jatin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2295785781


More information about the hotspot-compiler-dev mailing list