RFR: 8338023: Support two vector selectFrom API [v3]

Jatin Bhateja jbhateja at openjdk.org
Wed Aug 21 16:52:06 UTC 2024


On Wed, 21 Aug 2024 16:42:44 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API.
>> 
>> 
>> Declaration:-
>>     Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
>> 
>> 
>> Semantics:-
>>     Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN); otherwise an IndexOutOfBoundsException is thrown.
>> 
>> Summary of changes:
>> -  Java side implementation of new selectFrom API.
>> -  C2 compiler IR and inline expander changes.
>> -  In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform.
>> -  Optimized x86 backend implementation for AVX512 and legacy targets.
>> -  Functional tests covering the new API.
>> 
>> The JMH microbenchmark included with this patch shows a gain of around 10-15x over the existing rearrange API:-
>> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
>> 
>> 
>>   Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762          ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550          ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605          ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004          ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758          ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192          ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459          ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556          ops/ms
>> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830          ops/ms
>> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174          ops/ms
>> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804          ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Pass explicit wrap argument to selectFrom API with default value set to true.

Hi @rose00, @sviswa7, @PaulSandoz,
As suggested, I am now passing an explicit 'wrap' argument to the new selectFrom API.
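
For readers following the thread, here is a minimal scalar sketch of the two-vector selection semantics described in the PR summary, with an explicit wrap flag. It is not the Vector API implementation: the class name, the method signature, and the int[] representation are hypothetical. It also assumes that wrap=true reduces each index into [0, 2*VLEN), which this message does not spell out.

```java
import java.util.Arrays;

public class SelectFromSketch {
    // Hypothetical scalar model: v1 followed by v2 forms a table of
    // 2*VLEN entries, and each index lane selects one entry from it.
    public static int[] selectFrom(int[] index, int[] v1, int[] v2, boolean wrap) {
        int vlen = v1.length;
        if (index.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            if (wrap) {
                // Assumption: wrapping reduces the index into [0, 2*VLEN).
                idx = Math.floorMod(idx, 2 * vlen);
            } else if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("lane " + i + ": index " + idx);
            }
            // Indices below VLEN select from v1, the rest select from v2.
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] index = {0, 5, 3, 9};
        // With wrapping, index 9 becomes 9 mod 8 = 1 and selects v1[1].
        System.out.println(Arrays.toString(selectFrom(index, v1, v2, true)));
        // prints [10, 21, 13, 11]
    }
}
```

With wrap set to false, the same call would throw for the out-of-range index 9, matching the bounds check described in the original PR summary.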

The following are the performance numbers for the modified JMH microbenchmark included with the patch.



Baseline:-
Benchmark                                      (size)   Mode  Cnt     Score   Error   Units
SelectFromBenchmark.rearrangeFromByteVector     4096  thrpt    2  5849.771          ops/ms
SelectFromBenchmark.rearrangeFromDoubleVector   4096  thrpt    2   430.712          ops/ms
SelectFromBenchmark.rearrangeFromFloatVector    4096  thrpt    2   942.737          ops/ms
SelectFromBenchmark.rearrangeFromIntVector      4096  thrpt    2  1057.695          ops/ms
SelectFromBenchmark.rearrangeFromLongVector     4096  thrpt    2   616.360          ops/ms
SelectFromBenchmark.rearrangeFromShortVector    4096  thrpt    2  2146.465          ops/ms

With Patch:-
Benchmark                                      (size)   Mode  Cnt     Score   Error   Units
SelectFromBenchmark.selectFromByteVector        4096  thrpt    2  9543.775          ops/ms
SelectFromBenchmark.selectFromDoubleVector      4096  thrpt    2   558.195          ops/ms
SelectFromBenchmark.selectFromFloatVector       4096  thrpt    2  1325.059          ops/ms
SelectFromBenchmark.selectFromIntVector         4096  thrpt    2  1418.748          ops/ms
SelectFromBenchmark.selectFromLongVector        4096  thrpt    2   687.231          ops/ms
SelectFromBenchmark.selectFromShortVector       4096  thrpt    2  4782.395          ops/ms


With WIP wrap index acceleration PR#20634:
Benchmark                                      (size)   Mode  Cnt     Score   Error   Units
SelectFromBenchmark.rearrangeFromByteVector     4096  thrpt    2  7602.645          ops/ms
SelectFromBenchmark.rearrangeFromDoubleVector   4096  thrpt    2   441.684          ops/ms
SelectFromBenchmark.rearrangeFromFloatVector    4096  thrpt    2   926.112          ops/ms
SelectFromBenchmark.rearrangeFromIntVector      4096  thrpt    2  1061.695          ops/ms
SelectFromBenchmark.rearrangeFromLongVector     4096  thrpt    2   644.058          ops/ms
SelectFromBenchmark.rearrangeFromShortVector    4096  thrpt    2  2777.735          ops/ms

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2302541724


More information about the core-libs-dev mailing list