RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange()

Fri Feb 3 01:47:52 UTC 2023

On Thu, 2 Feb 2023 18:07:59 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi, I tried a modified version that uses the old implementation for the tail loop instead of adding the new intrinsics. The code looks like:
>> 
>> public VectorMask<E> indexInRange(int offset, int limit) {
>>     int vlength = length();
>>     if (offset >= 0 && limit - offset >= vlength) {
>>         // Every lane index is in range, so the mask is unchanged.
>>         return this;
>>     } else if (offset >= limit) {
>>         // No lane index is in range.
>>         return vectorSpecies().maskAll(false);
>>     }
>> 
>>     Vector<E> iota = vectorSpecies().zero().addIndex(1);
>>     VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
>>     return this.andNot(badMask);
>> }
>> 
>> I then ran the newly added benchmarks with different vector sizes on NEON/SVE and x86 avx2/avx512 architectures. The results show that the modified version is no better than the current version when the array size is not a multiple of the vector size, and that it is noticeably slower in some cases, especially for the double/long types with larger sizes.
>> 
>> Here are some raw data with NEON:
>> 
>> Benchmark                                 (size)   Mode  Cnt  current   modified   Units
>> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  2654.919  2584.423   ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  1830.364  1802.876   ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  1058.548  1073.742   ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   594.920   593.832   ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   308.678   149.279   ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   160.639    74.579   ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  1097.567  1104.008   ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   617.845   606.886   ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   315.978   152.046   ops/ms
>> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  1165.279  1205.486   ops/ms
>> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5   633.648   631.672   ops/ms
>> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   315.370   154.144   ops/ms
>> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   639.840   633.623   ops/ms
>> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   312.267   152.788   ops/ms
>> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   163.028    78.150   ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  1834.318  1800.318   ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  1105.695  1094.347   ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   602.442   599.827   ops/ms
>> 
>> 
>> And the data with SVE 256-bit vector size:
>> 
>> Benchmark                                 (size)   Mode  Cnt   current   modified Units
>> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  23772.370 22921.113 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  18930.388 17920.910 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  13528.610 13282.504 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   7850.522  7975.720 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   4281.749  4373.926 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   2160.001   604.458 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  13594.943 13306.904 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   8163.134  7912.343 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   4335.529  4198.555 ops/ms
>> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  22106.880 20348.266 ops/ms
>> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5  11711.588 10958.299 ops/ms
>> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   5501.034  5358.441 ops/ms
>> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   9832.578  9829.398 ops/ms
>> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   4979.326  4947.166 ops/ms
>> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   2269.131   614.204 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  19865.866 19297.628 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  14005.214 13592.407 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   8766.450  8531.675 ops/ms
>> 
>> In conclusion, I prefer to keep the current version. WDYT?
>
> Yes, I agree. I collected performance data with your benchmarks at various AVX levels and see significant gains. The performance variation between the integer and floating-point cases is due to the limitation described here:
> https://github.com/openjdk/jdk/pull/12064#discussion_r1094101761
> That can be addressed in a separate PR.
> 
> FTR, here are the performance numbers at UseAVX=3:
> 
> 
> Benchmark                                 (size)   Mode  Cnt      Score   Error   Units
> IndexInRangeBenchmark.byteIndexInRange      1024  thrpt    2  74983.406          ops/ms
> IndexInRangeBenchmark.byteIndexInRange      1027  thrpt    2  19156.962          ops/ms
> IndexInRangeBenchmark.doubleIndexInRange    1024  thrpt    2  11368.179          ops/ms
> IndexInRangeBenchmark.doubleIndexInRange    1027  thrpt    2   2165.207          ops/ms
> IndexInRangeBenchmark.floatIndexInRange     1024  thrpt    2  18736.787          ops/ms
> IndexInRangeBenchmark.floatIndexInRange     1027  thrpt    2   3798.996          ops/ms
> IndexInRangeBenchmark.intIndexInRange       1024  thrpt    2  18797.863          ops/ms
> IndexInRangeBenchmark.intIndexInRange       1027  thrpt    2   5455.317          ops/ms
> IndexInRangeBenchmark.longIndexInRange      1024  thrpt    2  11866.493          ops/ms
> IndexInRangeBenchmark.longIndexInRange      1027  thrpt    2   2227.896          ops/ms
> IndexInRangeBenchmark.shortIndexInRange     1024  thrpt    2  46921.520          ops/ms
> IndexInRangeBenchmark.shortIndexInRange     1027  thrpt    2   8532.394          ops/ms

Thanks for the performance testing on x86 systems! I agree that a separate PR is fine, and I will address it once this PR is merged.
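
For readers following the thread, the lane semantics that both versions discussed above have to preserve can be modelled in scalar Java. This is only a minimal sketch, not the JDK implementation: the class IndexInRangeModel and its methods are hypothetical names, a boolean[] stands in for VectorMask, and long arithmetic is used for the range checks as an assumption of this sketch to sidestep int overflow.

```java
import java.util.Arrays;

// Hypothetical scalar model of VectorMask.indexInRange(offset, limit).
// A boolean[] plays the role of the mask; this is not the JDK code.
final class IndexInRangeModel {

    // Lane i stays set only if it was set and 0 <= offset + i < limit.
    static boolean[] indexInRange(boolean[] mask, int offset, int limit) {
        int vlength = mask.length;
        // Fast path: every lane index is in range, so the mask is unchanged.
        // (long arithmetic is an assumption of this sketch, to avoid overflow.)
        if (offset >= 0 && (long) limit - offset >= vlength) {
            return Arrays.copyOf(mask, vlength);
        }
        boolean[] result = new boolean[vlength];
        for (int i = 0; i < vlength; i++) {
            long index = (long) offset + i;
            result[i] = mask[i] && index >= 0 && index < limit;
        }
        return result;
    }

    // Number of elements left for the tail loop when the array size
    // is not a multiple of the lane count.
    static int tailLength(int size, int vlength) {
        return size % vlength;
    }

    public static void main(String[] args) {
        boolean[] allSet = {true, true, true, true};
        // Array size 131 with 4 lanes: the tail starts at offset 128.
        System.out.println(tailLength(131, 4));
        // prints 3
        System.out.println(Arrays.toString(indexInRange(allSet, 128, 131)));
        // prints [true, true, true, false]
    }
}
```

The benchmark sizes 131, 259, 515 and 1027 each leave a 3-element tail for any power-of-two lane count up to 64, which is why they exercise the slow path, while 1024 stays on the fast path.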

-------------

PR: https://git.openjdk.org/jdk/pull/12064