RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange() [v3]
Xiaohong Gong
xgong at openjdk.org
Tue Feb 7 09:54:45 UTC 2023
On Tue, 7 Feb 2023 09:51:19 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> The Vector API `"indexInRange(int offset, int limit)"` is used
>> to compute a vector mask whose lanes are set to true if the
>> index of the lane is inside the range specified by the `"offset"`
>> and `"limit"` arguments, otherwise the lanes are set to false.
>>
>> There are two special cases for this API:
>> 1) If `"offset >= 0 && offset >= limit"`, all the lanes of the
>> generated mask are false.
>> 2) If `"offset >= 0 && limit - offset >= vlength"`, all the
>> lanes of the generated mask are true. Note that `"vlength"` is
>> the number of vector lanes.
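>>
>> For example, here is a hypothetical illustration (not part of the
>> patch) with a 4-lane int species, assuming the `jdk.incubator.vector`
>> types are imported:
>>
>>
>> var species = IntVector.SPECIES_128;      // 4 int lanes
>> // Special case 1: offset >= limit       -> all lanes false
>> var none = species.indexInRange(8, 8);
>> // Special case 2: limit - offset >= 4   -> all lanes true
>> var all  = species.indexInRange(0, 8);
>> // General case: only lanes 0 and 1 are set (8 - 6 = 2)
>> var tail = species.indexInRange(6, 8);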
>>
>> For such special cases, we can simply use `"maskAll(false|true)"`
>> to implement the API. Otherwise, the original comparison with the
>> `"iota"` vector is needed. For further optimization, SVE provides
>> a dedicated instruction (i.e. whilelo [1]) that can implement the
>> API directly when `"offset >= 0"`.
>>
>> In summary, to optimize the API, we can use if-else branches to
>> handle the special cases at the Java level and let the C2 compiler
>> intrinsify the remaining case:
>>
>>
>> public VectorMask<E> indexInRange(int offset, int limit) {
>>     if (offset < 0) {
>>         return this.and(indexInRange0Helper(offset, limit));
>>     } else if (offset >= limit) {
>>         return this.and(vectorSpecies().maskAll(false));
>>     } else if (limit - offset >= length()) {
>>         return this.and(vectorSpecies().maskAll(true));
>>     }
>>     return this.and(indexInRange0(offset, limit));
>> }
>>
>>
>> The last part (i.e. `"indexInRange0"`) of the above implementation
>> is expected to be intrinsified by the C2 compiler if the necessary
>> IRs are supported. Otherwise, it falls back to the original API
>> implementation (i.e. `"indexInRange0Helper"`). Regarding the
>> intrinsification, the compiler generates the `"VectorMaskGen"` IR
>> with "limit - offset" as the input if the current platform supports
>> it. Otherwise, it generates `"VectorLoadConst + VectorMaskCmp"` based
>> on `"iota < limit - offset"`.
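>>
>> As a rough sketch (not the actual JDK code), the non-intrinsified
>> fallback for the `"offset >= 0"` case conceptually compares the
>> species' "iota" index vector against a broadcast of "limit - offset";
>> the helper name and the int specialization below are assumptions for
>> illustration, assuming the `jdk.incubator.vector` types are imported:
>>
>>
>> static VectorMask<Integer> indexInRangeFallback(VectorSpecies<Integer> species,
>>                                                 int offset, int limit) {
>>     // "iota" holds the per-lane indices [0, 1, ..., vlength - 1].
>>     Vector<Integer> iota = species.iota();
>>     // A lane is in range if its index is below "limit - offset".
>>     return iota.lt(species.broadcast(limit - offset));
>> }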
>>
>> For the following Java code, which uses `"indexInRange"`:
>>
>>
>> static final VectorSpecies<Double> SPECIES =
>>         DoubleVector.SPECIES_PREFERRED;
>> static final int LENGTH = 1027;
>>
>> public static double[] da;
>> public static double[] db;
>> public static double[] dc;
>>
>> private static void func() {
>>     for (int i = 0; i < LENGTH; i += SPECIES.length()) {
>>         var m = SPECIES.indexInRange(i, LENGTH);
>>         var av = DoubleVector.fromArray(SPECIES, da, i, m);
>>         av.lanewise(VectorOperators.NEG).intoArray(dc, i, m);
>>     }
>> }
>>
>>
>> The core code generated with SVE 256-bit vector size is:
>>
>>
>> ptrue p2.d ; maskAll(true)
>> ...
>> LOOP:
>> ...
>> sub w11, w13, w14 ; limit - offset
>> cmp w14, w13
>> b.cs LABEL-1 ; if (offset >= limit) => uncommon-trap
>> cmp w11, #0x4
>> b.lt LABEL-2 ; if (limit - offset < vlength)
>> mov p1.b, p2.b
>> LABEL-3:
>> ld1d {z16.d}, p1/z, [x10] ; load vector masked
>> ...
>> cmp w14, w29
>> b.cc LOOP
>> ...
>> LABEL-2:
>> whilelo p1.d, x16, x10 ; VectorMaskGen
>> ...
>> b LABEL-3
>> ...
>> LABEL-1:
>> uncommon-trap
>>
>>
>> Please note that if the array size `LENGTH` is aligned with
>> the vector size 256 (i.e. `LENGTH = 1024`), the branch "LABEL-2"
>> will be optimized out by the compiler and becomes another
>> uncommon-trap.
>>
>> For NEON, the main CFG is the same as above, but the compiler
>> intrinsification is different. Here is the code:
>>
>>
>> sub x10, x10, x12 ; limit - offset
>> scvtf d16, x10
>> dup v16.2d, v16.d[0] ; replicateD
>>
>> mov x8, #0xd8d0
>> movk x8, #0x84cb, lsl #16
>> movk x8, #0xffff, lsl #32
>> ldr q17, [x8], #0 ; load the "iota" const vector
>> fcmgt v18.2d, v16.2d, v17.2d ; mask = iota < limit - offset
>>
>>
>> Here is the performance data of the newly added benchmark on an ARM
>> SVE 256-bit platform:
>>
>>
>> Benchmark (size) Before After Units
>> IndexInRangeBenchmark.byteIndexInRange 1024 11203.697 41404.431 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 1027 2365.920 8747.004 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1024 1227.505 6092.194 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1027 351.215 1156.683 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1024 1468.876 11032.580 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1027 699.645 2439.671 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1024 2842.187 11903.544 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1027 689.866 2547.424 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1024 1394.135 5902.973 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1027 355.621 1189.458 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1024 5521.468 21578.340 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1027 1264.816 4640.504 ops/ms
>>
>>
>> And the performance data with ARM NEON:
>>
>>
>> Benchmark (size) Before After Units
>> IndexInRangeBenchmark.byteIndexInRange 1024 4026.548 15562.880 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 1027 305.314 576.559 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1024 289.224 2244.080 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1027 39.740 76.499 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1024 675.264 4457.470 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1027 79.918 144.952 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1024 740.139 4014.583 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1027 78.608 147.903 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1024 400.683 2209.551 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1027 41.146 69.599 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1024 1821.736 8153.546 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1027 158.810 243.205 ops/ms
>>
>>
>> The performance improves by about `3.5x ~ 7.5x` on the vector-size-aligned
>> (size 1024) benchmarks with both NEON and SVE, and by about `3.5x/1.8x`
>> on the non-aligned (size 1027) benchmarks with SVE/NEON respectively.
>> A similar improvement can also be observed on x86 platforms.
>>
>> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
> Rename the indexInRange API and simplify the benchmarks
Hi Paul,
I updated the API name as you suggested and simplified the benchmarks by removing the calls to the masked `fromArray()/intoArray()` APIs.
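For reference, here is a rough, hypothetical sketch (not the actual patch) of what such a simplified benchmark kernel could look like once the masked `fromArray()/intoArray()` calls are gone, so that only the mask computation itself is measured; the class and field names below are assumptions for illustration:

import jdk.incubator.vector.*;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class IndexInRangeSketch {
    static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_PREFERRED;

    @Param({"7", "256", "259", "512"})
    int size;

    VectorMask<Integer> inputMask;

    @Setup(Level.Trial)
    public void setup() {
        inputMask = VectorMask.fromLong(ISPECIES, -1L); // all lanes set
    }

    @Benchmark
    public void intIndexInRange(Blackhole bh) {
        for (int i = 0; i < size; i += ISPECIES.length()) {
            // Only the mask computation is exercised; no masked memory access.
            bh.consume(inputMask.indexInRange(i, size));
        }
    }
}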
Here are the benchmark results compared with jdk/master on ARM NEON:
Benchmark (size) Mode Cnt Before After Units
IndexInRangeBenchmark.byteIndexInRange 7 thrpt 5 164957.447 188954.757 ops/ms
IndexInRangeBenchmark.byteIndexInRange 256 thrpt 5 28373.131 60895.091 ops/ms
IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 28290.365 55573.807 ops/ms
IndexInRangeBenchmark.byteIndexInRange 512 thrpt 5 15695.618 49147.370 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 7 thrpt 5 58926.711 87837.117 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 256 thrpt 5 2558.505 17795.100 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 2521.995 5309.487 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 512 thrpt 5 1289.556 8882.959 ops/ms
IndexInRangeBenchmark.floatIndexInRange 7 thrpt 5 113429.518 114530.506 ops/ms
IndexInRangeBenchmark.floatIndexInRange 256 thrpt 5 5681.129 31686.156 ops/ms
IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 5614.762 13659.272 ops/ms
IndexInRangeBenchmark.floatIndexInRange 512 thrpt 5 2897.391 17796.357 ops/ms
IndexInRangeBenchmark.intIndexInRange 7 thrpt 5 50990.391 125139.575 ops/ms
IndexInRangeBenchmark.intIndexInRange 256 thrpt 5 8444.632 31090.867 ops/ms
IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 8349.075 20258.705 ops/ms
IndexInRangeBenchmark.intIndexInRange 512 thrpt 5 4525.218 17555.370 ops/ms
IndexInRangeBenchmark.longIndexInRange 7 thrpt 5 77003.438 89592.650 ops/ms
IndexInRangeBenchmark.longIndexInRange 256 thrpt 5 3669.537 17455.742 ops/ms
IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 3672.086 11150.989 ops/ms
IndexInRangeBenchmark.longIndexInRange 512 thrpt 5 1883.831 8832.311 ops/ms
IndexInRangeBenchmark.shortIndexInRange 7 thrpt 5 159881.634 185593.426 ops/ms
IndexInRangeBenchmark.shortIndexInRange 256 thrpt 5 16762.736 50486.836 ops/ms
IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 16490.397 35110.418 ops/ms
IndexInRangeBenchmark.shortIndexInRange 512 thrpt 5 8815.322 31113.907 ops/ms
And the results with a 512-bit SVE vector size:
Benchmark (size) Mode Cnt Before After Units
IndexInRangeBenchmark.byteIndexInRange 7 thrpt 5 48977.004 62712.3874 ops/ms
IndexInRangeBenchmark.byteIndexInRange 256 thrpt 5 28005.444 36067.6281 ops/ms
IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 26833.661 33337.5660 ops/ms
IndexInRangeBenchmark.byteIndexInRange 512 thrpt 5 18621.850 26251.4372 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 7 thrpt 5 31556.967 63184.8951 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 256 thrpt 5 4394.624 22536.9730 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 4390.727 13714.7822 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 512 thrpt 5 2358.633 15654.2022 ops/ms
IndexInRangeBenchmark.floatIndexInRange 7 thrpt 5 31507.582 62985.8334 ops/ms
IndexInRangeBenchmark.floatIndexInRange 256 thrpt 5 7873.270 25331.0291 ops/ms
IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 7733.960 22011.2921 ops/ms
IndexInRangeBenchmark.floatIndexInRange 512 thrpt 5 4392.090 21542.3555 ops/ms
IndexInRangeBenchmark.intIndexInRange 7 thrpt 5 55291.415 62846.4699 ops/ms
IndexInRangeBenchmark.intIndexInRange 256 thrpt 5 12580.224 25637.0236 ops/ms
IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 12815.614 23283.9921 ops/ms
IndexInRangeBenchmark.intIndexInRange 512 thrpt 5 7737.667 21611.9642 ops/ms
IndexInRangeBenchmark.longIndexInRange 7 thrpt 5 46632.264 63072.6243 ops/ms
IndexInRangeBenchmark.longIndexInRange 256 thrpt 5 6664.042 22541.1474 ops/ms
IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 6294.857 16994.0206 ops/ms
IndexInRangeBenchmark.longIndexInRange 512 thrpt 5 3446.688 15689.5675 ops/ms
IndexInRangeBenchmark.shortIndexInRange 7 thrpt 5 43243.398 63971.3060 ops/ms
IndexInRangeBenchmark.shortIndexInRange 256 thrpt 5 17997.651 27081.8088 ops/ms
IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 16572.132 30804.5928 ops/ms
IndexInRangeBenchmark.shortIndexInRange 512 thrpt 5 10211.183 21771.9652 ops/ms
Similar gains can also be observed on different x86 systems. From the results, we can see that there is not much of a performance gap between the 256 and 259 array sizes.
-------------
PR: https://git.openjdk.org/jdk/pull/12064