RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange()

Wed Feb 1 14:23:02 UTC 2023

On Wed, 18 Jan 2023 08:58:42 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> The Vector API `"indexInRange(int offset, int limit)"` is used
> to compute a vector mask whose lanes are set to true if the
> index of the lane is inside the range specified by the `"offset"`
> and `"limit"` arguments, otherwise the lanes are set to false.
> 
> There are two special cases for this API:
>  1) If `"offset >= 0 && offset >= limit"`, all the lanes of the
> generated mask are false.
>  2) If `"offset >= 0 && limit - offset >= vlength"`, all the
> lanes of the generated mask are true. Note that `"vlength"` is
> the number of vector lanes.
> 
> For such special cases, we can simply use `"maskAll(false|true)"`
> to implement the API. Otherwise, the original comparison with the
> `"iota"` vector is needed. As a further optimization, SVE provides
> a dedicated instruction (i.e. whilelo [1]) that can implement the
> API directly if `"offset >= 0"`.
> 
> In summary, to optimize the API, we can use if-else branches
> to handle the special cases at the Java level and have the C2
> compiler intrinsify the remaining case:
> 
> 
>   public VectorMask<E> indexInRange(int offset, int limit) {
>       if (offset < 0) {
>           return this.and(indexInRange0Helper(offset, limit));
>       } else if (offset >= limit) {
>           return this.and(vectorSpecies().maskAll(false));
>       } else if (limit - offset >= length()) {
>           return this.and(vectorSpecies().maskAll(true));
>       }
>       return this.and(indexInRange0(offset, limit));
>   }
> 
> 
> The last part (i.e. `"indexInRange0"`) in the above implementation
> is expected to be intrinsified by the C2 compiler if the necessary
> IRs are supported. Otherwise, it falls back to the original API
> implementation (i.e. `"indexInRange0Helper"`). As for the
> intrinsification, the compiler generates the `"VectorMaskGen"` IR
> with `"limit - offset"` as the input if the current platform supports
> it. Otherwise, it generates `"VectorLoadConst + VectorMaskCmp"` based
> on `"iota < limit - offset"`.
> 
> For the following Java code which uses `"indexInRange"`:
> 
> 
> static final VectorSpecies<Double> SPECIES =
>                                    DoubleVector.SPECIES_PREFERRED;
> static final int LENGTH = 1027;
> 
> public static double[] da;
> public static double[] db;
> public static double[] dc;
> 
> private static void func() {
>     for (int i = 0; i < LENGTH; i += SPECIES.length()) {
>         var m = SPECIES.indexInRange(i, LENGTH);
>         var av = DoubleVector.fromArray(SPECIES, da, i, m);
>         av.lanewise(VectorOperators.NEG).intoArray(dc, i, m);
>     }
> }
> 
> 
> The core code generated with SVE 256-bit vector size is:
> 
> 
>   ptrue   p2.d                  ; maskAll(true)
>   ...
> LOOP:
>   ...
>   sub     w11, w13, w14         ; limit - offset
>   cmp     w14, w13
>   b.cs    LABEL-1               ; if (offset >= limit) => uncommon-trap
>   cmp     w11, #0x4
>   b.lt    LABEL-2               ; if (limit - offset < vlength)
>   mov     p1.b, p2.b
> LABEL-3:
>   ld1d    {z16.d}, p1/z, [x10]  ; load vector masked
>   ...
>   cmp     w14, w29
>   b.cc    LOOP
>   ...
> LABEL-2:
>   whilelo p1.d, x16, x10        ; VectorMaskGen
>   ...
>   b       LABEL-3
>   ...
> LABEL-1:
>   uncommon-trap
> 
> 
> Please note that if the array size `LENGTH` is aligned with
> the 256-bit vector size (i.e. `LENGTH = 1024`), the "LABEL-2" branch
> will be optimized out by the compiler and becomes another
> uncommon-trap.
> 
> For NEON, the main CFG is the same as above, but the compiler
> intrinsification is different. Here is the code:
> 
> 
>   sub     x10, x10, x12          ; limit - offset
>   scvtf   d16, x10
>   dup     v16.2d, v16.d[0]       ; replicateD
> 
>   mov     x8, #0xd8d0
>   movk    x8, #0x84cb, lsl #16
>   movk    x8, #0xffff, lsl #32
>   ldr     q17, [x8], #0          ; load the "iota" const vector
>   fcmgt   v18.2d, v16.2d, v17.2d ; mask = iota < limit - offset
> 
> 
> Here is the performance data of the newly added benchmark on an ARM
> SVE 256-bit platform:
> 
> 
> Benchmark                               (size)  Before    After   Units
> IndexInRangeBenchmark.byteIndexInRange   1024 11203.697 41404.431 ops/ms
> IndexInRangeBenchmark.byteIndexInRange   1027  2365.920  8747.004 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024  1227.505  6092.194 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027   351.215  1156.683 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1024  1468.876 11032.580 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1027   699.645  2439.671 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1024  2842.187 11903.544 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1027   689.866  2547.424 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1024  1394.135  5902.973 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1027   355.621  1189.458 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1024  5521.468 21578.340 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1027  1264.816  4640.504 ops/ms
> 
> 
> And the performance data with ARM NEON:
> 
> 
> Benchmark                               (size)  Before    After   Units
> IndexInRangeBenchmark.byteIndexInRange   1024  4026.548 15562.880 ops/ms
> IndexInRangeBenchmark.byteIndexInRange   1027   305.314   576.559 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024   289.224  2244.080 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027    39.740    76.499 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1024   675.264  4457.470 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1027    79.918   144.952 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1024   740.139  4014.583 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1027    78.608   147.903 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1024   400.683  2209.551 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1027    41.146    69.599 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1024  1821.736  8153.546 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1027   158.810   243.205 ops/ms
> 
> 
> The performance improves by about `3.5x ~ 7.5x` on the vector-size-aligned
> (size 1024) benchmarks with both NEON and SVE, and by about
> `3.5x/1.8x` on the unaligned (size 1027) benchmarks with
> SVE/NEON respectively. A similar improvement can also be observed on
> x86 platforms.
> 
> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-
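
To make the branch structure in the quoted description easier to check, here is a minimal scalar sketch. It models masks as `boolean[]` and assumes lane i is set iff `0 <= offset + i < limit`. The class and method names (`IndexInRangeModel`, `reference`, `branched`) are hypothetical and are not part of the patch or the Vector API. The last branch stands in for the intrinsified path and uses the `iota < limit - offset` comparison.

```java
import java.util.Arrays;

final class IndexInRangeModel {
    // Assumed reference semantics: lane i is set iff 0 <= offset + i < limit.
    static boolean[] reference(int offset, int limit, int vlength) {
        boolean[] mask = new boolean[vlength];
        for (int i = 0; i < vlength; i++) {
            long index = (long) offset + i;   // long avoids int overflow
            mask[i] = index >= 0 && index < limit;
        }
        return mask;
    }

    // Scalar stand-in for maskAll(bit).
    static boolean[] maskAll(int vlength, boolean bit) {
        boolean[] mask = new boolean[vlength];
        Arrays.fill(mask, bit);
        return mask;
    }

    // Same branch order as the quoted indexInRange, applied to an all-true mask.
    static boolean[] branched(int offset, int limit, int vlength) {
        if (offset < 0) {
            return reference(offset, limit, vlength);   // general fallback path
        } else if (offset >= limit) {
            return maskAll(vlength, false);             // special case 1
        } else if (limit - offset >= vlength) {
            return maskAll(vlength, true);              // special case 2
        }
        // Remaining case: 0 <= offset < limit, so limit - offset cannot overflow.
        int remaining = limit - offset;
        boolean[] mask = new boolean[vlength];
        for (int iota = 0; iota < vlength; iota++) {
            mask[iota] = iota < remaining;              // iota < limit - offset
        }
        return mask;
    }

    public static void main(String[] args) {
        int vlength = 4;
        for (int offset = -6; offset <= 12; offset++) {
            for (int limit = -6; limit <= 12; limit++) {
                if (!Arrays.equals(reference(offset, limit, vlength),
                                   branched(offset, limit, vlength))) {
                    throw new AssertionError("mismatch at " + offset + ", " + limit);
                }
            }
        }
        System.out.println("branched model matches reference");
    }
}
```

For the 1027-element loop in the description, the final iteration corresponds to `branched(1024, 1027, 4)`, which takes the last branch and leaves only the fourth lane unset.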

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java line 236:

> 234:         } else if (offset >= limit) {
> 235:             return vectorSpecies().maskAll(false);
> 236:         } else if (limit - offset >= length()) {

Can you move this else-if check to the top? This is the most general case appearing in the loop, so the two extra uncommon-trap jumps before it for the special cases may penalize it.
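
One thing to watch when hoisting: the all-true test on its own is only valid once the negative-offset and `offset >= limit` cases are excluded. The scalar sketch below is not the patch's code, and all names in it are hypothetical. It compares an unguarded hoisted test with a guarded one, under the assumption that lane i is set iff `0 <= offset + i < limit`.

```java
final class ReorderedIndexInRange {
    // Assumed reference semantics: lane i is set iff 0 <= offset + i < limit.
    static boolean[] reference(int offset, int limit, int vlength) {
        boolean[] mask = new boolean[vlength];
        for (int i = 0; i < vlength; i++) {
            long index = (long) offset + i;   // long avoids int overflow
            mask[i] = index >= 0 && index < limit;
        }
        return mask;
    }

    static boolean isAllTrue(boolean[] mask) {
        for (boolean lane : mask) {
            if (!lane) {
                return false;
            }
        }
        return true;
    }

    // Unguarded hoist: only the "limit - offset >= vlength" test, checked first.
    static boolean naiveAllTrue(int offset, int limit, int vlength) {
        return limit - offset >= vlength;
    }

    // Guarded hoist: excludes negative offsets and the int overflow of
    // limit - offset before trusting the subtraction.
    static boolean guardedAllTrue(int offset, int limit, int vlength) {
        return offset >= 0 && limit > offset && limit - offset >= vlength;
    }

    public static void main(String[] args) {
        // Negative offset: lanes 0 and 1 have adjusted indexes -2 and -1.
        System.out.println(naiveAllTrue(-2, 10, 4)
                + " vs " + isAllTrue(reference(-2, 10, 4)));
        // Overflow: Integer.MIN_VALUE - 1 wraps around to Integer.MAX_VALUE.
        System.out.println(naiveAllTrue(1, Integer.MIN_VALUE, 4)
                + " vs " + isAllTrue(reference(1, Integer.MIN_VALUE, 4)));
        System.out.println(guardedAllTrue(0, 1027, 4)
                + " vs " + isAllTrue(reference(0, 1027, 4)));
    }
}
```

In this model the guarded test agrees with the reference for every input tried, at the cost of two extra comparisons on the hot path, so whether the reordering pays off would need to be measured.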

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java line 239:

> 237:             return this;
> 238:         }
> 239:         return this.and(indexInRange0(offset, limit));

Not related to this patch, but I also see the possibility of the following ideal transformations:
        maskAll(true).allTrue() => true
        maskAll(false).anyTrue() => false
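
As a sanity check of the two identities, here is a small scalar sketch with masks modeled as `boolean[]`. The names are hypothetical stand-ins for the Vector API methods, and the sketch only shows the equalities such a fold would rely on, not the C2 transformation itself.

```java
import java.util.Arrays;

final class MaskFoldModel {
    // Scalar stand-in for maskAll(bit).
    static boolean[] maskAll(int vlength, boolean bit) {
        boolean[] mask = new boolean[vlength];
        Arrays.fill(mask, bit);
        return mask;
    }

    // Scalar stand-in for VectorMask.allTrue().
    static boolean allTrue(boolean[] mask) {
        for (boolean lane : mask) {
            if (!lane) {
                return false;
            }
        }
        return true;
    }

    // Scalar stand-in for VectorMask.anyTrue().
    static boolean anyTrue(boolean[] mask) {
        for (boolean lane : mask) {
            if (lane) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (int vlength = 1; vlength <= 64; vlength *= 2) {
            if (!allTrue(maskAll(vlength, true)) || anyTrue(maskAll(vlength, false))) {
                throw new AssertionError("identity broken for vlength " + vlength);
            }
        }
        System.out.println("both identities hold in the scalar model");
    }
}
```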

-------------

PR: https://git.openjdk.org/jdk/pull/12064