RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange()
Jatin Bhateja
jbhateja at openjdk.org
Wed Feb 1 14:23:02 UTC 2023
On Wed, 18 Jan 2023 08:58:42 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
> The Vector API `"indexInRange(int offset, int limit)"` is used
> to compute a vector mask whose lanes are set to true if the
> index of the lane is inside the range specified by the `"offset"`
> and `"limit"` arguments, otherwise the lanes are set to false.
>
> There are two special cases for this API:
> 1) If `"offset >= 0 && offset >= limit"`, all the lanes of the
> generated mask are false.
> 2) If `"offset >= 0 && limit - offset >= vlength"`, all the
> lanes of the generated mask are true. Note that `"vlength"` is
> the number of vector lanes.
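>
> For example, assuming an illustrative species `SPECIES` with 4 lanes
> (vlength = 4), these cases resolve roughly as follows (a sketch of the
> lane values, not output from any tool):
>
>
> SPECIES.indexInRange(8, 4);  // offset >= limit           -> [F, F, F, F], i.e. maskAll(false)
> SPECIES.indexInRange(0, 16); // limit - offset >= vlength -> [T, T, T, T], i.e. maskAll(true)
> SPECIES.indexInRange(4, 7);  // remaining case            -> [T, T, T, F]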
>
> For such special cases, we can simply use `"maskAll(false|true)"`
> to implement the API. Otherwise, the original comparison with the
> `"iota"` vector is needed. As a further optimization, SVE provides
> a dedicated instruction (i.e. whilelo [1]) that can implement the
> API directly when `"offset >= 0"`.
>
> In summary, to optimize the API, we can use if-else branches to
> handle the special cases at the Java level and intrinsify the
> remaining case in the C2 compiler:
>
>
> public VectorMask<E> indexInRange(int offset, int limit) {
>     if (offset < 0) {
>         return this.and(indexInRange0Helper(offset, limit));
>     } else if (offset >= limit) {
>         return this.and(vectorSpecies().maskAll(false));
>     } else if (limit - offset >= length()) {
>         return this.and(vectorSpecies().maskAll(true));
>     }
>     return this.and(indexInRange0(offset, limit));
> }
>
>
> The last part (i.e. `"indexInRange0"`) of the above implementation
> is expected to be intrinsified by the C2 compiler if the necessary
> IRs are supported. Otherwise, it falls back to the original API
> implementation (i.e. `"indexInRange0Helper"`). Regarding the
> intrinsification, the compiler generates a `"VectorMaskGen"` IR
> with "limit - offset" as the input if the current platform supports
> it. Otherwise, it generates `"VectorLoadConst + VectorMaskCmp"` based
> on `"iota < limit - offset"`.
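>
> For illustration, here is a rough public-API sketch of the comparison
> that the non-intrinsified path conceptually performs (the real helper
> is an internal method; the species and method name below are just
> assumptions for the sketch):
>
>
> import jdk.incubator.vector.*;
>
> static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_PREFERRED;
>
> // Only reached for the remaining case: offset >= 0 && limit - offset < vlength.
> static VectorMask<Integer> indexInRangeFallbackSketch(int offset, int limit) {
>     // iota = [0, 1, 2, ..., vlength - 1]
>     IntVector iota =
>             (IntVector) VectorShuffle.iota(ISPECIES, 0, 1, false).toVector();
>     // Lane N is in range iff N < limit - offset.
>     return iota.compare(VectorOperators.LT, limit - offset);
> }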
>
> For the following Java code, which uses `"indexInRange"`:
>
>
> static final VectorSpecies<Double> SPECIES =
>         DoubleVector.SPECIES_PREFERRED;
> static final int LENGTH = 1027;
>
> public static double[] da;
> public static double[] db;
> public static double[] dc;
>
> private static void func() {
>     for (int i = 0; i < LENGTH; i += SPECIES.length()) {
>         var m = SPECIES.indexInRange(i, LENGTH);
>         var av = DoubleVector.fromArray(SPECIES, da, i, m);
>         av.lanewise(VectorOperators.NEG).intoArray(dc, i, m);
>     }
> }
>
>
> The core code generated with SVE 256-bit vector size is:
>
>
> ptrue p2.d ; maskAll(true)
> ...
> LOOP:
> ...
> sub w11, w13, w14 ; limit - offset
> cmp w14, w13
> b.cs LABEL-1 ; if (offset >= limit) => uncommon-trap
> cmp w11, #0x4
> b.lt LABEL-2 ; if (limit - offset < vlength)
> mov p1.b, p2.b
> LABEL-3:
> ld1d {z16.d}, p1/z, [x10] ; load vector masked
> ...
> cmp w14, w29
> b.cc LOOP
> ...
> LABEL-2:
> whilelo p1.d, x16, x10 ; VectorMaskGen
> ...
> b LABEL-3
> ...
> LABEL-1:
> uncommon-trap
>
>
> Please note that if the array size `LENGTH` is aligned with
> the vector size 256 (i.e. `LENGTH = 1024`), the branch "LABEL-2"
> will be optimized out by the compiler and becomes another
> uncommon-trap.
>
> For NEON, the main CFG is the same as above, but the compiler
> intrinsification is different. Here is the code:
>
>
> sub x10, x10, x12 ; limit - offset
> scvtf d16, x10
> dup v16.2d, v16.d[0] ; replicateD
>
> mov x8, #0xd8d0
> movk x8, #0x84cb, lsl #16
> movk x8, #0xffff, lsl #32
> ldr q17, [x8], #0 ; load the "iota" const vector
> fcmgt v18.2d, v16.2d, v17.2d ; mask = iota < limit - offset
>
>
> Here is the performance data for the newly added benchmark on an ARM
> SVE 256-bit platform:
>
>
> Benchmark (size) Before After Units
> IndexInRangeBenchmark.byteIndexInRange 1024 11203.697 41404.431 ops/ms
> IndexInRangeBenchmark.byteIndexInRange 1027 2365.920 8747.004 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024 1227.505 6092.194 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027 351.215 1156.683 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 1024 1468.876 11032.580 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 1027 699.645 2439.671 ops/ms
> IndexInRangeBenchmark.intIndexInRange 1024 2842.187 11903.544 ops/ms
> IndexInRangeBenchmark.intIndexInRange 1027 689.866 2547.424 ops/ms
> IndexInRangeBenchmark.longIndexInRange 1024 1394.135 5902.973 ops/ms
> IndexInRangeBenchmark.longIndexInRange 1027 355.621 1189.458 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 1024 5521.468 21578.340 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 1027 1264.816 4640.504 ops/ms
>
>
> And the performance data with ARM NEON:
>
>
> Benchmark (size) Before After Units
> IndexInRangeBenchmark.byteIndexInRange 1024 4026.548 15562.880 ops/ms
> IndexInRangeBenchmark.byteIndexInRange 1027 305.314 576.559 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024 289.224 2244.080 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027 39.740 76.499 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 1024 675.264 4457.470 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 1027 79.918 144.952 ops/ms
> IndexInRangeBenchmark.intIndexInRange 1024 740.139 4014.583 ops/ms
> IndexInRangeBenchmark.intIndexInRange 1027 78.608 147.903 ops/ms
> IndexInRangeBenchmark.longIndexInRange 1024 400.683 2209.551 ops/ms
> IndexInRangeBenchmark.longIndexInRange 1027 41.146 69.599 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 1024 1821.736 8153.546 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 1027 158.810 243.205 ops/ms
>
>
> The performance improves by about `3.5x ~ 7.5x` on the vector-size-aligned
> (size 1024) benchmarks with both NEON and SVE, and by about `3.5x/1.8x` on
> the vector-size-unaligned (size 1027) benchmarks with SVE/NEON respectively.
> We can also observe similar improvements on the x86 platforms.
>
> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java line 236:
> 234: } else if (offset >= limit) {
> 235: return vectorSpecies().maskAll(false);
> 236: } else if (limit - offset >= length()) {
Can you move this else-if check to the top? This is the most common case appearing in a loop, and hence the two extra uncommon-trap branches before it for the special cases may penalize it.
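
For illustration only, the suggested reordering might look roughly like
this (a sketch based on the quoted code, not a concrete patch; the extra
"offset < limit" guard is just to keep "limit - offset" free of overflow
concerns for extreme arguments):

    public VectorMask<E> indexInRange(int offset, int limit) {
        // Hoist the most common in-loop case: the whole vector is in range.
        if (offset >= 0 && offset < limit && limit - offset >= length()) {
            return this;
        } else if (offset < 0) {
            return this.and(indexInRange0Helper(offset, limit));
        } else if (offset >= limit) {
            return vectorSpecies().maskAll(false);
        }
        return this.and(indexInRange0(offset, limit));
    }
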
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java line 239:
> 237: return this;
> 238: }
> 239: return this.and(indexInRange0(offset, limit));
Not related to this patch, but I also see a possibility for the
following ideal transformations:

maskAll(true).allTrue()  => true
maskAll(false).anyTrue() => false
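
For example, source patterns like the following could then fold to
constants at the IR level (`SPECIES` here is just any vector species,
for illustration):

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    boolean alwaysTrue  = SPECIES.maskAll(true).allTrue();   // foldable to true
    boolean alwaysFalse = SPECIES.maskAll(false).anyTrue();  // foldable to false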
-------------
PR: https://git.openjdk.org/jdk/pull/12064