RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange() [v3]
Xiaohong Gong
xgong at openjdk.org
Tue Feb 7 09:54:45 UTC 2023
On Tue, 7 Feb 2023 09:51:19 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> The Vector API `"indexInRange(int offset, int limit)"` is used
>> to compute a vector mask whose lanes are set to true if the
>> index of the lane is inside the range specified by the `"offset"`
>> and `"limit"` arguments, otherwise the lanes are set to false.
>>
>> There are two special cases for this API:
>> 1) If `"offset >= 0 && offset >= limit"`, all the lanes of the
>> generated mask are false.
>> 2) If `"offset >= 0 && limit - offset >= vlength"`, all the
>> lanes of the generated mask are true. Note that `"vlength"` is
>> the number of vector lanes.
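>>
>> For example, here is a hypothetical illustration (not part of the
>> patch) with a 4-lane int species, assuming the `jdk.incubator.vector`
>> types are imported:
>>
>>
>> var species = IntVector.SPECIES_128;      // 4 int lanes
>> // Special case 1: offset >= limit       -> all lanes false
>> var none = species.indexInRange(8, 8);
>> // Special case 2: limit - offset >= 4   -> all lanes true
>> var all  = species.indexInRange(0, 8);
>> // General case: only lanes 0 and 1 are set (8 - 6 = 2)
>> var tail = species.indexInRange(6, 8);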
>>
>> For such special cases, we can simply use `"maskAll(false|true)"`
>> to implement the API. Otherwise, the original comparison with the
>> `"iota"` vector is needed. For further optimization, SVE provides
>> a dedicated instruction (i.e. whilelo [1]) that can implement the
>> API directly when `"offset >= 0"`.
>>
>> In summary, to optimize the API, we can use if-else branches to
>> handle the special cases at the Java level and let the C2 compiler
>> intrinsify the remaining case:
>>
>>
>> public VectorMask<E> indexInRange(int offset, int limit) {
>>     if (offset < 0) {
>>         return this.and(indexInRange0Helper(offset, limit));
>>     } else if (offset >= limit) {
>>         return this.and(vectorSpecies().maskAll(false));
>>     } else if (limit - offset >= length()) {
>>         return this.and(vectorSpecies().maskAll(true));
>>     }
>>     return this.and(indexInRange0(offset, limit));
>> }
>>
>>
>> The last part (i.e. `"indexInRange0"`) of the above implementation
>> is expected to be intrinsified by the C2 compiler if the necessary
>> IRs are supported. Otherwise, it falls back to the original API
>> implementation (i.e. `"indexInRange0Helper"`). Regarding the
>> intrinsification, the compiler generates the `"VectorMaskGen"` IR
>> with "limit - offset" as the input if the current platform supports
>> it. Otherwise, it generates `"VectorLoadConst + VectorMaskCmp"` based
>> on `"iota < limit - offset"`.
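>>
>> As a rough sketch (not the actual JDK code), the non-intrinsified
>> fallback for the `"offset >= 0"` case conceptually compares the
>> species' "iota" index vector against a broadcast of "limit - offset";
>> the helper name and the int specialization below are assumptions for
>> illustration, assuming the `jdk.incubator.vector` types are imported:
>>
>>
>> static VectorMask<Integer> indexInRangeFallback(VectorSpecies<Integer> species,
>>                                                 int offset, int limit) {
>>     // "iota" holds the per-lane indices [0, 1, ..., vlength - 1].
>>     Vector<Integer> iota = species.iota();
>>     // A lane is in range if its index is below "limit - offset".
>>     return iota.lt(species.broadcast(limit - offset));
>> }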
>>
>> For the following Java code, which uses `"indexInRange"`:
>>
>>
>> static final VectorSpecies<Double> SPECIES =
>>         DoubleVector.SPECIES_PREFERRED;
>> static final int LENGTH = 1027;
>>
>> public static double[] da;
>> public static double[] db;
>> public static double[] dc;
>>
>> private static void func() {
>>     for (int i = 0; i < LENGTH; i += SPECIES.length()) {
>>         var m = SPECIES.indexInRange(i, LENGTH);
>>         var av = DoubleVector.fromArray(SPECIES, da, i, m);
>>         av.lanewise(VectorOperators.NEG).intoArray(dc, i, m);
>>     }
>> }
>>
>>
>> The core code generated with SVE 256-bit vector size is:
>>
>>
>> ptrue p2.d ; maskAll(true)
>> ...
>> LOOP:
>> ...
>> sub w11, w13, w14 ; limit - offset
>> cmp w14, w13
>> b.cs LABEL-1 ; if (offset >= limit) => uncommon-trap
>> cmp w11, #0x4
>> b.lt LABEL-2 ; if (limit - offset < vlength)
>> mov p1.b, p2.b
>> LABEL-3:
>> ld1d {z16.d}, p1/z, [x10] ; load vector masked
>> ...
>> cmp w14, w29
>> b.cc LOOP
>> ...
>> LABEL-2:
>> whilelo p1.d, x16, x10 ; VectorMaskGen
>> ...
>> b LABEL-3
>> ...
>> LABEL-1:
>> uncommon-trap
>>
>>
>> Please note that if the array size `LENGTH` is aligned with
>> the vector size 256 (i.e. `LENGTH = 1024`), the branch "LABEL-2"
>> will be optimized out by the compiler and becomes another
>> uncommon-trap.
>>
>> For NEON, the main CFG is the same as above, but the compiler
>> intrinsification is different. Here is the code:
>>
>>
>> sub x10, x10, x12 ; limit - offset
>> scvtf d16, x10
>> dup v16.2d, v16.d[0] ; replicateD
>>
>> mov x8, #0xd8d0
>> movk x8, #0x84cb, lsl #16
>> movk x8, #0xffff, lsl #32
>> ldr q17, [x8], #0 ; load the "iota" const vector
>> fcmgt v18.2d, v16.2d, v17.2d ; mask = iota < limit - offset
>>
>>
>> Here is the performance data of the newly added benchmark on an ARM
>> SVE 256-bit platform:
>>
>>
>> Benchmark (size) Before After Units
>> IndexInRangeBenchmark.byteIndexInRange 1024 11203.697 41404.431 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 1027 2365.920 8747.004 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1024 1227.505 6092.194 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1027 351.215 1156.683 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1024 1468.876 11032.580 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1027 699.645 2439.671 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1024 2842.187 11903.544 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1027 689.866 2547.424 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1024 1394.135 5902.973 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1027 355.621 1189.458 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1024 5521.468 21578.340 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1027 1264.816 4640.504 ops/ms
>>
>>
>> And the performance data with ARM NEON:
>>
>>
>> Benchmark (size) Before After Units
>> IndexInRangeBenchmark.byteIndexInRange 1024 4026.548 15562.880 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 1027 305.314 576.559 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1024 289.224 2244.080 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 1027 39.740 76.499 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1024 675.264 4457.470 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 1027 79.918 144.952 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1024 740.139 4014.583 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 1027 78.608 147.903 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1024 400.683 2209.551 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 1027 41.146 69.599 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1024 1821.736 8153.546 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 1027 158.810 243.205 ops/ms
>>
>>
>> The performance improves by about `3.5x ~ 7.5x` on the vector-size-aligned
>> (size 1024) benchmarks with both NEON and SVE, and by about `3.5x/1.8x`
>> on the non-aligned (size 1027) benchmarks with SVE/NEON respectively.
>> A similar improvement can also be observed on x86 platforms.
>>
>> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
> Rename the indexInRange API and simplify the benchmarks
Hi Paul,
I updated the API name as you suggested and simplified the benchmarks by removing the calls to the masked `fromArray()/intoArray()` APIs.
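For reference, here is a rough, hypothetical sketch (not the actual patch) of what such a simplified benchmark kernel could look like once the masked `fromArray()/intoArray()` calls are gone, so that only the mask computation itself is measured; the class and field names below are assumptions for illustration:

import jdk.incubator.vector.*;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class IndexInRangeSketch {
    static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_PREFERRED;

    @Param({"7", "256", "259", "512"})
    int size;

    VectorMask<Integer> inputMask;

    @Setup(Level.Trial)
    public void setup() {
        inputMask = VectorMask.fromLong(ISPECIES, -1L); // all lanes set
    }

    @Benchmark
    public void intIndexInRange(Blackhole bh) {
        for (int i = 0; i < size; i += ISPECIES.length()) {
            // Only the mask computation is exercised; no masked memory access.
            bh.consume(inputMask.indexInRange(i, size));
        }
    }
}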
Here are the benchmark results compared with jdk/master on ARM NEON:
Benchmark (size) Mode Cnt Before After Units
IndexInRangeBenchmark.byteIndexInRange 7 thrpt 5 164957.447 188954.757 ops/ms
IndexInRangeBenchmark.byteIndexInRange 256 thrpt 5 28373.131 60895.091 ops/ms
IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 28290.365 55573.807 ops/ms
IndexInRangeBenchmark.byteIndexInRange 512 thrpt 5 15695.618 49147.370 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 7 thrpt 5 58926.711 87837.117 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 256 thrpt 5 2558.505 17795.100 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 2521.995 5309.487 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 512 thrpt 5 1289.556 8882.959 ops/ms
IndexInRangeBenchmark.floatIndexInRange 7 thrpt 5 113429.518 114530.506 ops/ms
IndexInRangeBenchmark.floatIndexInRange 256 thrpt 5 5681.129 31686.156 ops/ms
IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 5614.762 13659.272 ops/ms
IndexInRangeBenchmark.floatIndexInRange 512 thrpt 5 2897.391 17796.357 ops/ms
IndexInRangeBenchmark.intIndexInRange 7 thrpt 5 50990.391 125139.575 ops/ms
IndexInRangeBenchmark.intIndexInRange 256 thrpt 5 8444.632 31090.867 ops/ms
IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 8349.075 20258.705 ops/ms
IndexInRangeBenchmark.intIndexInRange 512 thrpt 5 4525.218 17555.370 ops/ms
IndexInRangeBenchmark.longIndexInRange 7 thrpt 5 77003.438 89592.650 ops/ms
IndexInRangeBenchmark.longIndexInRange 256 thrpt 5 3669.537 17455.742 ops/ms
IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 3672.086 11150.989 ops/ms
IndexInRangeBenchmark.longIndexInRange 512 thrpt 5 1883.831 8832.311 ops/ms
IndexInRangeBenchmark.shortIndexInRange 7 thrpt 5 159881.634 185593.426 ops/ms
IndexInRangeBenchmark.shortIndexInRange 256 thrpt 5 16762.736 50486.836 ops/ms
IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 16490.397 35110.418 ops/ms
IndexInRangeBenchmark.shortIndexInRange 512 thrpt 5 8815.322 31113.907 ops/ms
And the results with a 512-bit SVE vector size:
Benchmark (size) Mode Cnt Before After Units
IndexInRangeBenchmark.byteIndexInRange 7 thrpt 5 48977.004 62712.3874 ops/ms
IndexInRangeBenchmark.byteIndexInRange 256 thrpt 5 28005.444 36067.6281 ops/ms
IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 26833.661 33337.5660 ops/ms
IndexInRangeBenchmark.byteIndexInRange 512 thrpt 5 18621.850 26251.4372 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 7 thrpt 5 31556.967 63184.8951 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 256 thrpt 5 4394.624 22536.9730 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 4390.727 13714.7822 ops/ms
IndexInRangeBenchmark.doubleIndexInRange 512 thrpt 5 2358.633 15654.2022 ops/ms
IndexInRangeBenchmark.floatIndexInRange 7 thrpt 5 31507.582 62985.8334 ops/ms
IndexInRangeBenchmark.floatIndexInRange 256 thrpt 5 7873.270 25331.0291 ops/ms
IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 7733.960 22011.2921 ops/ms
IndexInRangeBenchmark.floatIndexInRange 512 thrpt 5 4392.090 21542.3555 ops/ms
IndexInRangeBenchmark.intIndexInRange 7 thrpt 5 55291.415 62846.4699 ops/ms
IndexInRangeBenchmark.intIndexInRange 256 thrpt 5 12580.224 25637.0236 ops/ms
IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 12815.614 23283.9921 ops/ms
IndexInRangeBenchmark.intIndexInRange 512 thrpt 5 7737.667 21611.9642 ops/ms
IndexInRangeBenchmark.longIndexInRange 7 thrpt 5 46632.264 63072.6243 ops/ms
IndexInRangeBenchmark.longIndexInRange 256 thrpt 5 6664.042 22541.1474 ops/ms
IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 6294.857 16994.0206 ops/ms
IndexInRangeBenchmark.longIndexInRange 512 thrpt 5 3446.688 15689.5675 ops/ms
IndexInRangeBenchmark.shortIndexInRange 7 thrpt 5 43243.398 63971.3060 ops/ms
IndexInRangeBenchmark.shortIndexInRange 256 thrpt 5 17997.651 27081.8088 ops/ms
IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 16572.132 30804.5928 ops/ms
IndexInRangeBenchmark.shortIndexInRange 512 thrpt 5 10211.183 21771.9652 ops/ms
Similar gains can also be observed on different x86 systems. From the results, we can see that there is not much of a performance gap between the 256 and 259 array sizes.
-------------
PR: https://git.openjdk.org/jdk/pull/12064